Lecture Notes in Artificial Intelligence Edited by J. G. Carbonell and J. Siekmann
Subseries of Lecture Notes in Computer Science
2639
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Guoyin Wang Qing Liu Yiyu Yao Andrzej Skowron (Eds.)
Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing 9th International Conference, RSFDGrC 2003 Chongqing, China, May 26-29, 2003 Proceedings
Volume Editors

Guoyin Wang
Chongqing University of Posts and Telecommunications
Institute of Computer Science and Technology
Chongqing, 400065, P.R. China
E-mail: [email protected]

Qing Liu
Nanchang University, Department of Computer Science
Nanchang, 330029, P.R. China
E-mail: [email protected]

Yiyu Yao
University of Regina, Department of Computer Science
Regina, Saskatchewan, S4S 0A2, Canada
E-mail: [email protected]

Andrzej Skowron
Warsaw University, Institute of Mathematics
Banacha 2, 02-097 Warsaw, Poland
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.

CR Subject Classification (1998): I.2, H.2.4, H.3, F.4.1, F.1, I.5, H.4
ISSN 0302-9743
ISBN 3-540-14040-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany
Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH
Printed on acid-free paper
SPIN: 10928639 06/3142 543210
Preface

This volume contains the papers selected for presentation at the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2003), held at Chongqing University of Posts and Telecommunications, Chongqing, P.R. China, May 26–29, 2003.

There were 245 submissions for RSFDGrC 2003, not counting the 2 invited keynote papers and 11 invited plenary papers. Apart from the 13 invited papers, 114 papers were accepted for RSFDGrC 2003 and are included in this volume, an acceptance rate of only 46.5%. On the basis of reviewer evaluations, these papers were divided into 39 regular oral presentation papers (each allotted 8 pages), 47 short oral presentation papers (each allotted 4 pages) and 28 poster presentation papers (each allotted 4 pages). Each paper was reviewed by three referees.

The conference is a continuation and expansion of the International Workshops on Rough Set Theory and Applications; in particular, this was the ninth meeting in the series and the first international conference. The aim of RSFDGrC 2003 was to bring together researchers from diverse fields of expertise in order to facilitate mutual understanding and cooperation, and to help in cooperative work aimed at new hybrid paradigms.

It is our great pleasure to dedicate this volume to Prof. Zdzislaw Pawlak, who first introduced the basic ideas and definitions of rough set theory over 20 years ago. Rough set theory has grown into a useful method in soft computing and has been applied in many artificial intelligence systems and research fields, such as data mining, machine learning, pattern recognition, uncertain reasoning, granular computing, intelligent decision-making, etc. Many international conferences now include rough sets in their lists of topics.

The main theme of the conference centered around rough set theory, fuzzy set theory, data mining technology, granular computing, and their applications. The papers contributed to this volume reflect advances in these areas and in closely related research areas, such as:
− Rough sets foundations, methods, and applications
− Fuzzy sets and systems
− Data mining
− Granular computing
− Neural networks
− Evolutionary computing
− Machine learning
− Pattern recognition and image processing
− Logics and reasoning
− Multi-agent systems
− Web intelligence
− Intelligent systems

We wish to express our gratitude to Profs. Zdzislaw Pawlak, Bo Zhang, and Ling Zhang for accepting our invitation to be keynote speakers at RSFDGrC 2003. We also wish to thank Profs. Hongxing Li, Tsau Young Lin, Sankar K. Pal, Lech Polkowski, Andrzej Skowron, Hideo Tanaka, Shusaku Tsumoto, Shoujue Wang, Michael Wong,
Yiyu Yao, and Yixin Zhong, who accepted our invitation to present plenary papers at this conference.

We wish to express our thanks to the Honorary Chairs, General Chairs, Program Chairs, and the members of the Advisory Board (Zdzislaw Pawlak, Lotfi A. Zadeh, Tsau Young Lin, Andrzej Skowron, Shusaku Tsumoto, Guoyin Wang, Qing Liu, Yiyu Yao, James Alpigini, Nick Cercone, Jerzy Grzymala-Busse, Akira Nakamura, Sankar Pal, James F. Peters, Lech Polkowski, Zbigniew Ras, Roman Slowinski, Lianhua Xiao, Bo Zhang, Ning Zhong, Yixin Zhong, and Wojciech Ziarko) for their kind contribution to and support of the scientific program and many other conference-related issues. We also acknowledge the help of all the reviewers in reviewing papers. We want to thank all the individuals who submitted valuable papers to the RSFDGrC 2003 conference and all the conference attendees. We also wish to express our thanks to Alfred Hofmann at Springer-Verlag for his support and cooperation.

We are grateful to our sponsors and supporters, the National Natural Science Foundation of China, Chongqing University of Posts and Telecommunications, the Municipal Education Committee of Chongqing, China, the Municipal Science and Technology Committee of Chongqing, China, and the Bureau of Information Industry of Chongqing, China, for their financial and organizational support. We also express our thanks to the Local Organizing Chair, Prof. Neng Nie, President of Chongqing University of Posts and Telecommunications, for his great help and support throughout the preparation of RSFDGrC 2003. Finally, we want to thank the secretaries of the conference, Yu Wu, Hong Tang, Li Yang, Guo Xu, Lan Yang, Hongwei Zhang, Xinyu Li, Yunfeng Li, Dongyun Hu, Mulan Zhang, Anbo Dong, Jiujiang An, Zhengren Qin, and Zheng Zheng, for their help in preparing the RSFDGrC 2003 proceedings and organizing the conference.

May 2003
Guoyin Wang Qing Liu Yiyu Yao Andrzej Skowron
RSFDGrC 2003 Conference Committee
Honorary Chairs: Zdzislaw Pawlak, Lotfi A. Zadeh
General Chairs: Tsau Young Lin, Andrzej Skowron, Shusaku Tsumoto
Program Chairs: Guoyin Wang, Qing Liu, Yiyu Yao
Local Chairs: Neng Nie, Guoyin Wang

Advisory Board: James Alpigini, Nick Cercone, Jerzy Grzymala-Busse, Akira Nakamura, Sankar Pal, James F. Peters, Lech Polkowski, Zbigniew Ras, Roman Slowinski, Lianhua Xiao, Bo Zhang, Ning Zhong, Yixin Zhong, Wojciech Ziarko

Local Committee: Juhua Jing, Haoran Liu, Yuxiu Song, Hong Tong, Yu Wu, Li Yang
Program Committee
Peter Apostoli, Malcolm Beynon, Hans Dieter Burkhard, Qingsheng Cai, Mihir Kr. Chakraborty, Andrzej Czyzewski, Jitender S. Deogun, Didier Dubois, Ivo Duentsch, Maria C. Fernandez, Guenter Gediga, Fernando Gomide, Salvatore Greco, Xiaohua Hu, Masahiro Inuiguchi, Jouni Jarvinen, Fan Jin, Janusz Kacprzyk, Daijin Kim, Jan Komorowski, Jacek Koronacki, Bozena Kostek, Marzena Kryszkiewicz, Churn-Jung Liau, Yuefeng Li, Pawan Lingras, Chunnian Liu, Jiming Liu, Zongtian Liu, Brien Maguire, Solomon Marcus, Benedetto Matarazzo, Ernestina Menasalvas-Ruiz, Michinori Nakata, Sadaaki Miyamoto, Mikhail Moshkov, Tetsuya Murai, Hung Son Nguyen, Ewa Orlowska, Piero Pagliani, Gheorghe Paun, Witold Pedrycz, Henri Prade, Mohamed Quafafou, Vijay Raghavan, Sheela Ramanna, Ron Shapira, Qiang Shen, Zhongzhi Shi, Jerzy Stefanowski, Jaroslav Stepaniuk, Zbigniew Suraj, Roman Swiniarski, Andrzej Szalas, Marcin Szczuka, Francis E.H. Tay, Helmut Thiele, Mihaela Ulieru, Alicja Wakulicz-Deja, Hui Wang, Anita Wasilewska, Michael Wong, Xindong Wu, Keming Xie, Jingtao Yao, Huanglin Zeng, Wenxiu Zhang, Zhi-Hua Zhou
Table of Contents
Keynote Papers

Flow Graphs and Decision Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Zdzislaw Pawlak

The Quotient Space Theory of Problem Solving . . . . . . . . . . . . . . . . . . . . . . . 11
Ling Zhang, Bo Zhang

Plenary Papers

Granular Computing (Structures, Representations, and Applications) . . . . 16
Tsau Young Lin

Rough Sets: Trends and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Andrzej Skowron, James F. Peters

A New Development on ANN in China – Biomimetic Pattern Recognition and Multi Weight Vector Neurons . . . . . . . . . . . . . . . . . . . . . . . . 35
Shoujue Wang

On Generalizing Rough Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Y.Y. Yao

Dual Mathematical Models Based on Rough Approximations in Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Hideo Tanaka

Knowledge Theory: Outline and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Y.X. Zhong

A Rough Set Paradigm for Unifying Rough Set Theory and Fuzzy Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Lech Polkowski

Extracting Structure of Medical Diagnosis: Rough Set Approach . . . . . . . . 78
Shusaku Tsumoto

A Kind of Linearization Method in Fuzzy Control System Modeling . . . . . 89
Hongxing Li, Jiayin Wang, Zhihong Miao

A Common Framework for Rough Sets, Databases, and Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
S.K.M. Wong, D. Wu

Rough Sets, EM Algorithm, MST and Multispectral Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Sankar K. Pal, Pabitra Mitra
Rough Sets Foundations and Methods Rough Mereology: A Survey of New Developments with Applications to Granular Computing, Spatial Reasoning and Computing with Words . . 106 Lech Polkowski A New Rough Sets Model Based on Database Systems . . . . . . . . . . . . . . . . . 114 Xiaohua Tony Hu, Tsau Young Lin, Jianchao Han A Rough Set and Rule Tree Based Incremental Knowledge Acquisition Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Zheng Zheng, Guoyin Wang, Yu Wu Comparison of Conventional and Rough K-Means Clustering . . . . . . . . . . . 130 Pawan Lingras, Rui Yan, Chad West An Application of Rough Sets to Monk’s Problems Solving . . . . . . . . . . . . . 138 Duoqian Miao, Lishan Hou Pre-topologies and Dynamic Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Piero Pagliani Rough Sets and Gradual Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 Salvatore Greco, Masahiro Inuiguchi, Roman Słowiński Explanation Oriented Association Mining Using Rough Set Theory . . . . . . 165 Y.Y. Yao, Y. Zhao, R. Brien Maguire Probabilistic Rough Sets Characterized by Fuzzy Sets . . . . . . . . . . . . . . . . . 173 Li-Li Wei, Wen-Xiu Zhang A View on Rough Set Concept Approximations . . . . . . . . . . . . . . . . . . . . . . . 181 Jan Bazan, Nguyen Hung Son, Andrzej Skowron, Marcin S. Szczuka Evaluation of Probabilistic Decision Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Wojciech Ziarko Query Answering in Rough Knowledge Bases . . . . . . . . . . . . . . . . . . . . . . . . . 197 Aida Vitória, Carlos Viegas Damásio, Jan Małuszyński Upper and Lower Recursion Schemes in Abstract Approximation Spaces . 205 Peter Apostoli, Akira Kanda Adaptive Granular Control of an HVDC System: A Rough Set Approach . . 213 James F. Peters, H. Feng, Sheela Ramanna
Rough Set Approach to Domain Knowledge Approximation . . . . . . . . . . . . 221 Tuan Trung Nguyen, Andrzej Skowron Reasoning Based on Information Changes in Information Maps . . . . . . . . . 229 Andrzej Skowron, Piotr Synak
Characteristics of Accuracy and Coverage in Rule Induction . . . . . . . . . . . . 237 Shusaku Tsumoto Interpretation of Rough Neural Networks as Emergent Model . . . . . . . . . . . 245 Yasser Hassan, Eiichiro Tazaki Using Fuzzy Dependency-Guided Attribute Grouping in Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 Richard Jensen, Qiang Shen Conjugate Information Systems: Learning Cognitive Concepts in Rough Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Maria Semeniuk-Polkowska, Lech Polkowski A Rule Induction Method of Plant Disease Description Based on Rough Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Ai-Ping Li, Gui-Ping Liao, Quan-Yuan Wu Rough Set Data Analysis Algorithms for Incomplete Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 K.S. Chin, Jiye Liang, Chuangyin Dang Inconsistency Classification and Discernibility-Matrix-Based Approaches for Computing an Attribute Core . . . . . . . . . . . . . . . . . . . . . . . . 269 Dongyi Ye, Zhaojiong Chen Multi-knowledge Extraction and Application . . . . . . . . . . . . . . . . . . . . . . . . . 274 QingXiang Wu, David Bell Multi-rough Sets Based on Multi-contexts of Attributes . . . . . . . . . . . . . . . . 279 Rolly Intan, Masao Mukaidono Approaches to Approximation Reducts in Inconsistent Decision Tables . . . 283 Ju-Sheng Mi, Wei-Zhi Wu, Wen-Xiu Zhang Degree of Dependency and Quality of Classification in the Extended Variable Precision Rough Sets Model . . . . . . . . . . . . . . . . . . . . . . . 287 Malcolm J. Beynon Approximate Reducts of an Information System . . . . . . . . . . . . . . . . . . . . . . 291 Tien-Fang Kuo, Yasutoshi Yajima A Rough Set Methodology to Support Learner Self-Assessment in Web-Based Distance Education . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Hongyan Geng, R. Brien Maguire A Synthesis of Concurrent Systems: A Rough Set Approach . . . . . . . . . . . . 299 Zbigniew Suraj, Krzysztof Pancerz Towards a Line-Crawling Robot Obstacle Classification System: A Rough Set Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 James F. Peters, Sheela Ramanna, Marcin S. Szczuka
Order Based Genetic Algorithms for the Search of Approximate Entropy Reducts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 Dominik Ślęzak, Jakub Wróblewski Variable Precision Bayesian Rough Set Model . . . . . . . . . . . . . . . . . . . . . . . . 312 Dominik Ślęzak, Wojciech Ziarko Linear Independence in Contingency Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 Shusaku Tsumoto The Information Entropy of Rough Relational Databases . . . . . . . . . . . . . . . 320 Yuefei Sui, Youming Xia, Ju Wang A T-S Type of Rough Fuzzy Control System and Its Implementation . . . . 325 Jinjie Huang, Shiyong Li, Chuntao Man Rough Mereology in Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . . 329 Cungen Cao, Yuefei Sui, Zaiyue Zhang Rough Set Methods for Constructing Support Vector Machines . . . . . . . . . 334 Yuancheng Li, Tingjian Fang The Lattice Property of Fuzzy Rough Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 Fenglan Xiong, Xiangqian Ding, Yuhai Liu Querying Data from RRDB Based on Rough Sets Theory . . . . . . . . . . . . . . 342 Qiusheng An, Guoyin Wang, Junyi Shen, Jiucheng Xu An Inference Approach Based on Rough Sets . . . . . . . . . . . . . . . . . . . . . . . . . 346 Fuyan Liu, Shaoyi Lu Classification Using the Variable Precision Rough Set . . . . . . . . . . . . . . . . . . 350 Yongqiang Zhao, Hongcai Zhang, Quan Pan An Illustration of the Effect of Continuous Valued Discretisation in Data Analysis Using VPRSβ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354 Malcolm J. Beynon
Fuzzy Sets and Systems Application of Fuzzy Control Base on Changeable Universe to Superheated Steam Temperature Control System . . . . . . . . . . . . . . . . . . . . . 358 Keming Xie, Fang Wang, Gang Xie, Tsau Young Lin Application of Fuzzy Support Vector Machines in Short-Term Load Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Yuancheng Li, Tingjian Fang A Symbolic Approximate Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 Mazen El-Sayed, Daniel Pacholczyk
Intuition in Soft Decision Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374 Kankana Chakrabarty Ammunition Supply Decision-Making System Design Based on Fuzzy Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 Deyong Zhao, Xinfeng Wang, Jianguo Liu The Concept of Approximation Based on Fuzzy Dominance Relation in Decision-Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382 Yunxiang Liu, Jigui Sun, Sheng-sheng Wang An Image Enhancement Arithmetic Research Based on Fuzzy Set and Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386 Liang Ming, Guihai Xie, Yinlong Wang A Study on a Generalized FCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 Jian Yu, Miin-shen Yang Fuzzy Multiple Synapses Neural Network and Fuzzy Clustering . . . . . . . . . 394 Kai Li, Houkuan Huang, Jian Yu On Possibilistic Variance of Fuzzy Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 398 Wei-Guo Zhang, Zan-Kan Nie
Granular Computing Deductive Data Mining. Mathematical Foundation of Database Mining . . 403 Tsau Young Lin Information Granules for Intelligent Knowledge Structures . . . . . . . . . . . . . 405 Patrick Doherty, Witold Łukaszewicz, Andrzej Szałas Design and Implement for Diagnosis Systems of Hemorheology on Blood Viscosity Syndrome Based on GrC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413 Qing Liu, Feng Jiang, Dayong Deng Granular Reasoning Using Zooming In & Out . . . . . . . . . . . . . . . . . . . . . . . . 421 T. Murai, G. Resconi, M. Nakata, Y. Sato A Pure Mereological Approach to Roughness . . . . . . . . . . . . . . . . . . . . . . . . . 425 Bo Chen, Mingtian Zhou
Neural Networks and Evolutionary Computing Knowledge Based Descriptive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 430 J.T. Yao Genetically Optimized Rule-Based Fuzzy Polynomial Neural Networks: Synthesis of Computational Intelligence Technologies . . . . . . . . 437 Sung-Kwun Oh, James F. Peters, Witold Pedrycz, Tae-Chon Ahn
Ant Colony Optimization for Navigating Complex Labyrinths . . . . . . . . . . 445 Zhong Yan, Chun-Wie Yuan An Improved Quantum Genetic Algorithm and Its Application . . . . . . . . . 449 Gexiang Zhang, Weidong Jin, Na Li Intelligent Generation of Candidate Sets for Genetic Algorithms in Very Large Search Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453 Julia R. Dunphy, Jose J. Salcedo, Keri S. Murphy Fast Retraining of Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 458 Dumitru-Iulian Nastac, Razvan Matei Fuzzy-ARTMAP and Higher-Order Statistics Based Blind Equalization . . 462 Dong-kun Jee, Jung-sik Lee, Ju-Hong Lee Comparison of BPL and RBF Network in Intrusion Detection System . . . 466 Chunlin Zhang, Ju Jiang, Mohamed Kamel Back Propagation with Randomized Cost Function for Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471 H.A. Babri, Y.Q. Chen, Kamran Ahsan
Data Mining, Machine Learning, and Pattern Recognition Selective Ensemble of Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 Zhi-Hua Zhou, Wei Tang A Maximal Frequent Itemset Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484 Hui Wang, Qinghua Li, Chuanxiang Ma, Kenli Li On Data Mining for Direct Marketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491 Chuangxin Ou, Chunnian Liu, Jiajing Huang, Ning Zhong A New Incremental Maintenance Algorithm of Data Cube . . . . . . . . . . . . . 499 Hongsong Li, Houkuan Huang, Youfang Lin Data Mining for Motifs in DNA Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 507 David Bell, J.W. Guan Maximum Item First Pattern Growth for Mining Frequent Patterns . . . . 515 Hongjian Fan, Ming Fan, Bingzheng Wang Extended Random Sets for Knowledge Discovery in Information Systems . 524 Yuefeng Li Research on a Union Algorithm of Multiple Concept Lattices . . . . . . . . . . . 533 Zongtian Liu, Liansheng Li, Qing Zhang
A Theoretical Framework for Knowledge Discovery in Databases Based on Probabilistic Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541 Ying Xie, Vijay V. Raghavan An Improved Branch & Bound Algorithm in Feature Selection . . . . . . . . . . 549 Zhenxiao Wang, Jie Yang, Guozheng Li Classification of Caenorhabditis Elegans Behavioural Phenotypes Using an Improved Binarization Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 Won Nah, Joong-Hwan Baek Consensus versus Conflicts – Methodology and Applications . . . . . . . . . . . . 565 Ngoc Thanh Nguyen, Janusz Sobecki Interpolation Techniques for Geo-spatial Association Rule Mining . . . . . . . 573 Dan Li, Jitender Deogun, Sherri Harms Imprecise Causality in Mined Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581 Lawrence J. Mazlack Sphere-Structured Support Vector Machines for Multi-class Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589 Meilin Zhu, Yue Wang, Shifu Chen, Xiangdong Liu HIPRICE-A Hybrid Model for Multi-agent Intelligent Recommendation . . 594 ZhengYu Gong, Jing Shi, HangPing Qiu A Database-Based Job Management System . . . . . . . . . . . . . . . . . . . . . . . . . . 598 Ji-chuan Zheng, Zheng-guo Hu, Liang-liang Xing Optimal Choice of Parameters for a Density-Based Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603 Wenyan Gan, Deyi Li An Improved Parameter Tuning Method for Support Vector Machines . . . 607 Yong Quan, Jie Yang Approximate Algorithm for Minimization of Decision Tree Depth . . . . . . . 611 Mikhail J. Moshkov Virtual Reality Representation of Information Systems and Decision Rules: An Exploratory Technique for Understanding Data and Knowledge Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615 Julio J. Valdés Hierarchical Clustering Algorithm Based on Neighborhood-Linked in Large Spatial Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 Yi-hong Dong Unsupervised Learning of Pattern Templates from Unannotated Corpora for Proper Noun Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623 Seung-Shik Kang, Chong-Woo Woo
Approximate Aggregate Queries with Guaranteed Error Bounds . . . . . . . . 627 Seok-Ju Chun, Ju-Hong Lee, Seok-Lyong Lee Improving Classification Performance by Combining Multiple TAN Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631 Hongbo Shi, Zhihai Wang, Houkuan Huang Image Recognition Using Adaptive Fuzzy Neural Network and Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 Huanglin Zeng, Yao Yi SOM Based Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 Yuan Jiang, Ke-Jia Chen, Zhi-Hua Zhou User’s Interests Navigation Model Based on Hidden Markov Model . . . . . . 644 Jing Shi, Fang Shi, HangPing Qiu Successive Overrelaxation for Support Vector Regression . . . . . . . . . . . . . . . 648 Yong Quan, Jie Yang, Chenzhou Ye Statistic Learning and Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 Xian Rao, Cun-xi Dong, Shao-quan Yang A New Association Rules Mining Algorithms Based on Directed Itemsets Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660 Lei Wen, Minqiang Li A Distributed Multidimensional Data Model of Data Warehouse . . . . . . . . 664 Youfang Lin, Houkuan Huang, Hongsong Li
Logics and Reasoning An Overview of Hybrid Possibilistic Reasoning . . . . . . . . . . . . . . . . . . . . . . . 668 Churn-Jung Liau Critical Remarks on the Computational Complexity in Probabilistic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676 S.K.M. Wong, D. Wu, Y.Y. Yao Critical Remarks on the Maximal Prime Decomposition of Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682 Cory J. Butz, Qiang Hu, Xue Dong Yang A Non-local Coarsening Result in Granular Probabilistic Networks . . . . . 686 Cory J. Butz, Hong Yao, Howard J. Hamilton Probabilistic Inference on Three-Valued Logic . . . . . . . . . . . . . . . . . . . . . . . . 690 Guilin Qi Multi-dimensional Observer-Centred Qualitative Spatial-temporal Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694 Yi-nan Lu, Sheng-sheng Wang, Sheng-xian Sha
Multi-agent Systems Architecture Specification for Design of Agent-Based System in Domain View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697 S.K. Lee, Taiyun Kim Adapting Granular Rough Theory to Multi-agent Context . . . . . . . . . . . . . 701 Bo Chen, Mingtian Zhou How to Choose the Optimal Policy in Multi-agent Belief Revision? . . . . . . 706 Yang Gao, Zhaochun Sun, Ning Li
Web Intelligence and Intelligent Systems Research of Atomic and Anonymous Electronic Commerce Protocol . . . . . 711 Jie Tang, Juan-Zi Li, Ke-Hong Wang, Yue-Ru Cai Colored Petri Net Based Attack Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715 Shijie Zhou, Zhiguang Qin, Feng Zhang, Xianfeng Zhang, Wei Chen, Jinde Liu Intelligent Real-Time Traffic Signal Control Based on a Paraconsistent Logic Program EVALPSN . . . . . . . . . . . . . . . . . . . . . . . . . . . 719 Kazumi Nakamatsu, Toshiaki Seno, Jair Minoro Abe, Atsuyuki Suzuki Transporting CAN Messages over WATM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724 Ismail Erturk A Hybrid Intrusion Detection Strategy Used for Web Security . . . . . . . . . . 730 Bo Yang, Han Li, Yi Li, Shaojun Yang Mining Sequence Pattern from Time Series Based on Inter-relevant Successive Trees Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734 Haiquan Zeng, Zhan Shen, Yunfa Hu
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
Flow Graphs and Decision Algorithms

Zdzisław Pawlak

University of Information Technology and Management, ul. Newelska 6, 01-447 Warsaw, Poland
and
Chongqing University of Posts and Telecommunications, Chongqing, 400065, P.R. China
[email protected]
Abstract. In this paper we introduce a new kind of flow network, called a flow graph, different from the one proposed by Ford and Fulkerson. Flow graphs are meant to be used as a mathematical tool for the analysis of information flow in decision algorithms, in contrast to the material flow optimization considered in classical flow network analysis. In the proposed approach the branches of a flow graph are interpreted as decision rules, while the whole flow graph can be understood as a representation of a decision algorithm. The information flow in flow graphs is governed by Bayes' rule; however, in our case the rule does not have a probabilistic meaning and is entirely deterministic: it simply describes the information flow distribution in flow graphs. This property can be used to draw conclusions from data without referring to its probabilistic structure.
1 Introduction

The paper is concerned with a new kind of flow network, called a flow graph, different from the one proposed by Ford and Fulkerson [3]. The introduced flow graphs are intended to be used as a mathematical tool for information flow analysis in decision algorithms, in contrast to the material flow optimization considered in classical flow network analysis. In the proposed approach the branches of a flow graph are interpreted as decision rules, while the whole flow graph can be understood as a representation of a decision algorithm. It is revealed that the information flow in flow graphs is governed by Bayes' formula; however, in our case the rule does not have a probabilistic meaning and is entirely deterministic. It simply describes the information flow distribution in flow graphs, without referring to their probabilistic structure. Although Bayes' rule is fundamental to statistical reasoning, it has led to many philosophical discussions concerning its validity and meaning, and has caused much criticism [1], [2]. In our setting, besides having a very simple mathematical form, Bayes' rule is free from its mystic flavor.
This paper is a continuation of the author's ideas presented in [6], [7], [8], where the relationship between Bayes' rule and flow graphs was introduced and studied. From a theoretical point of view the presented approach can be seen as a generalization of Łukasiewicz's ideas [4]; he first proposed to express probability in logical terms, claiming that probability is a property of propositional functions and can be replaced by truth values belonging to the interval <0,1>. In the flow graph setting the truth values, and consequently probabilities, are interpreted as flow intensities in the branches of a flow graph. Besides, this leads to simple computational algorithms and a new interpretation of decision algorithms.

The paper is organized as follows. First, the concept of a flow graph is introduced. Next, the information flow distribution in the graph is defined and its relationship with Bayes' formula is revealed. Further, the simplification of flow graphs is considered and the relationship between flow graphs and decision algorithms is analyzed. Finally, statistical independence and dependence between nodes are defined and studied. All concepts are illustrated by simple tutorial examples.
2 Flow Graphs

A flow graph is a directed, acyclic, finite graph G = (N, B, σ), where N is a set of nodes, B ⊆ N × N is a set of directed branches, and σ: B → <0,1> is a flow function. The input of x ∈ N is the set I(x) = {y ∈ N : (y, x) ∈ B}; the output of x ∈ N is defined as O(x) = {y ∈ N : (x, y) ∈ B}; and σ(x, y) is called the strength of (x, y). The input and output of the graph G are defined as I(G) = {x ∈ N : I(x) = ∅} and O(G) = {x ∈ N : O(x) = ∅}, respectively. Inputs and outputs of G are external nodes of G; other nodes are internal nodes of G.

With every node x of a flow graph G we associate its inflow and outflow, defined as

σ+(x) = Σ_{y∈I(x)} σ(y, x),  σ−(x) = Σ_{y∈O(x)} σ(x, y),

respectively. We assume that for any internal node x, σ+(x) = σ−(x) = σ(x), where σ(x) is the throughflow of x. The inflow and outflow of G are defined as

σ+(G) = Σ_{x∈I(G)} σ−(x),  σ−(G) = Σ_{x∈O(G)} σ+(x),

respectively. Obviously σ+(G) = σ−(G) = σ(G), where σ(G) is the throughflow of G. Moreover, we assume that σ(G) = 1. The above formulas can be considered as flow conservation equations [3].
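These definitions translate directly into a small data structure. The following minimal Python sketch (ours, with illustrative branch flows) computes I(x), O(x), inflow, outflow and throughflow, and checks the conservation of the total flow:

```python
class FlowGraph:
    def __init__(self, branches):
        # branches: dict mapping a branch (x, y) to its flow sigma(x, y)
        self.sigma = dict(branches)
        self.nodes = {x for x, _ in branches} | {y for _, y in branches}

    def inputs(self, x):
        # I(x) = {y : (y, x) in B}
        return {y for (y, z) in self.sigma if z == x}

    def outputs(self, x):
        # O(x) = {y : (x, y) in B}
        return {y for (z, y) in self.sigma if z == x}

    def inflow(self, x):
        # sigma_plus(x): sum of sigma(y, x) over y in I(x)
        return sum(self.sigma[(y, x)] for y in self.inputs(x))

    def outflow(self, x):
        # sigma_minus(x): sum of sigma(x, y) over y in O(x)
        return sum(self.sigma[(x, y)] for y in self.outputs(x))

    def throughflow(self, x):
        # for internal nodes inflow == outflow == sigma(x); input nodes
        # of the graph have no inflow, output nodes have no outflow
        return self.outflow(x) if not self.inputs(x) else self.inflow(x)


# an illustrative two-layer graph with total flow sigma(G) = 1
g = FlowGraph({("x1", "y1"): 0.3, ("x1", "y2"): 0.2,
               ("x2", "y1"): 0.1, ("x2", "y2"): 0.4})
assert abs(sum(g.outflow(x) for x in ("x1", "x2")) - 1.0) < 1e-9  # sigma(G) = 1
```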
3 Certainty and Coverage Factors

With every branch of a flow graph we associate certainty and coverage factors [9], [10]. The certainty and the coverage of (x, y) are defined as

cer(x, y) = σ(x, y)/σ(x)  and  cov(x, y) = σ(x, y)/σ(y),

respectively, where σ(x) is the throughflow of x. Below some properties, which are immediate consequences of the definitions given above, are presented:

Σ_{y∈O(x)} cer(x, y) = 1, (1)
Σ_{x∈I(y)} cov(x, y) = 1, (2)
cer(x, y) = cov(x, y) σ(y) / σ(x), (3)
cov(x, y) = cer(x, y) σ(x) / σ(y). (4)

Obviously the above properties have a probabilistic flavor; e.g., equations (3) and (4) are Bayes' formulas. However, these properties can be interpreted in a deterministic way: they simply describe the flow distribution among the branches of the network. Notice that the Bayes' formulas given above have a new interpretation, which leads to simple computations and gives new insight into the Bayesian methodology.

Example 1: Suppose that three models of cars x1, x2 and x3 are sold to three disjoint groups of customers z1, z2 and z3 through four dealers y1, y2, y3 and y4. Moreover, let us assume that car models and dealers are distributed as shown in Fig. 1.
Fig. 1. Cars and dealers distribution
Computing the strength, certainty and coverage factors for each branch, we get the results shown in Fig. 2.
Fig. 2. Strength, certainty and coverage factors
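In code, the certainty and coverage factors are just ratios of branch strength to throughflow. The sketch below uses illustrative branch flows as stand-ins (the actual values are in Fig. 1, which is not reproduced here) and checks properties (1)-(3):

```python
# illustrative branch strengths for a small two-layer flow graph
sigma = {("x1", "y1"): 0.12, ("x1", "y2"): 0.07,
         ("x2", "y1"): 0.08, ("x2", "y2"): 0.73}

def throughflow(node):
    # outflow for non-output nodes, inflow for output nodes
    out = sum(v for (a, b), v in sigma.items() if a == node)
    return out if out else sum(v for (a, b), v in sigma.items() if b == node)

def cer(x, y):
    return sigma[(x, y)] / throughflow(x)   # cer(x, y) = sigma(x, y)/sigma(x)

def cov(x, y):
    return sigma[(x, y)] / throughflow(y)   # cov(x, y) = sigma(x, y)/sigma(y)

# rows of cer sum to 1 (1), columns of cov sum to 1 (2),
# and the Bayes identity (3) holds by construction
assert abs(cer("x1", "y1") + cer("x1", "y2") - 1) < 1e-9
assert abs(cov("x1", "y1") + cov("x2", "y1") - 1) < 1e-9
assert abs(cer("x1", "y1") -
           cov("x1", "y1") * throughflow("y1") / throughflow("x1")) < 1e-9
```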
4 Paths and Connections

A (directed) path from x to y, x ≠ y, is a sequence of nodes x1, ..., xn such that x1 = x, xn = y and (xi, xi+1) ∈ B for every i, 1 ≤ i ≤ n−1. A path from x to y is denoted by [x, y].

The certainty of a path [x1, xn] is defined as

cer[x1, xn] = Π_{i=1}^{n−1} cer(xi, xi+1), (5)

the coverage of a path [x1, xn] is

cov[x1, xn] = Π_{i=1}^{n−1} cov(xi, xi+1), (6)

and the strength of a path [x, y] is

σ[x, y] = σ(x) cer[x, y] = σ(y) cov[x, y]. (7)

The set of all paths from x to y (x ≠ y), denoted <x, y>, will be called a connection from x to y. In other words, the connection <x, y> is a sub-graph determined by the nodes x and y. The certainty of the connection <x, y> is

cer<x, y> = Σ_{[x,y]∈<x,y>} cer[x, y], (8)

the coverage of the connection <x, y> is

cov<x, y> = Σ_{[x,y]∈<x,y>} cov[x, y], (9)

and the strength of the connection <x, y> is

σ<x, y> = Σ_{[x,y]∈<x,y>} σ[x, y]. (10)

Let x, y (x ≠ y) be nodes of G. If we substitute the sub-graph <x, y> by a single branch (x, y) such that σ(x, y) = σ<x, y>, then cer(x, y) = cer<x, y>, cov(x, y) = cov<x, y> and σ(G) = σ(G′), where G′ is the graph obtained from G by substituting <x, y> by (x, y).

Example 1 (cont.): In order to find how car models are distributed among customer groups we have to compute all connections between car models and consumer groups. The results are shown in Fig. 3.
Fig. 3. Relation between car models and consumer groups
For example, we can see from the flow graph that consumer group z2 bought 21% of car model x1, 35% of car model x2 and 44% of car model x3. Conversely, car model x1, for example, is distributed among the customer groups as follows: 31% of its cars were bought by group z1, 57% by group z2 and 12% by group z3.
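Formulas (5)-(10) compute mechanically: a path multiplies branch certainties, and a connection sums over its paths. A small sketch with illustrative branch certainties:

```python
from functools import reduce

def path_cer(path, cer):
    # (5): cer[x1..xn] is the product of cer(xi, xi+1) along the path
    return reduce(lambda acc, edge: acc * cer[edge],
                  zip(path, path[1:]), 1.0)

def connection_cer(paths, cer):
    # (8): sum of path certainties over all paths from x to y
    return sum(path_cer(p, cer) for p in paths)

# illustrative certainties on a three-layer graph x -> {y1, y2} -> z
cer = {("x", "y1"): 0.6, ("x", "y2"): 0.4,
       ("y1", "z"): 0.5, ("y2", "z"): 0.25}
paths = [["x", "y1", "z"], ["x", "y2", "z"]]
print(connection_cer(paths, cer))   # 0.6*0.5 + 0.4*0.25 = 0.4
```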
5 Decision Algorithms

With every branch (x, y) we associate a decision rule x → y, read "if x then y"; x will be referred to as the condition and y as the decision of the rule. Such a rule is characterized by three numbers: σ(x, y), cer(x, y) and cov(x, y).

Thus every path [x1, xn] determines a sequence of decision rules x1 → x2, x2 → x3, ..., xn−1 → xn. From previous considerations it follows that this sequence of decision rules can be interpreted as a single decision rule x1 x2 ... xn−1 → xn, in short x* → xn, where x* = x1 x2 ... xn−1, characterized by

cer(x*, xn) = cer[x1, xn], (11)
cov(x*, xn) = cov[x1, xn], (12)

and

σ(x*, xn) = σ(x1) cer[x1, xn] = σ(xn) cov[x1, xn]. (13)

Similarly, every connection <x, y> can be interpreted as a single decision rule x → y such that:

cer(x, y) = cer<x, y>, (14)
cov(x, y) = cov<x, y>, (15)

and

σ(x, y) = σ(x) cer<x, y> = σ(y) cov<x, y>. (16)

Let [x1, xn] be a path such that x1 is an input and xn an output of the flow graph G. Such a path and the corresponding connection <x1, xn> will be called complete. The set of all decision rules x_{i1} x_{i2} ... x_{in−1} → x_{in} associated with all complete paths [x_{i1}, x_{in}] will be called the decision algorithm induced by the flow graph. The set of all decision rules x_{i1} → x_{in} associated with all complete connections <x_{i1}, x_{in}> in the flow graph will be referred to as the combined decision algorithm determined by the flow graph.

Example 1 (cont.): The decision algorithm induced by the flow graph shown in Fig. 2 is given below:

Rule no.  Rule          Strength
1)        x1 y1 → z1    0.036
2)        x1 y1 → z2    0.072
3)        x1 y1 → z3    0.012
. . .
20)       x3 y4 → z1    0.025
21)       x3 y4 → z2    0.075
22)       x3 y4 → z3    0.150

For the sake of simplicity we gave only some of the decision rules of the decision algorithm; the interested reader can easily complete the remaining decision rules. Similarly, we can compute the certainty and coverage for each rule.

Remark 1: Due to round-off errors in computations, the equalities (1)–(16) may not be satisfied exactly in these examples.
The combined decision algorithm associated with the flow graph shown in Fig. 3 is given below:

Rule no.  Rule      Strength
1)        x1 → z1   0.06
2)        x1 → z2   0.11
3)        x1 → z3   0.02
4)        x2 → z1   0.06
5)        x2 → z2   0.18
6)        x2 → z3   0.06
7)        x3 → z1   0.10
8)        x3 → z2   0.23
9)        x3 → z3   0.18

This decision algorithm can be regarded as a simplification of the decision algorithm given previously and shows how car models are distributed among customer groups.
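Generating a decision algorithm from a flow graph is thus mechanical. The sketch below derives rules and their strengths via formula (13); the node flow and branch certainties are illustrative placeholders, not the values behind Fig. 2:

```python
from functools import reduce

def rules_from_paths(paths, sigma_node, cer):
    # each complete path x1 ... xn yields the rule x1 x2 ... x(n-1) -> xn
    # with strength sigma(x1) * cer[x1, xn], as in formula (13)
    rules = []
    for p in paths:
        path_cer = reduce(lambda a, e: a * cer[e], zip(p, p[1:]), 1.0)
        rules.append((" ".join(p[:-1]) + " -> " + p[-1],
                      sigma_node[p[0]] * path_cer))
    return rules

# illustrative three-layer graph x1 -> {y1, y2} -> z1
sigma_node = {"x1": 0.5}
cer = {("x1", "y1"): 0.4, ("x1", "y2"): 0.6,
       ("y1", "z1"): 1.0, ("y2", "z1"): 1.0}
paths = [["x1", "y1", "z1"], ["x1", "y2", "z1"]]
for rule, strength in rules_from_paths(paths, sigma_node, cer):
    print(rule, round(strength, 3))
# the two rule strengths (0.2 and 0.3) sum to the strength of the
# combined rule x1 -> z1, as formulas (14)-(16) require
```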
6 Independence of Nodes in Flow Graphs

Let x and y be nodes in a flow graph G = (N, B, σ) such that (x, y) ∈ B. Nodes x and y are independent in G if

σ(x, y) = σ(x) σ(y). (17)

From (17) we get

σ(x, y)/σ(x) = cer(x, y) = σ(y), (18)

and

σ(x, y)/σ(y) = cov(x, y) = σ(x). (19)

If

cer(x, y) > σ(y), (20)

or

cov(x, y) > σ(x), (21)

then y depends positively on x in G. Similarly, if

cer(x, y) < σ(y), (22)

or

cov(x, y) < σ(x), (23)

then y depends negatively on x in G. Let us observe that the relations of independence and dependence are symmetric, analogously to those used in statistics.
Example 1 (cont.): In the flow graphs presented in Fig. 2 and Fig. 3 there are no independent nodes whatsoever. However, e.g., the nodes x1, y1 are positively dependent, whereas the nodes y1, z3 are negatively dependent.

Example 2: Let X = {1, 2, ..., 8}, x ∈ X, and let a1 denote "x is divisible by 2" and a0 "x is not divisible by 2". Similarly, b1 stands for "x is divisible by 3" and b0 for "x is not divisible by 3". Because 50% of the elements of X are divisible by 2 and 50% are not, we assume σ(a1) = 1/2 and σ(a0) = 1/2. Similarly, σ(b1) = 1/4 and σ(b0) = 3/4, because 25% of the elements of X are divisible by 3 and 75% are not. The corresponding flow graph is presented in Fig. 4.
Fig. 4. Divisibility by “2” and “3”
The pairs of nodes (a0, b0), (a0, b1), (a1, b0) and (a1, b1) are independent because, e.g., cer(a0, b0) = σ(b0) (cov(a0, b0) = σ(a0)).

Example 3: Let X = {1, 2, ..., 8}, x ∈ X, and let a1 stand for "x is divisible by 2", a0 for "x is not divisible by 2", b1 for "x is divisible by 4" and b0 for "x is not divisible by 4". As in the previous example, σ(a0) = 1/2 and σ(a1) = 1/2; σ(b0) = 3/4 and σ(b1) = 1/4, because 75% of the elements of X are not divisible by 4 and 25% are. The flow graph associated with the above problem is shown in Fig. 5.
Fig. 5. Divisibility by “2” and “4”
The pairs of nodes (a0, b0), (a1, b0) and (a1, b1) are dependent. The pairs (a0, b0) and (a1, b1) are positively dependent, because cer(a0, b0) > σ(b0) (cov(a0, b0) > σ(a0)) and cer(a1, b1) > σ(b1) (cov(a1, b1) > σ(a1)). The nodes (a1, b0) are negatively dependent, because cer(a1, b0) < σ(b0) (cov(a1, b0) < σ(a1)).
For every branch (x, y) ∈ B we define a dependency factor η(x, y) as

η(x, y) = (cer(x, y) − σ(y)) / (cer(x, y) + σ(y)) = (cov(x, y) − σ(x)) / (cov(x, y) + σ(x)). (24)

Obviously −1 ≤ η(x, y) ≤ 1; η(x, y) = 0 if and only if cer(x, y) = σ(y) and cov(x, y) = σ(x); η(x, y) = −1 if and only if cer(x, y) = cov(x, y) = 0; η(x, y) = 1 if and only if σ(y) = σ(x) = 0.

It is easy to check that if η(x, y) = 0, then x and y are independent; if −1 ≤ η(x, y) < 0, then x and y are negatively dependent; and if 0 < η(x, y) ≤ 1, then x and y are positively dependent. Thus the dependency factor expresses a degree of dependency and can be seen as a counterpart of the correlation coefficient used in statistics.

For example, in the flow graph presented in Fig. 4 we have η(a0, b0) = 0, η(a0, b1) = 0, η(a1, b0) = 0 and η(a1, b1) = 0. However, in the flow graph shown in Fig. 5 we have η(a0, b0) = 1/7, η(a1, b0) = −1/5 and η(a1, b1) = 1/3. The meaning of the above results is obvious.
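The η values of Example 3 can be verified directly by counting, taking the strength of a branch (a, b) to be the fraction of X satisfying both conditions; the following short computation reproduces 1/7, −1/5 and 1/3:

```python
from fractions import Fraction

X = range(1, 9)  # X = {1, ..., 8}

def sigma(pred):
    # fraction of X satisfying a condition
    return Fraction(sum(1 for x in X if pred(x)), 8)

def eta(p, q):
    # formula (24), using cer(a, b) = sigma(a, b) / sigma(a)
    cer = sigma(lambda x: p(x) and q(x)) / sigma(p)
    return (cer - sigma(q)) / (cer + sigma(q))

div2 = lambda x: x % 2 == 0    # a1
ndiv2 = lambda x: x % 2 != 0   # a0
div4 = lambda x: x % 4 == 0    # b1
ndiv4 = lambda x: x % 4 != 0   # b0

print(eta(ndiv2, ndiv4))  # 1/7   (a0, b0): positive dependency
print(eta(div2, ndiv4))   # -1/5  (a1, b0): negative dependency
print(eta(div2, div4))    # 1/3   (a1, b1): positive dependency
```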
7 Conclusions

In this paper a relationship between flow graphs and decision algorithms has been defined and studied. It has been shown that the information flow in a decision algorithm can be represented as a flow in a flow graph. Moreover, the flow is governed by Bayes' formula; however, here Bayes' formula has an entirely deterministic meaning, without reference to its probabilistic nature. Besides, the formula takes a new simple form, which essentially simplifies the computations. This leads to many new applications and also gives new insight into the Bayesian philosophy.

Acknowledgement. Thanks are due to Professor Andrzej Skowron for critical remarks.
References
1. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Chichester, New York, Brisbane, Toronto, Singapore (1994)
2. Box, G.E.P., Tiao, G.C.: Bayesian Inference in Statistical Analysis. John Wiley and Sons, Inc., New York, Chichester, Brisbane, Toronto, Singapore (1992)
3. Ford, L.R., Fulkerson, D.R.: Flows in Networks. Princeton University Press, Princeton, New Jersey (1962)
4. Łukasiewicz, J.: Die logischen Grundlagen der Wahrscheinlichkeitsrechnung. Kraków (1913). In: Borkowski, L. (ed.): Jan Łukasiewicz – Selected Works. North Holland Publishing Company, Amsterdam, London; Polish Scientific Publishers, Warsaw (1970)
5. Greco, S., Pawlak, Z., Słowiński, R.: Generalized Decision Algorithms, Rough Inference Rules, and Flow Graphs. In: Alpigini, J.J., et al. (eds.): Lecture Notes in Artificial Intelligence 2475 (2002) 93–104
6. Pawlak, Z.: In Pursuit of Patterns in Data Reasoning from Data – The Rough Set Way. In: Alpigini, J.J., et al. (eds.): Lecture Notes in Artificial Intelligence 2475 (2002) 1–9
7. Pawlak, Z.: Rough Sets, Decision Algorithms and Bayes' Theorem. European Journal of Operational Research 136 (2002) 181–189
8. Pawlak, Z.: Decision Rules and Flow Networks (to appear)
9. Tsumoto, S., Tanaka, H.: Discovery of Functional Components of Proteins Based on PRIMEROSE and Domain Knowledge Hierarchy. In: Lin, T.Y., Wildberger, A.M. (eds.): Proceedings of the Workshop on Rough Sets and Soft Computing (RSSC-94), 1994. Soft Computing, SCS (1995) 280–285
10. Wong, S.K.M., Ziarko, W.: Algorithm for Inductive Learning. Bull. Polish Academy of Sciences 34, 5–6 (1986) 271–276
The Quotient Space Theory of Problem Solving¹

Ling Zhang¹,³ and Bo Zhang²,³

¹ Artificial Intelligence Institute, Anhui University, Hefei, Anhui, China 230039. [email protected]
² Department of Computer Science & Technology, Tsinghua University, Beijing, China 100084. [email protected]
³ State Key Lab of Intelligent Technology & Systems
Abstract. The talk introduces a framework of quotient space theory of problem solving. In the theory, a problem (or problem space) is represented as a triplet, including the universe, its structure and attributes. The worlds with different grain size are represented by a set of quotient spaces. The basic characteristics of different grain-size worlds are presented. Based on the model, the computational complexity of hierarchical problem solving is discussed.
1 Introduction

It is well known that one of the basic characteristics of human problem solving is the ability to conceptualize the world at different granularities and to translate easily from one abstraction level to another, i.e., to deal with them hierarchically [1]. In order to analyze and understand this human behavior, we presented a quotient space model in [2]. The model was intended to describe worlds of different grain sizes easily, and it can be used to analyze hierarchical problem-solving behavior conveniently. Based on the model, we have obtained several characteristics of hierarchical problem solving and developed a set of approaches for heuristic search, path planning, etc.; since these approaches can deal with a problem at different grain sizes, the computational complexity may be greatly reduced. The model can also be used to deal with the combination of information obtained from different grain-size worlds (different views), i.e., information fusion.

The theory and some recent well-known works on granular computing [3]–[11] have something in common, e.g., the "grain sizes" of the world are partitioned by equivalence relations, and a problem can be described under different grain sizes. But we mainly focus on the relationships among universes with different grain sizes and the translation among different knowledge bases, rather than on a single knowledge base. In our model, a problem (or space, world) is represented as a triplet, including the universe (space), the space structure, and attributes; that is, the structure of the space is represented explicitly. In the following discussion, it can be seen that the space structure is very important in the model. In this talk, we present the framework of our quotient space theory and the two basic characteristics of different grain-size worlds.

¹ Supported by the National Natural Science Foundation of China, Grant No. 60135010, and the National Key Foundation R&D Project, Grant No. G1998030509.
2 The Framework of Quotient Space Theory

Problem Representation at Different Granularities. The aim of representing a problem at different granularities is to enable the computer to solve the same problem at different grain sizes hierarchically. Suppose that a triplet (X, F, f) describes a problem space, or simply a space (X, F), where X denotes the universe, F is the structure of the universe X, and f indicates the attributes (or features) of the universe X.

Suppose that X represents the universe with the finest grain size. When we view the same universe X at a coarser grain size, we have a coarse-grained universe denoted by [X], and hence a new problem space ([X], [F], [f]). The coarser universe [X] can be defined by an equivalence relation R on X: an element of [X] is an equivalence class, i.e., a set of equivalent elements of X, so [X] consists of all the equivalence classes induced by R. From F and f we can define the corresponding [F] and [f]; then we have a new space ([X], [F], [f]), called a quotient space of (X, F, f).

Assume R is the set of all equivalence relations on X. Define a "coarse-fine" relation "<" on R as follows: for R1, R2 ∈ R, R2 < R1 if and only if xR1y implies xR2y for all x, y ∈ X.
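For a finite universe the quotient construction can be made concrete. A minimal sketch, where `same` stands for an arbitrary equivalence relation R:

```python
def quotient(X, same):
    # [X]: partition the universe X into equivalence classes under R,
    # where same(a, b) decides whether a R b holds
    classes = []
    for x in X:
        for c in classes:
            if same(x, c[0]):
                c.append(x)
                break
        else:
            classes.append([x])
    return classes

# granulating {0, ..., 9} by remainder mod 3 gives three coarse elements
print(quotient(range(10), lambda a, b: a % 3 == b % 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```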
3 Granulation Problem

Given a universe X and an equivalence relation R on X, how do we define [F] and [f]? Certainly, [F] and [f] should be defined so that the problem at hand can be solved in the model more efficiently. To this end, the general principle adopted is the following: if the original problem space (X, F, f) has a solution, then its corresponding coarse-grained space ([X], [F], [f]) should have a solution as well, although some information is lost in the coarse space. In other words, if the coarse-grained space has
no solution, then the original space must have no solution. We call this principle the "no-solution-preserving" property, or the "false-preserving" property for short. Thus, the structure [F] and attribute [f] should be constructed so as to guarantee the false-preserving property among worlds with different granularities. Certainly, if we cannot construct a quotient space that has the complete false-preserving property, we should at least construct one that possesses the property to some extent.

We now discuss the construction of the quotient structure [F] based on the above principle, taking the semi-order space as an example. Given a semi-order space X and an equivalence relation R on it, in order to guarantee the false-preserving property between the semi-order space X and its quotient space [X], a semi-order structure should be constructed in the corresponding quotient space. This is the so-called granulation problem, and it can be expressed as the following question: given a problem (X, F, f) and an equivalence relation R, how to construct ([X], [F], [f]).

Assume (X, F) is a semi-order space, i.e., there exists a relation F (<) among part of the elements of X satisfying (1) if x < y and y < z, then x < z.
simplified space of X, it is easier to find the potential solution regions, so the additional computation is quite limited. This fact underlies the power of hierarchical problem solving. The key to hierarchical problem solving is therefore to what extent (or with what probability) the potential solution regions can be found in the quotient space. This will depend on several factors, such as the structure of the quotient space, how many abstraction levels are used, etc. In [2] we show some conditions under which the computational complexity can be reduced by the hierarchical problem-solving strategy. The analysis shows that the efficiency of the hierarchical problem-solving strategy comes from the "false-preserving" property of quotient spaces.
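The false-preserving strategy can be sketched as a two-level search: test each coarse region first and descend only into regions that may contain a solution. The goal test and coarse test below are illustrative assumptions:

```python
def hierarchical_search(classes, goal, may_contain):
    for cls in classes:              # coarse level: quotient space [X]
        if not may_contain(cls):     # no solution in [X] => none in X here,
            continue                 # so the region is pruned without
        for x in cls:                # any fine-level work
            if goal(x):              # fine level: original space X
                return x
    return None

X = list(range(100))
classes = [X[i:i + 10] for i in range(0, 100, 10)]  # ten coarse regions
goal = lambda x: x == 73                            # illustrative target
may_contain = lambda cls: cls[0] <= 73 <= cls[-1]   # cheap coarse test
print(hierarchical_search(classes, goal, may_contain))  # 73
```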
4 Combination Problem

When we have two (coarse) observations (X1, F1, f1) and (X2, F2, f2) of the same problem space (X, F, f), how do we view the problem as a whole based on the two observations? This is the granule combination problem.

Combination Principle. Assume that (X1, F1, f1) and (X2, F2, f2) are two coarser problem spaces of (X, F, f). The combination space (X3, F3, f3) of the spaces (X1, F1, f1) and (X2, F2, f2) should satisfy:
(1) X1 and X2 are quotient spaces of X3;
(2) F1 and F2 are the quotient topologies of X1 and X2, respectively; F1 and F2 are the projections of F3 on X1 and X2, respectively; and F3 satisfies some optimality criteria as well.

If (X, F, f) is a semi-order space, the construction of (X3, F3, f3) is as follows:
(1) Find X3, the least upper bound of X1 and X2.
(2) Find F1R and F2R, the right-order topologies of F1 and F2, respectively.
(3) Find F3R, the least upper bound topology of F1R and F2R.
(4) Find F3, the semi-order structure corresponding to F3R.

If (X1, F1, f1) and (X2, F2, f2) are semi-order quotient spaces, then (X3, F3) above satisfies:
(1) F3 is a semi-order structure on X3;
(2) pi: (X3, F3) → (Xi, Fi), i = 1, 2, is order-preserving;
(3) F3 is the finest semi-order structure satisfying conditions (1) and (2).

Assume [a]1 and [a]2 are premises in X1 and X2, respectively, and [b]1 and [b]2 are conclusions in X1 and X2, respectively. Now we combine these two observations into one result. Assume the premise [a]3 and conclusion [b]3 in X3 are not empty, with [a]3 = [a]1 ∩ [a]2 and [b]3 = [b]1 ∩ [b]2. From the false-preserving property of quotient spaces, if the problem has no solution in one of the spaces X1 and X2, there is no solution in space X3. From the definition of the semi-order structure F3, it can be seen that if there exist solutions in both X1 and X2, there must exist a solution in space X3.

Assume (X, F, f) is an original space, (X1, F1, f1) and (X2, F2, f2) are its quotient spaces, and (X3, F3, f3) is the combination space of (X1, F1, f1) and (X2, F2, f2). If X3 = X and F3 = F, the spaces (X1, F1, f1) and (X2, F2, f2) are called a lossless compression of (X, F, f). This means that if there exist solutions in both X1 and X2, there must exist a solution in space X; this is called the "truth-preserving" property. It is noted that even though X3 = X, the structure F3 = F does not necessarily hold, so constructing a proper structure F3 for the combination space is important in granule combination.

In summary, "false preserving" and "truth preserving" are the two main properties of the quotient space model. They can be used for investigating hierarchical problem solving and granule combination.
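Step (1) of the construction, finding the least upper bound X3 of two quotient spaces, amounts to intersecting their equivalence classes; only this universe-level step is sketched below, since constructing F3 needs the topological machinery described above:

```python
def combine(partition1, partition2):
    # X3: the coarsest partition finer than both X1 and X2; its classes
    # are the nonempty intersections of X1-classes with X2-classes
    combined = []
    for c1 in partition1:
        for c2 in partition2:
            cell = [x for x in c1 if x in c2]
            if cell:
                combined.append(cell)
    return combined

X1 = [[1, 2, 3, 4], [5, 6, 7, 8]]   # one coarse view of X = {1, ..., 8}
X2 = [[1, 2, 5, 6], [3, 4, 7, 8]]   # another coarse view of the same X
print(combine(X1, X2))              # [[1, 2], [3, 4], [5, 6], [7, 8]]
```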
References
1. J.R. Hobbs: Granularity. In: Proc. of IJCAI, Los Angeles (1985) 432–435
2. Bo Zhang, Ling Zhang: Theory and Application of Problem Solving. North-Holland, Elsevier Science Publishers B.V. (1992)
3. L.A. Zadeh: Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems, Vol. 4 (1996) 103–111
4. L.A. Zadeh: Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, Vol. 90 (1997) 111–127
5. L.A. Zadeh: Announcement of GrC (1997). http://www.cs.uregina.ca/~yyao/GrC/
6. Z. Pawlak: Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, Boston, London (1991)
7. Y.Y. Yao: Granular computing: basic issues and possible solutions. In: Proc. of the Fifth Joint Conference on Information Sciences, Vol. I, Atlantic City, New Jersey, USA (2000) 186–189
8. Y.Y. Yao, X. Li: Comparison of rough-set and interval-set models for uncertain reasoning. Fundamenta Informaticae, Vol. 27 (1996) 289–298
9. Y.Y. Yao, S.K.M. Wong, L.S. Wang: A non-numeric approach to uncertain reasoning. International Journal of General Systems, Vol. 23 (1995) 343–359
10. Y.Y. Yao, Ning Zhong: Granular computing using information tables. In: T.Y. Lin, Y.Y. Yao, L.A. Zadeh (eds.): Data Mining, Rough Sets and Granular Computing. Physica-Verlag (2000) 102–124
11. Y.Y. Yao, J.T. Yao: Granular computing as a basis for consistent classification problems. Manuscript.
Granular Computing: Structures, Representations, and Applications

Tsau Young ("T.Y.") Lin

Department of Computer Science, San Jose State University, San Jose, California 95192
[email protected]
Abstract. The structure and representation theories of (crisp/fuzzy) granulations are presented. The results are applied to data mining, fuzzy control, security, etc.

Keywords: granular computing, rough set, binary relation, conflict of interests.
1 Introduction

– Granulation is very natural for human information processing: we granulate the human body into head, neck, etc., and geographic features into mountains, plains, etc. It is an ancient procedure.
– As for scientific treatments, Lotfi Zadeh's 1979 paper, "Fuzzy Sets and Information Granularity [32]," is perhaps the first one in modern times.¹

In [14] we said that "granulation . . . appears almost everywhere . . . however, the computing theory on information granulation has not been fully explored in its own right". The situation has changed vastly: special sessions in conferences, research papers in journals and chapters in books are growing rapidly [3], [1], [2], [13], [28], [29].
2 Formalization of Granulation

Let us start with a quote from Lotfi Zadeh's keynote speech at FUZZ-IEEE'96 [30]:

– "information granulation involves partitioning a class of objects (points) into granules, with a granule being a clump of objects (points) which are drawn together by indistinguishability, similarity or functionality."

¹ The term "granular computing" was first used by this author as the label of his research topic during his sabbatical leave at UC Berkeley in 1996–97 ([37], pp. 5), under the umbrella of granular mathematics.
Here we note that "drawn together" is a special case of "drawn towards some center points (objects)": if the "center points" are the whole clump, the two notions coincide. Now we paraphrase (and generalize) his words as follows (in terms of two universes):
– Information granulation is a collection of granules, with a granule being a clump of data (in the data space) which are drawn towards the center object(s) (in the object space) by indistinguishability, similarity or functionality.
Note that the data and object spaces can be the same.
2.1 Algebraic Formulation
Let P and Q be two objects. The phenomenon that P and Q are drawn together can be captured by a binary relation, that is, (P, Q) ∈ B. Note that we make no assumption of symmetry, so the binary relation does have a sense of direction. In this case, we will say Q is drawn toward P; we will call P the center.
– Information granulation is a collection, CB = {B^j | j ∈ J}, of crisp/fuzzy binary relations, where J is an index set that enumerates the different kinds of indistinguishability, similarity or functionality.
2.2 Geometric Formulation
Instead of modeling the action, "being drawn together," we examine the consequence. In other words, we consider the clumps of objects that are drawn to a certain object p [14]:
– An information granulation at p is a family, NS(p) = {N_p^j (⊆ U) | j ∈ J}, of crisp/fuzzy subsets (clumps) that are associated with p, where J is an index set as above.
Note that distinct objects may be associated with the same granule; the set of objects associated with the same granule is called the center of the granule. To help visualization, we will use geometric terms: an object is a point, and a granule associated with p is a neighborhood of p in space (the data or information space). We would like to stress that the terms granule and neighborhood include the implicit modifiers crisp/fuzzy. The family NS(p) of neighborhoods associated with p is called a neighborhood system at p. We write NS = {NS(p) | p ∈ V}. NS is a generalization of classical topological spaces [19,21,14,16].
Proposition 2.1. If each NS(p) contains at most one member at every point p ∈ V, then it defines a binary relation B uniquely, and vice versa (also uniquely). Such a neighborhood system has been called a binary neighborhood system (BNS).
Proposition 2.2. The collection CB defines a neighborhood system NS, and vice versa, but the converse is not unique.
Note that for each B^j ∈ CB, we have
NS^j : V −→ P(U) : p −→ NS^j(p) = {u | (p, u) ∈ B^j}, where P(U) is the crisp/fuzzy power set of U.
3 Taxonomy
We can classify the structures from several views:
1. Global/local granulation
a) Global granulation: the granulation is induced by binary relations.
b) Local granulation: the granulation is induced by a neighborhood system. The neighborhood systems of topological spaces are such examples.
2. Single/multiple/nested granulation
a) Single-level granulation (binary granulation): the granulation is induced by a binary neighborhood system, or by a binary relation; they are equivalent.
b) (Finite) multiple granulation: global granulation is induced by a (finite) set of binary relations; local granulation is induced by a neighborhood system (finitely many granules at each point). These two granulations are not equivalent; see Proposition 2.2.
c) Strongly/weakly nested (finite) multi-level granulation: local granulation is induced by a (finite) family of strongly/weakly nested binary neighborhood systems [22,23]. Here, strongly nested means every granule at level j is a union of some granules at level (j + 1). The strongness is automatically satisfied if the granulations are partitions; the so-called concept hierarchy is such an example. A nesting without such constraints is called weakly nested [15,14].
3. Full/partial participation
a) Full participation granulation (open granulation): this term is restricted to the case where the data and object spaces are the same, V = U. A granule is fully participated (open) if each of its members (points) is associated with the granule; in other words, the whole granule is the center. A granulation is fully participated if every granule is. Geometrically speaking, a fully participated neighborhood (granule) is an open neighborhood.
b) Partial participation granulation: this is the general case.
4 Granular Representations, Structures, and Applications
We will call the following 3-tuple and 2-tuple the granular structure and the granular representation, respectively (in [14], we included C in the granular structure):
GS = (V, U, NS) and (GS, C) = ((V, U, NS), C),
where V is called the object space, U is the data space or information space (V and U can be the same set), NS is a multiple granulation (geometrically, it is a neighborhood system), and C is the concept space, which consists of all the meaningful names of the granules (neighborhoods) of NS.
4.1 Binary Granular Structure and Single-Valued Representations
In this section, we will assume the NS is induced by a finite collection, B, of binary relations (or binary neighborhood systems).
1. The 4-tuple ((V, U, B), C) = (V, U, B, C) will be called a binary granular representation structure (BGRS). When V is finite, a BGRS can take table format; in this case, the BGRS is equivalent to a binary information table. This can be illustrated by Table 1: the map "−→" is single-valued, and hence, by composing it with the naming map, we have a single-valued representation. This is rather surprising; note that binary neighborhoods, as a covering, do overlap, so one would expect a multi-valued representation (see the next sections). The "trick" lies in the existence of "centers." See [15] for more details.
2. If U = V and B is a family of partitions, then the 4-tuple (V, V, B, C) becomes a 3-tuple, (V, B, C), called a rough representation structure (RRS). When V is finite, it takes table format and is equivalent to an information table; see [15]. If we use bits to represent granules (equivalence classes) as subsets of V, then the rough representation structure is a generalization of the database concept called bitmap indexes; see [18,4].
Induced Partitions. First, we observe that a binary relation (or binary neighborhood system) B : V −→ P(U); p ∈ V −→ B_p ⊆ V induces a partition, {B^{-1}(B_p) | p ∈ V}; we will call it the induced partition (equivalence relation) of B and denote it by E_B. The equivalence class [p]_{E_B} = B^{-1}(B_p), written simply as [p], is called the center of B_p. The center [p] consists of all those points that are mapped to the same set B_p ⊆ V; see the illustration in Table 1.

Table 1. "−→" represents a binary neighborhood system (binary relation); the last column represents the centers (induced partition)

Objects | Binary Neighborhood (B) | Center (E_B)
X | {F} | {X, W}
Y | {Z} | {Y, T}
Z | {Y, S} | {Z}
S | {Z, T} | {S}
T | {Z} | {Y, T}
F | {X, W} | {F}
W | {F} | {X, W}
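The induced partition is easy to compute; the following minimal sketch (our own illustration, not from the paper) reproduces the centers of Table 1 from the binary relation:

```python
# A minimal sketch (not from the paper) of a binary neighborhood system
# and its induced partition, reproducing Table 1.

# The binary relation B given as a map p -> B_p (the neighborhood of p).
B = {
    "X": frozenset({"F"}),
    "Y": frozenset({"Z"}),
    "Z": frozenset({"Y", "S"}),
    "S": frozenset({"Z", "T"}),
    "T": frozenset({"Z"}),
    "F": frozenset({"X", "W"}),
    "W": frozenset({"F"}),
}

def center(p):
    """[p] = B^{-1}(B_p): all points mapped to the same neighborhood as p."""
    return frozenset(q for q in B if B[q] == B[p])

induced_partition = {center(p) for p in B}
print(sorted(sorted(c) for c in induced_partition))
# [['F'], ['S'], ['X', 'W'], ['Y', 'T'], ['Z']]
```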
Applications of rough structures. If U = V and B is a finite family of partitions (equivalence relations), then the 3-tuple (U, U, B) is called a rough structure. Professor Liu and I have established some interesting results: the axiomatic rough set theory [25]. See further applications to data mining in [7] of this volume, and in [18,10,8,1].
Applications of binary granulations. We will mention some applications:
1. Approximate retrieval; see [19,21,14].
2. Article classification systems; see [33,14].
3. Conflict analysis and the Chinese Wall security policy; see [34,26,9,11,35,36].
4. Data mining on binary relations; see [22,23,24].
See further results in [14,15,12].
4.2 Formal Word Representations for Fuzzy Controls
Multi-valued Representations. In this section, we apply the same procedure to multiple granulation. For each object there are several granules, so we get a multi-valued representation. However, in the fuzzy world we can form a "weighted average representation," so the multi-valued table becomes a single-valued table; we will import some results from [16].

Table 2. Grades of membership functions of the fuzzy covering

Objects | A1(p) | A2(p) | A3(p) | A4(p)
p0 | 1.0 | 0.0 | 0.0 | 0.0
p1 | 0.67 | 0.33 | 0.0 | 0.0
p2 | 0.33 | 0.67 | 0.0 | 0.0
p3 | 0.0 | 1.0 | 0.0 | 0.0
p4 | 0.0 | 0.67 | 0.33 | 0.0
p5 | 0.0 | 0.33 | 0.67 | 0.0
p6 | 0.0 | 0.0 | 1.0 | 0.0
p7 | 0.0 | 0.0 | 0.67 | 0.33
p8 | 0.0 | 0.0 | 0.33 | 0.67
p9 | 0.0 | 0.0 | 0.0 | 1.0
Vector-Valued Representations – Formal Word Tables. By naming the granules (fuzzy sets A1, A2, A3, A4), we get a multi-valued representation. However, we can take a "weighted average" and get a single-valued representation; such representations are useful in fuzzy logic (see below). Let us consider the following formal expressions:
r1 ∗ Low + r2 ∗ Median Low + r3 ∗ Median High + r4 ∗ High
Table 3. Multi-valued representations

Objects | Neighborhood | Concepts
p0 | {A1} | Low = NAME(A1)
p1 | {A1, A2} | Low = NAME(A1), Median Low = NAME(A2)
p2 | {A1, A2} | Low = NAME(A1), Median Low = NAME(A2)
p3 | {A2} | Median Low = NAME(A2)
p4 | {A2, A3} | Median Low = NAME(A2), Median High = NAME(A3)
p5 | {A2, A3} | Median Low = NAME(A2), Median High = NAME(A3)
p6 | {A3} | Median High = NAME(A3)
p7 | {A3, A4} | Median High = NAME(A3), High = NAME(A4)
p8 | {A3, A4} | Median High = NAME(A3), High = NAME(A4)
p9 | {A4} | High = NAME(A4)
where r1, r2, r3, r4 are real numbers. Mathematically, the collection of all such expressions forms an abstract vector space over the real numbers. We will denote this vector space by FW(U). Each vector is called a formal word. Let w_i(p) represent the grade of A_i at p, i.e., A_i(p) = w_i(p). We will call w_i(p) the weight of p in A_i. Based on the weights w_i(p) and the multi-valued table representation (Tables 3 and 2), we form an FW-representation, a formal word representation, as follows:
W : V −→ FW; p −→ W(p), defined by
W(p) = Σ_i (w_i(p) ∗ NAME(A_i)).
In terms of the current example, we have
W(p) = w1(p) ∗ Low + w2(p) ∗ Median Low + w3(p) ∗ Median High + w4(p) ∗ High.
This expression is called the formal word representation of p; it is Zadeh's veristic constraint [31]. Table 4 consists of all such formal expressions; it is a vector-valued representation of the universe. Each expression represents a certain weighted sum of concepts; it succinctly expresses the overlapping semantics of neighborhoods.
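A minimal sketch of this formal word representation (assumed names, not from the paper), computing W(p) directly from the grades of Table 2:

```python
# A minimal sketch (assumed names, not from the paper) of the formal word
# representation W(p) = sum_i w_i(p) * NAME(A_i), using the grades of Table 2.

CONCEPTS = ["Low", "Median Low", "Median High", "High"]  # NAME(A1..A4)

# Membership grades (w1..w4) for p0..p9, copied from Table 2.
GRADES = [
    (1.0, 0.0, 0.0, 0.0), (0.67, 0.33, 0.0, 0.0), (0.33, 0.67, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0), (0.0, 0.67, 0.33, 0.0), (0.0, 0.33, 0.67, 0.0),
    (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.67, 0.33), (0.0, 0.0, 0.33, 0.67),
    (0.0, 0.0, 0.0, 1.0),
]

def formal_word(weights):
    """Render the formal word of one object, dropping zero-weight terms."""
    terms = [f"{w} * {name}" for w, name in zip(weights, CONCEPTS) if w > 0]
    return " + ".join(terms)

for k, weights in enumerate(GRADES):
    print(f"W(p{k}) = {formal_word(weights)}")
# e.g. W(p1) = 0.67 * Low + 0.33 * Median Low
```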
5 Conclusion
We hope the readers have been convinced that granular computing is a viable methodology. In this conclusion, we will overview the general guiding principles behind all these works. In large-scale computing, one may divide the computing task into three major steps: granulate and conquer, computing on the level of granules (mini computing with words), and information recovering (approximations by granules). Plainly, we reduce the problem, find a simplified solution, and recapture the solution within the original context:
Table 4. A single-attribute representation using formal words

Objects | Weighted fundamental concepts (veristic constraints)
p0 | W(0.0) = w1(0.0) ∗ Low
p1 | W(0.1) = (w1(0.1) ∗ Low) + (w2(0.1) ∗ Median Low)
p2 | W(0.2) = (w1(0.2) ∗ Low) + (w2(0.2) ∗ Median Low)
p3 | W(0.3) = w3(0.3) ∗ Median High
p4 | W(0.4) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
p5 | W(0.5) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
p6 | W(0.6) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
p7 | W(0.7) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
p8 | W(0.8) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
p9 | W(0.9) = (w1(p) ∗ Low) + (w2(p) ∗ Median Low) + (w3(p) ∗ Median High) + (w4(p) ∗ High)
1. Granulate and conquer.
2. Computing on granules – a mini computing with words: We would like to note that though words are symbols, computing with words is not the same as symbolic computing. Loosely speaking, symbolic computing is an "axiomatic" computing; all symbols in the computing system are determined by the axioms, and the computation follows the formal specification. Unfortunately, in many real-world problems such ideal environments are often unavailable, e.g., in non-linear controls. Computing with words is informal and fuzzy. The semantics of words are not precisely formalized and stored in the system; their semantics are captured and re-captured during the computing. To a certain degree, data processing is computed in this fashion: the semantics of individual data, such as attributes, are not stored in the system; human operators decide their semantics and process accordingly.
3. Information recapturing: Approximations are the major tools.
So far our works have assumed no information structure on each granule, so we have only investigated the "set theory" of granular computing. Many interesting works will result from imposing information structures on each granule.
References
1. Tsau Young Lin, Yiyu Yao, and Lotfi Zadeh, Data Mining, Rough Sets and Granular Computing, Physica-Verlag, Heidelberg, 2002.
2. Witold Pedrycz, Granular Computing, Physica-Verlag, Heidelberg, 2002.
3. Ning Zhong, Andrzej Skowron, Setsuo Ohsuga (eds.), New Directions in Rough Sets, Data Mining, and Granular-Soft Computing, Springer-Verlag, LNCS 1711, 1999.
4. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, Database Systems – The Complete Book, Prentice Hall, 2002, ISBN 0-13-031-995-3.
5. Date, C. J.: Introduction to Database Systems, 3rd, 7th edn., Addison-Wesley, Reading, Massachusetts (1981, 2000).
6. Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11, 1982, pp. 341–356.
7. T. Y. Lin, "Deductive Data Mining," In: Proceedings of the International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (RSFDGrC 2003), Chongqing, China, May 26–29, 2003, this volume.
8. T. Y. Lin, "Attribute (Feature) Completion – The Theory of Attributes from Data Mining Prospect," In: Proceedings of the International Conference on Data Mining, Maebashi, Japan, Dec 9–12, 2002, pp. 282–289.
9. T. Y. Lin, "Placing the Chinese Walls on the Boundary of Conflicts – Analysis of Symmetric Binary Relations," In: Proceedings of the International Conference on Computer Software and Applications, Oxford, England, Aug 26–29, 2002, pp. 966–971.
10. T. Y. Lin, "Database Mining on Derived Attributes – Granular and Rough Computing Approach," In: Rough Sets and Current Trends in Computing, Alpigini, Peters, Skowron, Zhong (eds.), Lecture Notes in Artificial Intelligence, 2002, pp. 14–32.
11. T. Y. Lin, "Granular Computing on Binary Relations – Analysis of Conflict and Chinese Wall Security Policy," In: Rough Sets and Current Trends in Computing, Alpigini, Peters, Skowron, Zhong (eds.), Lecture Notes in Artificial Intelligence No. 2475, 2002, pp. 296–299.
12. T. Y. Lin, "Granulation and Nearest Neighborhood: Rough Set Approach," In: Granular Computing: An Emerging Paradigm, W. Pedrycz (ed.), Physica-Verlag, 2001, pp. 125–142.
13. A. Bargiela, W. Pedrycz, and K. Hirota, "Logic-based Granular Prototyping," In: Proceedings of the International Conference on Computer Software and Applications, Oxford, England, Aug 26–29, 2002, pp. 1164–1169.
14. T. Y. Lin, "Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems," In: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds.), Springer-Verlag, 1998, 107–121.
15. T. Y. Lin, "Granular Computing on Binary Relations II: Rough Set Representations and Belief Functions," In: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds.), Springer-Verlag, 1998, 121–140.
16. T. Y. Lin, "Granular Computing: Fuzzy Logic and Rough Sets," In: Computing with Words in Information/Intelligent Systems, L. A. Zadeh and J. Kacprzyk (eds.), Springer-Verlag, 183–200, 1999.
17. T. Y. Lin, "Data Mining: Granular Computing Approach," In: Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 1574, Third Pacific-Asia Conference, Beijing, April 26–28, 1999, 24–33.
18. Tsau Young Lin, "Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, 2000, 113–124.
19. T. Y. Lin, Neighborhood Systems and Relational Database, In: Proceedings of the 1988 ACM Sixteenth Annual Computer Science Conference, February 23–25, 1988, 725.
20. T. Y. Lin, Topological Data Models and Approximate Retrieval and Reasoning, In: Proceedings of the 1989 ACM Seventeenth Annual Computer Science Conference, February 21–23, Louisville, Kentucky, 1989, 453.
21. T. Y. Lin, "Neighborhood Systems and Approximation in Database and Knowledge Base Systems," In: Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, Charlotte, North Carolina, October 12–15, pp. 75–86, 1989.
22. T. Y. Lin and M. Hadjimichael, Non-classificatory Generalization in Data Mining, In: Proceedings of the Fourth Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, Tokyo, Japan, November 8–10, 404–411, 1996.
23. T. Y. Lin, Ning Zhong, J. Duong, S. Ohsuga, "Frameworks for Mining Binary Relations in Data," In: Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence 1424, A. Skowron and L. Polkowski (eds.), Springer-Verlag, 1998, 387–393.
24. T. Y. Lin, "Generating Concept Hierarchies/Networks: Mining Additional Semantics in Relational Data," In: Advances in Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 2035, 2001, pp. 174–185 (5th Pacific-Asia Conference, Hong Kong, April 2001).
25. T. Y. Lin and Q. Liu, Rough Approximate Operators – Axiomatic Rough Set Theory, In: Rough Sets, Fuzzy Sets and Knowledge Discovery, W. Ziarko (ed.), Springer-Verlag, 256–260, 1994. Also in: Proceedings of the Second International Workshop on Rough Sets and Knowledge Discovery, Banff, Oct. 12–15, 255–257, 1993.
26. T. Y. Lin, "Chinese Wall Security Policy – An Aggressive Model," In: Proceedings of the Fifth Aerospace Computer Security Application Conference, December 4–8, 1989, pp. 286–293.
27. John L. Kelley, General Topology, Van Nostrand, 1955.
28. Yao, Y.Y. and Zhong, N., Granular Computing Using Information Tables, In: Data Mining, Rough Sets and Granular Computing, Lin, T.Y., Yao, Y.Y. and Zadeh, L.A. (eds.), Physica-Verlag, Heidelberg, pp. 102–124, 2002.
29. Yao, Y.Y. and Yao, J.T., Granular Computing as a Basis for Consistent Classification Problems, In: Proceedings of the PAKDD'02 Workshop on Foundations of Data Mining, Communications of the Institute of Information and Computing Machinery, Taiwan, 5, 101–106, 2002.
30. Lotfi Zadeh, The Key Roles of Information Granulation and Fuzzy Logic in Human Reasoning, In: 1996 IEEE International Conference on Fuzzy Systems, September 8–11, 1, 1996.
31. Lotfi Zadeh, Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90 (1997), 111–127.
32. L. A. Zadeh, Fuzzy Sets and Information Granularity, In: M. Gupta, R. Ragade, and R. Yager (eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, Amsterdam, 1979, 3–18.
33. M. Viveros, Extraction of Knowledge from Databases, Thesis, California State University at Northridge, 1989.
34. David F. C. Brewer and Michael J. Nash, "The Chinese Wall Security Policy," IEEE Symposium on Security and Privacy, Oakland, May 1989, pp. 206–214.
35. Z. Pawlak, "On Conflicts," International Journal of Man-Machine Studies, 21, pp. 127–134, 1984.
36. Z. Pawlak, Analysis of Conflicts, Joint Conference of Information Science, Research Triangle Park, North Carolina, March 1–5, 1997, 350–352.
37. Lotfi A. Zadeh, "Some Reflections on Information Granulation and its Centrality in Granular Computing, Computing with Words, the Computational Theory of Perceptions and Precisiated Natural Language," In: Data Mining, Rough Sets and Granular Computing, T. Y. Lin, Y. Y. Yao, L. Zadeh (eds.), Physica-Verlag, 2002.
Rough Sets: Trends and Challenges (Extended Abstract)
Andrzej Skowron¹ and James F. Peters²
¹ Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
[email protected]
² Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada
[email protected]
Abstract. We discuss how approximation spaces considered in the context of rough sets and information granule theory have evolved over the last 20 years from simple approximation spaces to more complex ones. Some research trends and challenges for the rough set approach are outlined in this paper. The study of the evolution of approximation space theory and applications is considered in the context of rough sets introduced by Zdzisław Pawlak and the notions of information granulation and computing with words formulated by Lotfi Zadeh. The deepening of our understanding of information granulation, and the introduction of new approaches to concept approximation, pattern identification, pattern recognition, pattern languages, clustering, information granule systems, and inductive reasoning, have been aided by the introduction of a calculus of information granules based on rough mereology. Central to rough mereology is the inclusion relation "to be a part to a degree." This calculus has grown out of an extension of what S. Leśniewski called mereology (the study of what it means to be a part of).
1 Introduction
One of the basic concepts of rough set theory [18] is the indiscernibility relation, defined by means of information about objects of interest. The indiscernibility relation is used to define set approximations [17,18]. Several generalizations of the rough set approach, based on approximation spaces defined by tolerance and similarity relations or by a family of indiscernibility relations, have been reported (for references see the papers and bibliography in [16,23]). Rough set approximations have also been generalized for preference relations and rough-fuzzy hybridizations (see, e.g., [39]). Generalized approximation spaces are discussed in [35], where uncertainty and inclusion functions are introduced. The approach based on inclusion functions has been generalized to the rough mereological approach (see, e.g., [22,24,25]). The inclusion relation x μ_r y, with the intended meaning "x is a part of y to a degree r," has been taken as the basic notion of rough mereology, which is a generalization of the Leśniewski mereology [8].
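Since set approximations via indiscernibility recur throughout this paper, a minimal sketch may help fix ideas; the toy table and helper names below are invented for illustration:

```python
# A minimal sketch (invented toy data) of Pawlak's indiscernibility relation
# and the lower/upper approximations it defines.

objects = {
    1: ("high", "yes"), 2: ("high", "yes"), 3: ("low", "no"),
    4: ("low", "no"),  5: ("high", "no"),
}

# Indiscernibility classes: objects with identical attribute vectors.
classes = {}
for x, desc in objects.items():
    classes.setdefault(desc, set()).add(x)
partition = list(classes.values())  # [{1, 2}, {3, 4}, {5}]

def lower(X):
    """Union of classes fully contained in the target concept X."""
    return set().union(*([c for c in partition if c <= X] or [set()]))

def upper(X):
    """Union of classes that intersect the target concept X."""
    return set().union(*([c for c in partition if c & X] or [set()]))

X = {1, 2, 5}                 # a crisp (definable) concept
print(lower(X), upper(X))     # {1, 2, 5} {1, 2, 5}
Y = {1, 3}                    # a rough concept
print(lower(Y), upper(Y))     # set() {1, 2, 3, 4}
```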
In the following sections we discuss the impact of information granulation and inductive reasoning on the concept approximation process. As a result of inductive reasoning, one cannot define inclusion degrees of object neighborhoods directly in the target concepts, but only in some patterns relevant to such concepts (e.g., left-hand sides of decision rules) (see, e.g., [30], [44], [38], [2], [33]). Such degrees, together with degrees of inclusion of patterns in target concepts, make it possible to define outputs of classifiers for newly classified objects. Using the constructed classifiers, one can define new patterns relevant to concept approximation [2,33]. Research in this direction has been initiated recently. It can bring new interesting results related to the rough set approach in inductive reasoning, in particular for adaptive learning of concepts. Let us note that the rough set approach should also be developed for more complex data, such as decision tables with transformations (describing deformations of objects) preserving the classification of objects, or with complex decisions or attribute values (e.g., plans or models of processes). Some issues of the rough set approach concerned with data and domain knowledge represented in distributed environments are outlined in the following sections.
2 Rough-Mereological Approach to Approximation of Information Granules
Rough mereology offers a methodology for synthesis and analysis of objects in distributed environments of intelligent agents, in particular for the synthesis of objects satisfying a given specification to a satisfactory degree, or for control in such complex environments. Moreover, rough mereology has recently been used for developing foundations of the information granule calculus, an attempt at formalizing the paradigm of computing with words based on perception, recently formulated by Lotfi Zadeh [45,46,47]. The rough mereological approach, built on the basis of the inclusion relation "to be a part to a degree," generalizes the rough set and fuzzy set approaches (see, e.g., [22], [24], [25], [26]). Such inclusion relations, called rough inclusions, can be used to define other basic concepts like closeness of information granules, their semantics, indiscernibility and discernibility of objects, information granule approximation and approximation spaces, perception structure of information granules, as well as the notion of ontology approximation. For details the reader is referred to [15]. The rough inclusion relations, together with operations for constructing new information granules from already existing ones, create the core of a calculus of information granules.¹
¹ Note that the rough inclusion relations should be extended to newly constructed information granules.
A distributed multi-agent framework makes it possible to create a relevant computational model for the calculus of information granules. Agents (information sources) provide us with information granules that must be transformed, analyzed and built into structures that support problem solving. In such a computational model, approximation spaces play an important
role, because information granules received by agents must be approximated (to be understandable by them) before they can be transformed (see, e.g., [24], [36], [15], [21]).² Developing calculi of information granules for approximate reasoning is a challenge important for many applications, including control of autonomous vehicles [43] and line-crawling robots [19], web mining and spatio-temporal data mining [30], design automation, sensor fusion [20], approximation neuron design [21,15], creating approximate views of relational databases and, in general, embedding in intelligent systems the ability to reason with words and to reason based on perception [45,46,47]. Some steps in this direction have been taken. Methods for the construction of approximate reasoning schemes (AR-schemes) have been developed. Such AR-schemes are information granules that are clusters of exact constructions (derivations). Reasoning with AR-schemes makes it possible to obtain results satisfying a given specification up to a satisfactory degree (not necessarily exactly) (see, e.g., [24,15,30,31]). Methods based on hybridization of rough sets with fuzzy sets, neural networks, the evolutionary approach, or case-based reasoning are especially useful in inducing AR-schemes. Let us finally note that inducing relevant calculi of information granules also includes such complex tasks as the discovery of relevant operations on information granules or of rough inclusion measures. This is closely related to problems of perception and reasoning based on perception [47].
Using rough inclusions, one can generalize the approximation operations for sets of objects, known in rough set theory, to arbitrary information granules. The approach is based on the following reasoning. Assume p is an inclusion degree, G = {g_t}_t is a given family of information granules, and g is a granule from a given information granule system S. Recall that inclusion degrees are partially ordered by a relation ≤. Now, assuming p < q, one can consider two approximations of a given information granule g by G. The (G, q)-lower approximation of g is defined by³
LOW_{G,p,q}(g) = Make_granule({g_t : ν_{q′}(g_t, g) and q′ ≥ q}).   (1)
The (G, q)-upper approximation of g is defined by
UPP_{G,p,q}(g) = Make_granule({g_t : ν_{p′}(g_t, g) and p′ > p}).   (2)
The definition of a parameterized approximation space given in [35] is an example of the introduced notion of information granule approximation. The presented approach can be generalized to approximation spaces in inductive reasoning.
² Recently, relationships between the rough set approach [37] and the information flow approach to logic of distributed systems [1] have been reported.
³ The Make_granule operation is a fusion operation on collections of information granules. A typical example of Make_granule is the set-theoretical union used in rough set theory. Another example of a Make_granule operation is realized by classifiers.
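As a concrete (and deliberately simplified) reading of (1) and (2), the sketch below takes ν to be the standard rough inclusion |X ∩ Y|/|X| on finite sets and Make_granule to be set union; all names are ours, not the authors':

```python
# A minimal sketch (our own simplification) of granule approximation:
# nu is standard rough inclusion on finite sets, Make_granule is union.

def nu(x, y):
    """Degree to which granule x is included in granule y."""
    return len(x & y) / len(x) if x else 1.0

def make_granule(granules):
    return set().union(*granules) if granules else set()

def lower(G, g, q):
    """(G, q)-lower approximation: fuse granules included in g to degree >= q."""
    return make_granule([t for t in G if nu(t, g) >= q])

def upper(G, g, p):
    """(G, p)-upper approximation: fuse granules included in g to degree > p."""
    return make_granule([t for t in G if nu(t, g) > p])

G = [{1, 2}, {2, 3}, {4, 5}, {5, 6}]
g = {1, 2, 3, 4}
print(lower(G, g, 1.0))   # {1, 2, 3}: granules fully inside g
print(upper(G, g, 0.0))   # {1, 2, 3, 4, 5}: granules overlapping g
```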
3 Rough Sets and Inductive Reasoning
In inductive reasoning we would like to approximate concepts over a universe of objects, say U∞, wider than the universe U of objects in a given decision system. In other words, assuming U ⊂ U∞, we would like to approximate concepts over U∞ which are extensions of decision classes in a given decision system. In this section, we present the relevant approximation spaces for such concepts and show how to induce classifiers approximating those concepts. We also discuss the relationships between the whole process and different approaches pursued in fields like machine learning, pattern recognition, data mining and knowledge discovery [10,6,29]. The main observation is that, in the considered case, it is also necessary to induce a relevant approximation space. Such a space is usually different from the partition defined by the conditional attributes of a given decision system. It consists of some subsets of U∞, called neighborhoods of objects. It should be emphasized that neighborhoods usually create a covering of U∞, not necessarily a partition. They are defined by patterns chosen from some relevant pattern languages. In practical applications it is often necessary to specify a data model using a particular description in a pattern language. Moreover, the description is usually consistent only on a given part of the model, since the whole original model is often only partially specified.⁴
⁴ We will discuss this issue in more detail later in this section.
In order to indicate that a given model is specified by a particular description, we use the term description model. The structure of the pattern languages, and the patterns themselves, should be discovered. The whole process is quite complex and is illustrated in Fig. 1, where: A = (U, A, d) denotes a decision system; A_train and A_test are training and testing subsystems of A, respectively; L = {L_i}_{i∈I} is a family of pattern languages; Q = {Q_j}_{j∈J} is a family of quality measures for description models; M is a description model covering objects in U; and C is a classifier obtained from M and covering (almost) the whole universe U∞. Elements of L_i are formulas called patterns. Patterns define, in a given decision system, the sets of objects in which they are satisfied. Description models describe decision classes of A by using patterns from L_i and some inclusion measures of those patterns in decision classes. The description models can be built by means of, e.g., decision rules over descriptors from L_i. Quality measures can be used as criteria for tuning the model. For given L_i and Q_j, one can search for a description model using patterns from L_i which is (sub-)optimal with respect to the measure Q_j. However, the goal is to induce the description model relevant for the induced classifier, covering the whole universe of objects. This, in particular, makes it necessary to tune parameters of the description quality measure. There are many ways of specifying quality measures. For example, a measure Q_j can be specified using the minimum description length principle [27,40], where one estimates the quality of approximation as well as the size of the defined description model. The minimum description length principle
requires choosing a description of the smallest size from among those descriptions with the same approximation quality. In this case, the quality measure depends on two arguments. The first argument represents the quality of approximation (e.g., using the positive region of decision classes or an entropy measure). The second argument represents measures based on the model size. A proper balance between these two arguments is generally obtained using training data. The tuning may involve thresholds for degrees of inclusion of patterns from L_i in decision classes, or for the positive region size.
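A schematic rendering (entirely our own; the weights and names are invented) of such a two-argument quality measure:

```python
# A schematic sketch (our own) of a two-argument MDL-style quality measure:
# one term for approximation quality, one for description model size.

def mdl_quality(error, model_size, alpha=1.0, beta=0.1):
    """Lower is better; alpha/beta balance fit against model size
    and are typically tuned on training data."""
    return alpha * error + beta * model_size

# Two candidate description models with equal error: MDL prefers the smaller.
print(mdl_quality(error=0.10, model_size=12))  # 1.3
print(mdl_quality(error=0.10, model_size=5))   # 0.6
```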
[Fig. 1 diagram: the decision system A is decomposed into A_train and A_test; a selecting strategy chooses a pattern language L_i and a quality measure Q_j for the description model builder; the resulting description model M is extended to a classifier C; the classifier quality is estimated on A_test and returned as Q_info, which drives tuning until the quality is satisfactory (STOP).]
Fig. 1. Approximation space and classifier construction using rough sets.
The whole process, presented in Fig. 1, can be viewed as a search for a relevant approximation space. As we have mentioned before, such an approximation space consists of neighborhoods of objects from U. Certainly, such an approximation space is more general than the one discussed in [18]. The induced description model should be extended to a classifier of all objects from the whole universe U∞, not only from U (the reader is referred, e.g., to [10], [15] for the definition of classifiers). Recall that for any object to be classified, it is necessary to compute its degree of inclusion in every pattern from the description model. In the case of new objects (outside of U), these degrees can suggest conflicting decisions and, together with the degrees of pattern inclusion in decision classes, create the input for the conflict resolution strategy necessary to compute the classifier output. Next, the induced classifier is tested on objects from A_test. Information Q_info about the classifier behavior is returned by the classifier quality estimation module. If Q_info shows that the classifier quality is unsatisfactory, it is used to tune parameters in the different modules presented in Fig. 1 and to reconstruct the
classifier to obtain a new one with better quality. In addition, matching strategies for objects and patterns, as well as parameters for the conflict resolution strategy, can also be tuned. The parameters involved in the tuning process can, for instance, be inclusion degree thresholds, parameters characterizing approximation quality, or parameters measuring the description model size. As a typical example, one can consider the language of patterns consisting of conjunctions of descriptors over a selected set of attributes. A more complex pattern language can include conjunctions of formulas that are disjunctions of descriptor conjunctions. The induced approximation spaces can be treated as an example of complex information granules.
3.1 Classifiers
An important class of information granules is created by classifiers, i.e., algorithms classifying objects into decision classes. The classifier construction based on an approximation space induced from a given decision table DT = (U, A, d) can be described as follows [30]:
1. First, one constructs granules G_j corresponding to each particular decision j = 1, ..., r by taking a collection {g_ij : i = 1, ..., k_j} of left-hand sides of decision rules for the given decision.
2. Let E be a set of elementary granules (e.g., defined by conjunctions of descriptors) over A = (U, A). We can now consider, for any e ∈ E, a granule denoted by Match(e, G_1, ..., G_r), being a collection of coefficients ε_ij, where ε_ij = 1 if the set of objects defined by e in A is included in the meaning of g_ij in A, i.e., Sem_A(e) ⊆ Sem_A(g_ij), and 0 otherwise. Hence, the coefficient ε_ij is equal to 1 if and only if the granule e matches the granule g_ij in A.
3. Let us now denote by Conflict_res an operation (resolving conflicts between decision rules recognizing elementary granules) defined on granules of the form Match(e, G_1, ..., G_r), with values in the set of possible decisions 1, ..., r. Hence, Conflict_res(Match(e, G_1, ..., G_r)) is equal to the decision predicted by the classifier Conflict_res(Match(•, G_1, ..., G_r)) on the input granule e.
Parameters to be tuned in classifiers are voting strategies, matching strategies of objects against rules, as well as other parameters like closeness of granules in the target granule. The reader can easily describe more complex classifiers by means of information granules. For example, one can consider soft instead of crisp inclusion between elementary information granules representing classified objects and the left-hand sides of decision rules, or soft matching between recognized objects and left-hand sides of decision rules. One can use the constructed classifiers in searching for a new approximation space relevant for the target concept approximation on the universe of objects
extended by the testing objects. New neighborhoods in such an approximation space are defined by parameterized patterns expressed in the language used for classifier construction (e.g., one can consider parameterized patterns defined by formulas used for conflict resolution between voting decision rules [2]). The relevant patterns, and hence the neighborhoods for concept approximation, can be obtained by tuning the parameters of such parameterized patterns. The new patterns can also be used in adaptive reconstruction of classifiers. Developing rough set based strategies for adaptive concept approximation is a challenge. Some results on the rough set approach to adaptive classifier construction have been reported [44].
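A toy illustration of steps 1–3 of the classifier construction in Sect. 3.1 (the rules and data are invented, and vote counting is just one of many possible Conflict_res choices):

```python
# A minimal sketch (invented rules/data) of the Match / Conflict_res classifier.
# A rule's left-hand side is a set of descriptors (attribute, value).

RULES = {  # decision j -> left-hand sides g_ij
    "yes": [{("outlook", "sunny")}, {("humidity", "normal")}],
    "no":  [{("outlook", "rainy"), ("wind", "strong")}],
}

def match(e, rules):
    """epsilon_ij = 1 iff elementary granule e satisfies the rule's lhs."""
    return {j: [int(lhs <= e) for lhs in lhss] for j, lhss in rules.items()}

def conflict_res(eps):
    """Resolve conflicts by counting matching rules per decision (a vote)."""
    return max(eps, key=lambda j: sum(eps[j]))

# An object described as an elementary granule (conjunction of descriptors).
e = {("outlook", "sunny"), ("humidity", "normal"), ("wind", "strong")}
eps = match(e, RULES)          # {'yes': [1, 1], 'no': [0]}
print(conflict_res(eps))       # 'yes'
```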
4 Rough Sets, Boolean Reasoning, and Approximate Boolean Reasoning
In this section we discuss a methodology that makes it possible to search for patterns defining neighborhoods of approximation spaces relevant for concept approximation. The ability to discern between perceived objects is important for constructing many different kinds of reducts, making it possible to define relevant patterns for concept approximations. The idea of Boolean reasoning is based on constructing, for a given problem P, a corresponding Boolean function f_P with the following property: the solutions of the problem P can be decoded from the prime implicants of the Boolean function f_P [4]. Let us mention that to solve real-life problems it is necessary to deal with Boolean functions having a large number of variables. A successful methodology based on the discernibility of objects and Boolean reasoning has been developed for computing many different kinds of reducts and their approximations, for inducing decision rules and association rules, for discretization of real-valued attributes, symbolic value grouping, searching for new features defined by oblique hyperplanes or higher-order surfaces, pattern extraction from data, as well as conflict resolution and negotiation (for references see, e.g., [7], [29], [11], [12], [32], [40] and the bibliographies in [16], [23], [7], [41]). Most of the problems related to generation of the above-mentioned entities are (at least) NP-complete or NP-hard. However, it was possible to develop efficient heuristics returning suboptimal solutions of the problems. The results of experiments on many data sets are very promising. They show the very good quality of solutions generated by the heuristics in comparison with other methods reported in the literature (e.g., with respect to the classification quality on unseen objects). Moreover, they are very efficient with respect to the time necessary for computing the solution. It is important to note that, for some problems, the methodology allows one to construct heuristics having a very important approximation property, which can be formulated as follows: expressions generated by the heuristics (i.e., implicants) close to prime implicants define approximate solutions of the problem. A challenging issue is to develop a methodology, called approximate Boolean reasoning, for deriving such heuristics feasible for a wide class of problems related to rough set applications, e.g., in data mining. Such an approach is also suggested in [28]. However, the problems we are dealing with require the analysis of very large
formulas, for which general-purpose heuristics will not be feasible. One possibility for developing feasible heuristics for such problems is to use domain knowledge about them. Some promising results in this direction have been obtained [11,12,7,29].
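As a toy illustration of Boolean reasoning, the sketch below (invented table, our own code) encodes reduct finding for a tiny decision table as a discernibility function and enumerates its prime implicants by brute force; real heuristics avoid this exponential search:

```python
# A toy sketch (invented table) of Boolean reasoning for reducts:
# build the discernibility function f_P and find its prime implicants
# (minimal attribute sets hitting every discernibility clause) by brute force.
from itertools import combinations

ATTRS = ["a", "b", "c"]
TABLE = [  # (attribute values, decision)
    ({"a": 0, "b": 0, "c": 1}, 0),
    ({"a": 1, "b": 0, "c": 1}, 1),
    ({"a": 0, "b": 1, "c": 0}, 1),
]

# Clauses of f_P: attributes discerning each pair of objects with
# different decisions.
clauses = [
    {at for at in ATTRS if u[at] != v[at]}
    for (u, du), (v, dv) in combinations(TABLE, 2) if du != dv
]

def is_reduct(subset):
    return all(clause & subset for clause in clauses)

candidates = [set(s) for k in range(1, len(ATTRS) + 1)
              for s in combinations(ATTRS, k) if is_reduct(set(s))]
reducts = [s for s in candidates if not any(t < s for t in candidates)]

print(clauses)   # [{'a'}, {'b', 'c'}]
print(reducts)   # [{'a', 'b'}, {'a', 'c'}]: the prime implicants of f_P
```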
5 Learning from Sparse Data
One of the main issues in learning theory is to develop a methodology for reasoning from sparse data [42,3]. This research direction has recently been suggested by statisticians as a new direction, one not based on searching for stochastic models generating the data. Developing rough set based methods for reasoning from sparse data is a challenge for the rough set approach. Some results in this direction have been reported [5], [7], [23]. However, much more work should be done in this direction. For example, evolutionary strategies discovering subspaces of features (from which relevant features can be selected) should be developed. Such strategies can be gained from experience with learning systems that search for such subspaces. We have suggested that domain knowledge can also be used in searching for such strategies. Then inducing productions from data becomes feasible because, in a sense, there is a sufficiently small distance between the feature spaces of the premises and conclusions of such rules. Next, AR-schemes can be derived from the productions that have been discovered [24,30,36,33].
6 Conclusions
We have discussed different aspects of approximation spaces and some current research trends and challenges related to concept approximation issues. Among these trends and challenges are those related to (i) approximate Boolean reasoning, (ii) inducing operations on information granules, inclusion and closeness measures, productions and AR-schemes, (iii) domain knowledge approximation, (iv) reasoning from sparse data, (v) adaptive learning of concepts, (vi) computational models for calculi on information granules, and (vii) rough-neuro computing. Many of these trends and challenges are closely related to computing with words and the computational theory of perceptions [47]. All of them are concentrated around the search for calculi of information granules for approximate reasoning. We have shown that approximation spaces developed and investigated in rough set theory play an important role in all the discussed research directions.
Acknowledgements. The research of Andrzej Skowron has been supported by the State Committee for Scientific Research of the Republic of Poland (KBN), research grant 8 T11C 025 19, and by the Wallenberg Foundation grant. The research of James Peters has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), research grant 185986.
References
1. Barwise, J., Seligman, J.: Information Flow: The Logic of Distributed Systems, Cambridge University Press, Tracts in Theoretical Computer Science 44, 1997.
2. Bazan, J., Nguyen, H.S., Skowron, A., Szczuka, M.: A view on rough concept approximations (in this volume).
3. Breiman, L.: Statistical modeling: The two cultures, Statistical Science 16(3), 2001, 199–231.
4. Brown, F.M.: Boolean Reasoning. Kluwer Academic Publishers, Dordrecht, 1990.
5. Duentsch, I., Gediga, G.: Rough Set Data Analysis: A Road to Non-invasive Knowledge Discovery. Methods Publishers, Bangor, UK, 2000.
6. Kloesgen, W., Żytkow, J. (eds.): Handbook of KDD, Oxford University Press, 2002.
7. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: [16], 1999, 3–98.
8. Leśniewski, S.: Grundzüge eines neuen Systems der Grundlagen der Mathematik. Fundamenta Mathematicae 14, 1929, 1–81.
9. Lin, T.Y., Yao, Y.Y., Zadeh, L.A. (eds.): Data Mining, Rough Sets and Granular Computing. Physica-Verlag, Heidelberg, 2002.
10. Mitchell, T.M.: Machine Learning. McGraw-Hill, Portland, 1997.
11. Nguyen, H.S.: Discretization of Real Value Attributes, Boolean Reasoning Approach, Ph.D. Dissertation, Warsaw University, 1997, 1–90.
12. Nguyen, H.S.: Efficient SQL-learning method for data mining in large data bases. IJCAI'99, 1999, 806–811.
13. Nguyen, H.S. and Skowron, A.: Quantization of real value attributes. Proceedings of the Second Joint Annual Conference on Information Sciences, Wrightsville Beach, North Carolina, USA, September 28–October 1, 1995, 34–37.
14. Pal, S.K., Pedrycz, W., Skowron, A., Swiniarski, R. (eds.): Rough-Neuro Computing. Neurocomputing: An International Journal (special volume) 36, 2001.
15. Pal, S.K., Polkowski, L., Skowron, A. (eds.): Rough-Neuro Computing: Techniques for Computing with Words. Springer-Verlag, Berlin, 2003 (to appear).
16. Pal, S.K., Skowron, A. (eds.): Rough Fuzzy Hybridization: A New Trend in Decision-Making. Springer-Verlag, Singapore, 1999.
17. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11, 1982, 341–356.
18. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht, 1991.
19. Peters, J.F., Ahn, T.C., Degtyaryov, V., Borkowski, M., Ramanna, S.: Line-crawling robot navigation: Rough neurocomputing approach. In: C. Zhou, D. Maravall, D. Ruan (eds.), Fusion of Soft Computing and Hard Computing for Autonomous Robotic Systems. Physica-Verlag, Heidelberg, 2003 (to appear).
20. Peters, J.F., Ramanna, S., Borkowski, M., Skowron, A., Suraj, Z.: Sensor, filter and fusion models with rough Petri nets, Fundamenta Informaticae 47(3–4), 2001, 307–323.
21. Peters, J.F., Skowron, A., Stepaniuk, J., Ramanna, S.: Towards an ontology of approximate reason. Fundamenta Informaticae 51(1–2), 2002, 157–173.
22. Polkowski, L., Skowron, A.: Rough mereology: a new paradigm for approximate reasoning. International Journal of Approximate Reasoning 15(4), 1996, 333–365.
23. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 1–2. Physica-Verlag, Heidelberg, 1998.
24. Polkowski, L., Skowron, A.: Towards adaptive calculus of granules. In: [48], 1999, 201–227.
25. Polkowski, L., Skowron, A.: Rough mereological calculi of granules: A rough set approach to computation. Computational Intelligence 17(3), 2001, 472–492.
26. Polkowski, L., Skowron, A.: Rough-neuro computing. LNAI 2005, Springer-Verlag, Berlin, 2002, 57–64.
27. Rissanen, J.J.: Modeling by shortest data description, Automatica 14, 1978, 465–471.
28. Selman, B., Kautz, H., McAllester, D.: Ten challenges in propositional reasoning and search. IJCAI'97 1, Nagoya, Aichi, Japan, 1997, 50–54.
29. Skowron, A.: Rough sets in KDD. In: Z. Shi, B. Faltings, and M. Musen (eds.), 16th World Computer Congress (IFIP'2000): Proc. of Conf. on Intelligent Information Processing (IIP'2000), Publishing House of Electronic Industry, Beijing, 2000, 1–17.
30. Skowron, A.: Toward intelligent systems: Calculi of information granules. Bulletin of the International Rough Set Society 5(1–2), 2001, 9–30.
31. Skowron, A.: Approximate reasoning by agents in distributed environments. In: N. Zhong, J. Liu, S. Ohsuga, J. Bradshaw (eds.), Intelligent Agent Technology: Research and Development, World Scientific, Singapore, 2001, 28–39.
32. Skowron, A., Nguyen, H.S.: Boolean reasoning scheme with some applications in data mining. LNCS 1704, 1999, 107–115.
33. Skowron, A., Nguyen, T.T.: Rough set approach to domain knowledge approximation (in this volume).
34. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: R. Słowiński (ed.), Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory. Kluwer, Dordrecht, 1992, 311–362.
35. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 1996, 245–253.
36. Skowron, A., Stepaniuk, J.: Information granules: Towards foundations of granular computing. International Journal of Intelligent Systems 16(1), 2001, 57–86.
37. Skowron, A., Stepaniuk, J., Peters, J.: Rough sets and infomorphisms: Towards approximation of relations in distributed environments. Fundamenta Informaticae, 2003 (to appear).
38. Skowron, A., Szczuka, M.: Approximate reasoning schemes: Classifiers for computing with words. Proceedings of SMPS 2002, Physica-Verlag, Heidelberg, 2002, 338–345.
39. Słowiński, R., Greco, S., Matarazzo, B.: Rough set analysis of preference-ordered data. LNAI 2475, Springer-Verlag, Heidelberg, 2002, 44–59.
40. Ślęzak, D.: Approximate Decision Reducts. Ph.D. Thesis, Warsaw University, 2002 (in Polish).
41. Swiniarski, R., Skowron, A.: Rough set methods in feature selection and recognition. Pattern Recognition Letters 24(6), 2003, 833–849.
42. Vapnik, V.: Statistical Learning Theory. Wiley, New York, 1998.
43. WITAS project, http://www.ida.liu.se/ext/witas/eng.html, 2001.
44. Wróblewski, J.: Adaptive Methods of Object Classification. Ph.D. Thesis, Warsaw University, 2002 (in Polish).
45. Zadeh, L.A.: Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems 4, 1996, 103–111.
46. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, 1997, 111–127.
47. Zadeh, L.A.: A new direction in AI: Toward a computational theory of perceptions. AI Magazine 22(1), 2001, 73–84.
48. Zadeh, L.A., Kacprzyk, J. (eds.): Computing with Words in Information/Intelligent Systems 1–2. Physica-Verlag, Heidelberg, 1999.
A New Development on ANN in China – Biomimetic Pattern Recognition and Multi Weight Vector Neurons
Shoujue Wang
Lab of Artificial Neural Networks, Institute of Semiconductors, CAS, Beijing 100083, China
[email protected]
Abstract. A new model of pattern recognition principles – Biomimetic Pattern Recognition, which is based on "matter cognition" instead of "matter classification" – has been proposed. As an important means of realizing Biomimetic Pattern Recognition, the mathematical model and analysis methods of ANNs have seen breakthroughs: a novel all-purpose mathematical model has been advanced, which can simulate all kinds of neuron architectures, including the RBF and BP models; at the same time, this model has been realized in hardware. The high-dimensional space geometry method, a new means of analyzing ANNs, has also been investigated.
Keywords: Neural network, high-dimensional space geometry, pattern recognition
1 Introduction
Artificial neural networks have always been an important method of pattern recognition, owing to their powerful ability to approximate functions and their excellent self-learning, partitioning and robustness properties. But on one hand, ANNs lack an all-purpose mathematical model that can simulate all kinds of neuron architectures. Furthermore, after a network has been built, the resulting huge multi-variable nonlinear equations are very difficult to analyze mathematically, because the mathematical description of each neuron is itself a multi-variable nonlinear equation. Complex multi-variable nonlinear equations are always quite hard to analyze and solve, although many methods of mathematical analysis have been advanced by scholars [1]. This condition limits the in-depth development of ANNs from the analytic point of view. On the other hand, traditional pattern recognition is based on "matter classification," which uses "optimal separation" as its main principle; this differs from the human function of cognition. In order to solve these problems, a novel all-purpose mathematical model of the neuron and a new analysis method for ANNs have been proposed. At the same time, a novel mode of pattern recognition, Biomimetic Pattern Recognition, has been advanced and applied in practice. The superiority of this method has been preliminarily shown by experimental results.
2 The Basic All-Purpose Mathematical Model of Neurons
As we know, the basic unit of an ANN is the neuron, whether it is simulated in hardware or in software on a general-purpose computer. Accordingly, an ANN's performance is primarily decided by the basic computation method and function of its neurons. Based on the activation and inhibition mechanism of brain cells, a hyperplane neuron model and a hypersphere neuron model were proposed earlier [2]. The mathematical model of the hyperplane neuron is as follows:
Y = f(Σ_{i=0}^{n} W_i X_i − θ)   (1)
where Y denotes the neuron output, f is an activation function (nonlinear function) of the neuron, which may be a step function, X is the input vector of the neuron, W is the neuron's weight vector, and θ is the neuron's activation threshold. The mathematical model of the hypersphere (RBF) neuron can be expressed as follows:
Y = f(Σ_{i=0}^{n} (W_i − X_i)^2 − θ^2)   (2)
Experiments have shown that, in applications of pattern recognition or function fitting, neural networks described by (2) have better performance than neural networks described by (1). This paper discusses a basic mathematical model with better commonality, universal functionality, and easy implementation in general-purpose neural network hardware. Hardware based on this model can have better adaptability and higher performance. The text below discusses a novel basic algorithm for high-order hypersurface neural networks, which has been applied in the design and practice of the CASSANDRA-II neurocomputer.
2.1 Discussion of the Commonality of the Hypersurface Neuron Basic Mathematical Model
The basic mathematical model of the hypersurface neuron for the neurocomputer must satisfy the following conditions: (1) the model covers both the traditional hyperplane neuron and RBF neural networks; (2) the model makes it possible to implement many various hypersurfaces; (3) the model allows characteristic modification by adjusting a small number of parameters; (4) the model easily allows high-speed calculation in hardware. According to condition (1), the basic mathematical calculation of the general neuron must contain the basic calculations of both formula (1) and formula (2). Thus, the signal from each input node must have two weights: one is a direction weight and the other is a core weight. The following formula illustrates the structure of this model:
Y = f[Σ_{j=1}^{n} (W_ji(x_j − W′_ji)/|W_ji(x_j − W′_ji)|)^S · |W_ji(x_j − W′_ji)|^P − θ]   (3)
where Y denotes the output of the neuron, f denotes the activation function of the neuron, and θ is the threshold of the neuron. W_ji and W′_ji are the two weights from the j-th input node to the neuron. x_j is a positive input value from the j-th input node, and n is the dimension of the input space. S is a parameter determining the sign of a single term: if S = 0, the sign of a single term is always positive, and if S = 1, the sign of a single term is the same as the sign of W_ji(x_j − W′_ji). P is an index (exponent) parameter.
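A small numeric sketch (our reading of the reconstructed formula (3); the names are ours) showing how the two-weight term reduces to the hyperplane neuron (1) for S = 1, P = 1, W′ = 0, and to the hypersphere neuron (2) for S = 0, P = 2, W = 1:

```python
# A small sketch (our reading of formula (3)) of the multi-weight neuron.
# Each input j carries a direction weight w[j] and a core weight wp[j].

def neuron(x, w, wp, s, p, theta, f=lambda u: u):
    total = 0.0
    for xj, wj, wpj in zip(x, w, wp):
        t = wj * (xj - wpj)
        sign = (1.0 if t >= 0 else -1.0) if s == 1 else 1.0  # role of S
        total += sign * abs(t) ** p                          # role of P
    return f(total - theta)

x = [0.2, 0.8]
# S=1, P=1, wp=0: reduces to the hyperplane neuron of formula (1).
print(neuron(x, w=[1.0, -1.0], wp=[0.0, 0.0], s=1, p=1, theta=0.0))  # -0.6
# S=0, P=2, w=1: reduces to the hypersphere (RBF) neuron of formula (2).
print(neuron(x, w=[1.0, 1.0], wp=[0.5, 0.5], s=0, p=2, theta=0.0))   # 0.18
```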
Obviously, formula (1) and formula (2) are only two special cases of formula (3). When the parameters take different values, the neuron's hypersurface takes on different shapes, as in the following examples.
[Figs. 1–6 (hypersurface plots) show the shapes for: Fig. 1: p = 1/2, w11 = 0.5, w21 = 0.5, w31 = 0.5; Fig. 2: p = 1, w11 = 0.5, w21 = 0.5, w31 = 0.5; Fig. 3: p = 2, w11 = 0.5, w21 = 0.5, w31 = 0.5; Fig. 4: p = 3, w11 = 0.5, w21 = 0.5, w31 = 0.5; Fig. 5: p = 2, w11 = 1.0, w21 = 1.0, w31 = 1/3; Fig. 6: p = 2, w11 = 1/3, w21 = 1.0, w31 = 1.0.]
2.2 The General Calculation of Neural Network Hardware Based on Multi-Weight Vector Neurons
The author created the general CASSANDRA-II neurocomputer hardware based on formula (5), which is suitable for traditional BP networks, RBF networks and various high-order hypersurface neural networks. Its general calculation is as follows:
O_mi(t + 1) = F_ki{λ_i[C_i(ℜ) − θ_i]}   (4)
where
ℜ = Σ_{j=1}^{n} (W_ji(I_mj − W′_ji)/|W_ji(I_mj − W′_ji)|)^S · |W_ji(I_mj − W′_ji)|^P + Σ_{g=1}^{n} (W_gi(O_mg − W′_gi)/|W_gi(O_mg − W′_gi)|)^S · |W_gi(O_mg − W′_gi)|^P   (5)
According to formula (4), the CASSANDRA-II neurocomputer can simulate arbitrary neural network architectures with various neuron features (including hyperplane, hypersphere, various hyper-sausage, hypercube, and so on). (The details of the above formula are provided in reference [3].)
3 Analysis and Theory of High-Dimensional Space Geometry for ANN
Although high-dimensional space geometry has been developed for many years, systematic and applied geometric analytical methods have not been established yet. We try to make some preliminary supplements in this field. We give some axioms and theorems of high-dimensional space geometry, which can be used to analyze the behavior of ANNs.
3.1 The Corresponding Representation of the Neuron in High-Dimensional Space Geometry
Generally, a neuron can be understood as a simple calculation of the form
Y = f[Φ(x_1, x_2, …, x_n) − θ]   (6)
For the neuron of BP networks,
Φ(x_1, x_2, …, x_n) = Σ_{i=1}^{n} w_i x_i   (7)
and for the neuron of RBF networks,
Φ(x_1, x_2, …, x_n) = [Σ_{i=1}^{n} (x_i − w_i)^2]^{1/2}   (8)
So a neuron can be correspondingly represented as a hyperplane or a hypersurface in the high-dimensional space, whose equation is
Φ(x_1, x_2, …, x_n) − θ = 0   (9)
The basis of the neuron output function f is the distance between the input point in the input space and this hyperplane or hypersurface. Obviously, the geometric conception of the hyperplane and the hypersurface is quite effective for analyzing the behavior of ANNs.
3.2 The Basic Analytical Methods and Theorems of High-Dimensional Space Geometry

This part is omitted because of the paper length limit. It is discussed in detail in another paper [6].
4 The Theory and Application of Biomimetic (Topological) Pattern Recognition
Biomimetic Pattern Recognition (BPR) is a new model of pattern recognition based on "matter cognition" instead of "matter classification". This new model is much closer to the functioning of human beings than traditional statistical pattern recognition, which uses "optimal separating" as its main principle. The basic principle of Biomimetic Pattern Recognition is: in the feature space $R^n$, suppose set A is the point set including all samples in class A. If $x, y \in A$ and $\varepsilon > 0$ are given, there must exist a set B:

$B = \{x_1 = x, x_2, \ldots, x_{n-1}, x_n = y \mid \rho(x_m, x_{m+1}) < \varepsilon, \ \forall m \in [1, n-1], m \in N\} \subset A$
That is to say, there is no isolated point in set A, and A is connected. Traditional pattern recognition takes the optimal demarcation of different kinds of samples in the feature space as its target, whereas biomimetic pattern recognition takes the optimal cover of the distribution of one kind of samples in the feature space as its target. Take the two-dimensional case in the following chart:
In this chart, triangles are the samples to be identified, circles and crosses are two kinds of samples that differ from the triangles, folded lines are the partition modes of traditional BP-network pattern recognition, great circles are the partition modes of RBF networks (equivalent to the recognition mode of template matching), and the curves consisting of hyper sausages stand for the "recognition" modes of biomimetic pattern recognition. The distribution of the samples in the feature space is continuous, so the basic point of biomimetic pattern recognition is to analyze the relationships among the training samples. The differences between Biomimetic Pattern Recognition and traditional pattern recognition are shown in Table 1.
Table 1. Comparison of Traditional Pattern Recognition and BPR

Traditional Pattern Recognition | Biomimetic Pattern Recognition
The optimal classification of many kinds of samples | Cognizing different kinds of samples one by one
The distinction between one kind of samples and limited kinds of known samples | The distinction between one kind of samples and unlimited kinds of unknown samples
Based on the differences between samples | Based on the connection of homologous samples
To find the optimal classification surface | To find the optimal covering of homologous samples
The method used by biomimetic pattern recognition is the "High-Dimensional Space Complex Geometrical Body Cover Recognition Method". It studies the distribution of some kind of samples in the feature space and gives a reasonable cover, so that the samples can be "recognized". That is to say, in actual Biomimetic Pattern Recognition, to judge whether a point belongs to set Pa, an n-dimensional geometrical shape covering set Pa in the feature space needs to be constructed with software and hardware methods. This geometrical shape is a union of infinitely many n-dimensional hyperspheres with a constant k as radius and infinitely many points as centers. More precisely, it is the topological product of set A and an n-dimensional hypersphere. According to dimension theory [4], to divide an n-dimensional space into two parts, the interface must be an (n−1)-dimensional hyperplane or hypersurface. A neuron in Artificial Neural Networks (ANN) can construct an (n−1)-dimensional hyperplane or hypersurface in n-dimensional space. Moreover, from the foregoing theory, a neuron can construct many kinds of complex closed hypersurfaces, and a multi-weight neuron can provide a covered range of complicated shape in the feature space based on its multiple weight vectors. Therefore, an ANN using the analysis methods of high-dimensional space geometry is a very appropriate way to implement Biomimetic Pattern Recognition.

The applications of Biomimetic Pattern Recognition using ANNs are very broad; one of them, cognizing objects, is introduced in the following. In this applied example, cognized objects (such as naval ship, tank, bus, bow, horse, sheep) on a horizontal plane or at sea level are tested to be cognized from different directions. The sampling process is to collect bmp images sampled from different directions (a continuous mapping) and then compress them into 256-dimensional samples in the feature space. Since the observation directions are horizontal, the direction can only vary in one dimension, so the sample distribution in the feature space is approximately a one-dimensional manifold. Considering some small changes in other directions, the covering shape of a certain kind of object in the feature space can be regarded as the topological product of a one-dimensional manifold homeomorphic to an annulus and a 256-dimensional hypersphere. It can be described as follows: there is a point set Pa in the 256-dimensional feature space such that the distance between any point in Pa and a certain closed space curve is less than a certain constant k. The space curve covers the set S including all samples. Then we get

$S = \{x \mid x = S_i\ (i = 1, 2, \ldots, \text{the total number of samples})\}$

$Pa = \{x \mid \rho(x, y) \leq k, y \in A, x \in R^n\}$

where

$A = \{x \mid x = x_i, i = 1, 2, \ldots, n, n \in N, \rho(x_m, x_{m+1}) < \varepsilon, \rho(x_1, x_n) < \varepsilon, \varepsilon > 0, n-1 \geq m \geq 1, m \in N\}, \quad A \subset R^n, \ S \subset A$
To cover Pa approximately with several neurons in an ANN, the space curve A is replaced by several line segments. Then the covering of every neuron is a topological product of a line segment and an n-dimensional hypersphere. Consider the original sample set S, and let S′ be a subset of S with j elements, as follows:

$S' = \{x \mid x = S'_i\ (i = 0, 1, 2, \ldots, j), \ \rho(S'_i, S'_{i+1}) \leq d \leq \rho(S'_{i-1}, S'_{i+1}), \ \text{where } d \text{ is a selected constant}, \ S'_0 = S'_j\}, \quad S' \subset S$
Let j neurons cover Pa approximately; then the covering of the i-th neuron, $P_i$, is:

$P_i = \{x \mid \rho(x, y) \leq k, y \in B_i, x \in R^n\}$

$B_i = \{x \mid x = \alpha S'_i + (1-\alpha)S'_{i+1}, \ \alpha \in [0,1]\}$

The covering of all j neurons is $P'_a = \bigcup_{i=0}^{j-1} P_i$, which represents hypersurfaces shaped like sausages; hence it is called the "Hyper Sausage Neuron" (HSN).
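As an illustration, here is a minimal Python sketch (assuming NumPy and a Euclidean metric ρ; the function names are our own, not from the CASSANDRA-II implementation) of the hyper-sausage membership test defined by $P_i$ and $B_i$:

```python
import numpy as np

def hsn_covers(x, s_a, s_b, k):
    """One hyper-sausage neuron: is x within distance k of the line
    segment B_i between consecutive samples s_a = S'_i, s_b = S'_{i+1}?"""
    seg = s_b - s_a
    # Projection of x onto the segment, clipped so alpha stays in [0, 1].
    alpha = np.clip(np.dot(x - s_a, seg) / np.dot(seg, seg), 0.0, 1.0)
    nearest = s_a + alpha * seg            # closest point of B_i to x
    return np.linalg.norm(x - nearest) <= k

def bpr_recognize(x, segments, k):
    """x is 'recognized' if any of the j sausage coverings P_i contains it."""
    return any(hsn_covers(x, a, b, k) for a, b in segments)
```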
In the experiments, the following eight objects were used: lion, tiger, tank, car, etc., as shown in Figure 7. The objects were rotated 360° for sampling, and each object was sampled 400 times, so the first sample set has eight kinds of samples and 3200 samples in total. Sampling was then repeated a second time, giving the second sample set with 3200 samples.
Fig. 7.
Then we used six more objects — cat, dog, zebra, etc. — as shown in Figure 8. Each object was sampled 400 times using the above method, so a third sample set with 2400 samples was created. The main steps and results of the experiments are as follows:
Fig. 8.
1) The training sets were chosen from the first sample set with different distances between every two adjoining samples; they contain 338 samples in eight classes altogether. 2) Neural networks corresponding to the eight kinds of objects — Pa, Pb, etc. — were constructed with a unified distance parameter k, and the number of neurons equals the number of samples in the training set. 3) All 6400 samples in the first and second sample sets were used as testing samples for calculating the correct recognition rate. The resulting correct recognition rate is 99.87%; no sample was recognized incorrectly, and only 0.13% were rejected. 4) All 8800 samples in the three sample sets were used as testing samples for calculating the error recognition rate, and the error recognition rate is 0. The comparison of the experimental results of RBF-SVM [5] and HSN-BPR with decreasing numbers of training samples is shown in Table 2.

Table 2. The Results of RBF-SVM and BPR
Amount of Training Samples | RBF-SVM: SV | RBF-SVM: correct rate | BPR: HSN | BPR: correct rate
338 | 2598 | 99.72% | 338 | 99.87%
251 | 1925 | 99.28% | 251 | 99.87%
216 | 1646 | 94.56% | 216 | 99.41%
192 | 1483 | 88.38% | 192 | 98.98%
182 | 1378 | 80.95% | 182 | 98.22%
169 | 1307 | 78.28% | 169 | 98.22%
5 Conclusion

A new model of pattern recognition principles — Biomimetic Pattern Recognition using hyper sausage neurons — is proposed in this paper. It is an entirely new theory of pattern recognition. As its basis, we have advanced a new model of the hypersurface neuron — the multi-weight vector neuron — and implemented it in the new CASSANDRA-II neurocomputer. This method of pattern recognition is more effective and efficient than traditional pattern recognition, and its applications are very broad. We applied Biomimetic Pattern Recognition to "identifying objects (such as naval ship, tank, bus, bow, horse, sheep) from different directions on a horizontal plane or at sea level", etc.; the results are much better than those of SVM.
References
[1] J.J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proc. Natl. Acad. Sci. USA, 1982, 79, 2554–2558.
[2] W.S. McCulloch and W. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, 1943, 5: 115–133.
[3] Wang Shou-jue, "Discussion on the Basic Mathematical Models of Neurons in General Purpose Neurocomputer", Acta Electronica Sinica, Vol. 29, No. 5, pp. 577–580, May 2001.
[4] Ryszard Engelking, Dimension Theory, PWN-Polish Scientific Publishers, Warszawa, 1978.
[5] Boser, B., Guyon, I. and Vapnik, V.N., "A Training Algorithm for Optimal Margin Classifiers", Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992, 144–152.
[6] Wang Shoujue, Wang Bainan, "Analysis and Theory of High-Dimension Space Geometry for Artificial Neural Networks", Acta Electronica Sinica, Vol. 30, No. 1, Jan 2002.
On Generalizing Rough Set Theory

Y.Y. Yao

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2, [email protected]
Abstract. This paper summarizes various formulations of the standard rough set theory. It demonstrates how those formulations can be adopted to develop different generalized rough set theories. The relationships between rough set theory and other theories are discussed.
1 Formulations of Standard Rough Sets
The theory of rough sets can be developed in at least two different manners, the constructive and the algebraic methods [16,17,18,19,20,25,29]. The constructive methods define rough set approximation operators using equivalence relations or their induced partitions and subsystems; the algebraic methods treat approximation operators as abstract operators.

1.1 Constructive Methods
Suppose U is a finite and nonempty set called the universe. Let E ⊆ U × U be an equivalence relation on U. The pair apr = (U, E) is called an approximation space [6,7]. A few definitions of rough set approximations can be given based on different representations of an equivalence relation. An equivalence relation E can be conveniently represented by a mapping from U to $2^U$, where $2^U$ is the power set of U. More specifically, the mapping $[\cdot]_E$ is given by:

$[x]_E = \{y \in U \mid xEy\}.$ (1)

The subset $[x]_E$ is the equivalence class containing x. The family of all equivalence classes is commonly known as the quotient set and is denoted by $U/E = \{[x]_E \mid x \in U\}$. It defines a partition of the universe, namely, a family of pairwise disjoint subsets whose union is the universe. A new family of subsets, denoted by σ(U/E), can be obtained from U/E by adding the empty set ∅ and making it closed under set union, which is a subsystem of $2^U$. In fact, it is a σ-algebra of subsets of U and a sub-Boolean algebra of the Boolean algebra $(2^U, {}^c, \cap, \cup)$. Furthermore, σ(U/E) defines uniquely a topological space (U, σ(U/E)), in which σ(U/E) is the family of all open and closed sets [6]. Under the equivalence relation, we only have a coarsened view of the universe. Each equivalence class is considered as a whole granule instead of many individuals [21]. They are considered as the basic or elementary definable, observable, or measurable subsets of the universe [7,20]. By the construction of
σ(U/E), it is also reasonable to assume that all subsets in σ(U/E) are definable. To a large extent, the standard rough set theory deals with the approximation of any subset of U in terms of definable subsets in σ(U/E). From the different representations of an equivalence relation, we obtain three constructive definitions of rough set approximations [17,19,21,27]:

Element based definition:
$\underline{apr}(A) = \{x \in U \mid [x]_E \subseteq A\} = \{x \in U \mid \forall y \in U\,[xEy \Longrightarrow y \in A]\}$,
$\overline{apr}(A) = \{x \in U \mid [x]_E \cap A \neq \emptyset\} = \{x \in U \mid \exists y \in U\,[xEy, y \in A]\}$; (2)

Granule based definition:
$\underline{apr}(A) = \bigcup\{[x]_E \mid [x]_E \in \sigma(U/E), [x]_E \subseteq A\}$,
$\overline{apr}(A) = \bigcup\{[x]_E \mid [x]_E \in \sigma(U/E), [x]_E \cap A \neq \emptyset\}$; (3)

Subsystem based definition:
$\underline{apr}(A) = \bigcup\{X \mid X \in \sigma(U/E), X \subseteq A\}$,
$\overline{apr}(A) = \bigcap\{X \mid X \in \sigma(U/E), A \subseteq X\}$. (4)
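A minimal Python sketch of the element based definition (2) on a toy approximation space may make the constructions concrete (the names and the toy universe are our own illustrative choices):

```python
def lower_upper(universe, eq_class, A):
    """Element based definition (2): lower/upper approximations of A
    under the map x -> [x]_E given by eq_class."""
    lower = {x for x in universe if eq_class[x] <= A}   # [x]_E subset of A
    upper = {x for x in universe if eq_class[x] & A}    # [x]_E meets A
    return lower, upper

# Toy approximation space: U = {0..5}, classes {0,1}, {2,3}, {4,5}.
U = set(range(6))
classes = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
eq = {x: c for c in classes for x in c}
print(lower_upper(U, eq, {0, 1, 2}))   # ({0, 1}, {0, 1, 2, 3})
```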
The three equivalent definitions offer different interpretations of rough set approximations. According to the element based definition, an element x is in the lower approximation $\underline{apr}(A)$ of a set A if all its equivalent elements are in A; the element is in the upper approximation $\overline{apr}(A)$ if at least one of its equivalent elements is in A. According to the granule based definition, $\underline{apr}(A)$ is the union of equivalence classes which are subsets of A, and $\overline{apr}(A)$ is the union of equivalence classes which have a nonempty intersection with A. According to the subsystem based definition, $\underline{apr}(A)$ is the largest definable set in the subsystem σ(U/E) that is contained in A, and $\overline{apr}(A)$ is the smallest definable set in σ(U/E) that contains the set A. One may interpret $\underline{apr}, \overline{apr}: 2^U \longrightarrow 2^U$ as two unary set-theoretic operators called approximation operators. The system $(2^U, {}^c, \underline{apr}, \overline{apr}, \cap, \cup)$ is called a rough set algebra [16]. It is an extension of the set algebra $(2^U, {}^c, \cap, \cup)$ with added operators. The lower and upper approximation operators have the following properties:

(i) $\underline{apr}(A) = (\overline{apr}(A^c))^c$, $\overline{apr}(A) = (\underline{apr}(A^c))^c$;
(ii) $\underline{apr}(U) = U$, $\overline{apr}(\emptyset) = \emptyset$;
(iii) $\underline{apr}(\emptyset) = \emptyset$, $\overline{apr}(U) = U$;
(iv) $\underline{apr}(A \cap B) = \underline{apr}(A) \cap \underline{apr}(B)$, $\overline{apr}(A \cup B) = \overline{apr}(A) \cup \overline{apr}(B)$.
Property (i) states that the approximation operators are dual operators with respect to set complement $^c$. By properties (ii) and (iii), the approximations of both the universe U and the empty set ∅ are themselves. Property (iv) states that the lower approximation operator is distributive over set intersection ∩, and the upper approximation operator is distributive over set union ∪. Additional properties of approximation operators are summarized below, using the same labelling system as in modal logic [2,18,25,29]:

(D) $\underline{apr}(A) \subseteq \overline{apr}(A)$;
(T) $\underline{apr}(A) \subseteq A$, (T′) $A \subseteq \overline{apr}(A)$;
(B) $A \subseteq \underline{apr}(\overline{apr}(A))$, (B′) $\overline{apr}(\underline{apr}(A)) \subseteq A$;
(4) $\underline{apr}(A) \subseteq \underline{apr}(\underline{apr}(A))$, (4′) $\overline{apr}(\overline{apr}(A)) \subseteq \overline{apr}(A)$;
(5) $\overline{apr}(A) \subseteq \underline{apr}(\overline{apr}(A))$, (5′) $\overline{apr}(\underline{apr}(A)) \subseteq \underline{apr}(A)$.
They follow from the definition of approximation operators. Based on the three definitions, it is possible to investigate the connections between rough sets and other theories [18]. The element based definition relates approximation operators to the necessity and possibility operators of modal logic [25]. The granule based definition relates rough sets to granular computing [21]. The subsystem based definition relates approximation operators to interior and closure operators of topological spaces [12], and closure operators of closure systems [20]. Furthermore, they can be used to show the connection between rough set theory and belief functions [11,26].

1.2 Algebraic Methods
Algebraic methods focus on the algebraic system $(2^U, {}^c, L, H, \cap, \cup)$ without direct reference to equivalence relations, where L and H are two abstract unary operators called approximation operators [5,15,17]. Additional operators have been introduced to characterize the approximation operators [4,15,18]. The connections between constructive and algebraic methods can be established by stating axioms on L and H under which there exists an equivalence relation producing the same approximation operators [5,16]. The main result can be stated as follows [5,16]. Suppose L and H are a pair of dual unary operators on $2^U$. There exists an equivalence relation E on U such that $L(A) = \underline{apr}_E(A)$ and $H(A) = \overline{apr}_E(A)$ if and only if L and H satisfy axioms (iv), (T), (B), and (4). The equivalence relation is defined by $[x]_E = H(\{x\})$. Consider now three additional operators.

Upper approximation distribution: Suppose that property (iv) holds for approximation operators L and H. For any subset A ⊆ U we have $H(A) = \bigcup_{x \in A} H(\{x\})$. By setting $h(x) = H(\{x\})$, we obtain an operator from the universe to the power set of the universe, namely, $h: U \longrightarrow 2^U$. By definition, this mapping is called an upper approximation distribution, and the upper approximation can be calculated by $H(A) = \bigcup_{x \in A} h(x)$. The lower approximation operator can be defined through duality. In the standard rough set model, the upper approximation distribution is given by $h(x) = [x]_E$.
Basic mapping: The Boolean algebra $(2^U, {}^c, \cap, \cup)$ is an atomic Boolean algebra whose atoms are the singleton subsets of U. Let $A(2^U)$ be the set of all atoms. The atom {x} can be identified with the element x of U. The equivalence relation induced mapping $[\cdot]_E$ can be identified with a basic mapping, $\varphi: A(2^U) \longrightarrow 2^U$. From the element based definition we have [4]:

Atom based definition:
$L(A) = \bigcup\{a \mid a \in A(2^U), \varphi(a) \subseteq A\}$,
$H(A) = \bigcup\{a \mid a \in A(2^U), \varphi(a) \cap A \neq \emptyset\}$. (5)

An advantage of this definition is that all entities under consideration are elements of the power set $2^U$. Conversely, from the approximation operators we have:

$\varphi(\{x\}) = \{y \mid y \in U, x \in H(\{y\})\}.$ (6)
In the standard rough set model, the basic mapping is given by $\varphi_E(\{x\}) = [x]_E$, which is essentially the same as the upper approximation distribution. If an arbitrary binary relation is used, this observation is no longer true.

Basic set assignment: Suppose a pair of approximation operators satisfies axioms (i)-(iv). One can define a mapping, $m: 2^U \longrightarrow 2^U$, called a basic set assignment, as follows:

$m(A) = L(A) - \bigcup_{B \subset A} L(B).$ (7)

The basic set assignment satisfies the following axioms:

(m1) $\bigcup_{A \subseteq U} m(A) = U$, (m2) $A \neq B \Longrightarrow m(A) \cap m(B) = \emptyset$.

The approximation operators can be obtained by:

$L(A) = \bigcup_{B \subseteq A} m(B), \qquad H(A) = \bigcup_{B \cap A \neq \emptyset} m(B).$ (8)
The connection between the basic mapping and the basic set assignment is given by:

$m(A) = \{x \mid x \in U, \varphi(\{x\}) = A\}, \qquad \varphi(\{x\}) = A \Longleftrightarrow x \in m(A).$ (9)
In the standard rough set model, we have m([x]E ) = [x]E for equivalence classes of E, and m(A) = ∅ for all other subsets of U .
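The following brute-force Python sketch (our own illustrative code, practical only for tiny universes) computes the basic set assignment of Eq. (7) from a given lower approximation operator L:

```python
from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def basic_set_assignment(U, L):
    """Eq. (7): m(A) = L(A) minus the union of L(B) over proper subsets B of A.
    L maps a frozenset to a set; enumerates all 2^|U| subsets of U."""
    m = {}
    for A in subsets(U):
        covered = set().union(*[set(L(B)) for B in subsets(A) if B != A])
        m[A] = set(L(A)) - covered
    return m
```

In the standard model, taking L to be the lower approximation returns $m([x]_E) = [x]_E$ on equivalence classes and ∅ elsewhere, matching the remark above.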
2 Generalized Rough Sets
The theory of rough sets can be generalized in several directions. Within the set-theoretic framework, generalizations of the element based definition can be obtained by using non-equivalence binary relations [9,17,18,25,29], generalizations of the granule based definition can be obtained by using coverings [9,14,19,21,30], and generalizations of the subsystem based definition can be obtained by using other subsystems [20,27]. Based on the fact that the system $(2^U, {}^c, \cap, \cup)$ is a Boolean algebra, one can generalize rough set theory to other algebraic systems such as Boolean algebras, lattices, and posets [4,18,20]. The subsystem based definition and algebraic methods are useful for such generalizations.
2.1 Rough Set Approximations Using Non-equivalence Relations
Let R ⊆ U × U be a binary relation on the universe, which defines a generalized approximation space apr = (U, R). Given two elements x, y ∈ U , if xRy, we say that y is R-related to x, x is a predecessor of y, and y is a successor of x. For an element x ∈ U , its successor neighborhood is given by [25]: Rs (x) = {y | y ∈ U, xRy}.
(10)
With respect to the element based definition, we define a pair of lower and upper approximations by replacing the equivalence class $[x]_E$ with the successor neighborhood $R_s(x)$:

$\underline{apr}_R(A) = \{x \mid R_s(x) \subseteq A\}, \qquad \overline{apr}_R(A) = \{x \mid R_s(x) \cap A \neq \emptyset\}.$ (11)
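Continuing the earlier toy sketch, definition (11) needs only the successor neighborhoods in place of equivalence classes (again, hypothetical names of our own):

```python
def lower_upper_rel(universe, R, A):
    """Eqs. (10)-(11): approximations from successor neighborhoods R_s(x),
    where R is a set of ordered pairs (x, y)."""
    Rs = {x: {y for y in universe if (x, y) in R} for x in universe}
    lower = {x for x in universe if Rs[x] <= A}
    upper = {x for x in universe if Rs[x] & A}
    return lower, upper
```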
The basic mapping is given by $\varphi(\{x\}) = R_s(x)$ and the basic set assignment is given by $m(A) = \{x \mid R_s(x) = A\}$. The connection between the constructive and algebraic methods with respect to non-equivalence relations can be stated as follows [16,18]. Suppose L and H are a pair of dual operators satisfying axioms (i)-(iv); there exists a serial, a reflexive, a symmetric, a transitive, and a Euclidean binary relation, respectively, on U such that $L(A) = \underline{apr}_R(A)$ and $H(A) = \overline{apr}_R(A)$ if and only if L and H satisfy axioms (D), (T), (B), (4), and (5), respectively. The binary relation is defined by $R_s(x) = \{y \mid x \in H(\{y\})\}$.

2.2 Rough Set Approximations Using Coverings
A covering of a universe U is a family of subsets of the universe such that their union is the universe. By allowing nonempty overlap of two subsets, a covering is a natural generalization of a partition. The granule based definition can be used to generalize approximation operators. Let C be a covering of the universe U. By replacing U/E with C and equivalence classes with subsets in C in the granule based definition, one immediately obtains a pair of approximation operators [30]. However, they are not a pair of dual operators. To resolve this problem, one may extend the granule based definition in two ways: either the lower or the upper approximation operator is extended, and the other one is defined by duality. The results are two pairs of dual approximation operators [19]. The first pair extends the lower approximation operator:

$\underline{apr}_C(A) = \bigcup\{X \mid X \in C, X \subseteq A\} = \{x \in U \mid \exists X \in C\,[x \in X, X \subseteq A]\}$,
$\overline{apr}_C(A) = (\underline{apr}_C(A^c))^c = \{x \in U \mid \forall X \in C\,[x \in X \Longrightarrow X \cap A \neq \emptyset]\}$;

the second pair extends the upper approximation operator:

$\underline{apr}'_C(A) = (\overline{apr}'_C(A^c))^c = \{x \in U \mid \forall X \in C\,[x \in X \Longrightarrow X \subseteq A]\}$,
$\overline{apr}'_C(A) = \bigcup\{X \mid X \in C, X \cap A \neq \emptyset\} = \{x \in U \mid \exists X \in C\,[x \in X, X \cap A \neq \emptyset]\}$.

In general, the two pairs of approximation operators are different. More specifically, $(\underline{apr}_C, \overline{apr}_C)$ satisfies axioms (i)-(iii), (T), and (4); $(\underline{apr}'_C, \overline{apr}'_C)$ satisfies axioms (i)-(iv) and (B). Given a reflexive binary relation R, the family of successor neighborhoods induces a covering of the universe denoted by $U/R = \{R_s(x) \mid x \in U\}$. Approximation operators defined by using U/R and by using R directly are different.

2.3 Rough Set Approximations Using Subsystems
In the standard rough set model, the same subsystem is used to define the lower and upper approximation operators. When generalizing the subsystem based definition, one may use two subsystems, one for the lower approximation operator and the other for the upper approximation operator.

Rough Set Approximations in Topological Spaces. Let (U, O(U)) be a topological space, where O(U) ⊆ $2^U$ is a family of subsets of U called open sets. The family of open sets contains ∅ and U, and is closed under union and finite intersection. The family of all closed sets C(U) = {¬X | X ∈ O(U)} contains ∅ and U, and is closed under intersection and finite union. A pair of generalized approximation operators can be defined by replacing σ(U/E) with O(U) for the lower approximation operator, and with C(U) for the upper approximation operator. In this case, the approximation operators are in fact the topological interior and closure operators, characterized by axioms (i), (ii), (iv), (T) and (4).

Rough Set Approximations in Closure Systems. A family C(U) of subsets of U is called a closure system if it contains U and is closed under intersection [3]. By collecting the complements of members of C(U), we obtain another system O(U) = {¬X | X ∈ C(U)}. According to the properties of C(U), the system O(U) contains the empty set ∅ and is closed under union. We define a pair of approximation operators in a closure system by replacing σ(U/E) with O(U) for the lower approximation operator, and with C(U) for the upper approximation operator. They satisfy axioms (iii), (T) and (4).

Rough Set Approximations in Boolean Algebras, Lattices, and Posets. Recall that $(2^U, {}^c, \cap, \cup)$ is a Boolean algebra, and σ(U/E) is a sub-Boolean algebra. One can immediately generalize rough set theory to a Boolean algebra (B, ¬, ∧, ∨) by using the subsystem based definition. In this case, we can replace U with the maximum element 1, ∅ with the minimum element 0, set complement $^c$ with Boolean algebra complement ¬, set intersection ∩ with meet ∧, and set union ∪ with join ∨. The resulting algebra is known as a Boolean algebra with added operators [10,16,18]. In particular, one can define different subsystems corresponding to the previously discussed standard rough set model, topological rough set model, and closure system rough set model [20]. In an atomic Boolean algebra, one can also generalize rough set theory by using the atom based definition through the basic mapping ϕ. By imposing different axioms on ϕ, one can derive various Boolean algebras with added operators [4]. One may further generalize rough set theory by using lattices and posets [1,4,20]. The crucial point is the design of a subsystem which makes the subsystem based definition applicable.
3 Concluding Remarks
We have discussed research results and directions for generalizing the theory of rough sets. The theory is developed using both constructive and algebraic methods, and their connections are established. In the constructive framework, three definitions of approximation operators are examined: the element based, the granule based, and the subsystem based definitions. The element based definition enables us to generalize the theory with non-equivalence relations. The granule based definition can be used to generalize the theory with coverings. The subsystem based definition can be used to generalize the theory in many algebraic systems. In comparison, algebraic methods are more broadly applicable and can be used to generalize the theory in a unified manner. We restrict our discussion to the operator oriented view of rough set theory by treating the lower and upper approximations as a pair of unary set-theoretic operators. There are many other views of rough set theory [16]. Many important generalizations of the theory, such as probabilistic and decision theoretic rough sets [13,22,23,28] and rough membership functions [8,24], although not mentioned in this paper, need further investigation.
References
1. Cattaneo, G. Abstract approximation spaces for rough theories, in: Rough Sets in Knowledge Discovery, Polkowski, L. and Skowron, A. (Eds.), Physica-Verlag, Heidelberg, 59–98, 1998.
2. Chellas, B.F. Modal Logic: An Introduction, Cambridge University Press, Cambridge, 1980.
3. Cohn, P.M. Universal Algebra, Harper and Row Publishers, New York, 1965.
4. Järvinen, J. On the structure of rough approximations, Proceedings of the Third International Conference on Rough Sets and Current Trends in Computing, LNAI 2475, 123–130, 2002.
5. Lin, T.Y. and Liu, Q. Rough approximate operators: axiomatic rough set theory, in: Rough Sets, Fuzzy Sets and Knowledge Discovery, Ziarko, W.P. (Ed.), Springer-Verlag, London, 256–260, 1994.
6. Pawlak, Z. Rough sets, International Journal of Computer and Information Sciences, 11, 341–356, 1982.
7. Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Boston, 1991.
8. Pawlak, Z. and Skowron, A. Rough membership functions, in: Advances in the Dempster-Shafer Theory of Evidence, Yager, R.R., Fedrizzi, M. and Kacprzyk, J. (Eds.), John Wiley and Sons, New York, 251–271, 1994.
9. Pomykala, J.A. Approximation operations in approximation space, Bulletin of the Polish Academy of Sciences, Mathematics, 35, 653–662, 1987.
10. Rasiowa, H. An Algebraic Approach to Non-classical Logics, North-Holland, Amsterdam, 1974.
11. Skowron, A. and Grzymala-Busse, J. From rough set theory to evidence theory, in: Advances in the Dempster-Shafer Theory of Evidence, Yager, R.R., Fedrizzi, M. and Kacprzyk, J. (Eds.), Wiley, New York, 193–236, 1994.
12. Wiweger, A. On topological rough sets, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 89–93, 1989.
13. Wong, S.K.M. and Ziarko, W. Comparison of the probabilistic approximate classification and the fuzzy set model, Fuzzy Sets and Systems, 21, 357–362, 1987.
14. Wybraniec-Skardowska, U. On a generalization of approximation space, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 51–61, 1989.
15. Wybraniec-Skardowska, U. Unit operations, Zeszyty Naukowe Wyzszej Szkoly Pedagogicznej im Powstancow Slaskich w Opolu, Matematyka, XXVII, 113–129, 1992.
16. Yao, Y.Y. Two views of the theory of rough sets in finite universes, International Journal of Approximate Reasoning, 15, 291–317, 1996.
17. Yao, Y.Y. Constructive and algebraic methods of the theory of rough sets, Information Sciences, 109, 21–47, 1998.
18. Yao, Y.Y. Generalized rough set models, in: Rough Sets in Knowledge Discovery, Polkowski, L. and Skowron, A. (Eds.), Physica-Verlag, Heidelberg, 286–318, 1998.
19. Yao, Y.Y. Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences, 111, 239–259, 1998.
20. Yao, Y.Y. On generalizing Pawlak approximation operators, Proceedings of the First International Conference, RSCTC'98, LNAI 1424, 298–307, 1998.
21. Yao, Y.Y. Information granulation and rough set approximation, International Journal of Intelligent Systems, 16, 87–104, 2001.
22. Yao, Y.Y. Information granulation and approximation in a decision-theoretic model of rough sets, manuscript, 2002.
23. Yao, Y.Y. Probabilistic approaches to rough sets, manuscript, 2002.
24. Yao, Y.Y. Semantics of fuzzy sets in rough set theory, manuscript, 2002.
25. Yao, Y.Y. and Lin, T.Y. Generalization of rough sets using modal logic, Intelligent Automation and Soft Computing, An International Journal, 2, 103–120, 1996.
26. Yao, Y.Y. and Lingras, P.J. Interpretations of belief functions in the theory of rough sets, Information Sciences, 104, 81–106, 1998.
27. Yao, Y.Y. and Wang, T. On rough relations: an alternative formulation, Proceedings of the Seventh International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, LNAI 1711, 82–90, 1999.
28. Yao, Y.Y. and Wong, S.K.M. A decision theoretic framework for approximating concepts, International Journal of Man-Machine Studies, 37, 793–809, 1992.
29. Yao, Y.Y., Wong, S.K.M. and Lin, T.Y. A review of rough set models, in: Rough Sets and Data Mining: Analysis for Imprecise Data, Lin, T.Y. and Cercone, N. (Eds.), Kluwer Academic Publishers, Boston, 47–75, 1997.
30. Zakowski, W. Approximations in the space (U, Π), Demonstratio Mathematica, XVI, 761–769, 1983.
Dual Mathematical Models Based on Rough Approximations in Data Analysis

Hideo Tanaka

Faculty of Human and Social Environment, Hiroshima International University, 555-36 Gakuendai, Kurose, Hiroshima 724-0695, Japan, [email protected]
Abstract. In the rough set approach, the rough approximations called the lower and upper approximations have been discussed. This concept can be extended into a new research field of data analysis. The proposed approach to data modeling is to obtain dual mathematical models by using a concept similar to rough sets. The dual models, called the lower and upper models, have an inclusion relation. In other words, the proposed method can be described as two approximations to a phenomenon under consideration such that Lower Model ⊆ Phenomenon ⊆ Upper Model. Thus, the lower and upper models are obtained as the greatest lower bound and the least upper bound, respectively. This property is illustrated by interval regression models, which are not crisp but have an interval relationship between inputs and outputs. The given phenomenon can be expressed by the pair (lower model, upper model) corresponding to rough approximations.
1 Introduction
In rough sets as proposed by Z. Pawlak [1], the lower and upper approximations of given objects are derived from our knowledge base, which is a decision table. Indiscernibility plays an important role in the rough set approach as the granularity of knowledge. In this paper, the rough approximations, denoted as a pair of lower and upper approximations, are extended into dual models, called the lower and upper models, in data analysis. The proposed method can be described as two approximations to a phenomenon under consideration such that

Lower Model ⊆ Phenomenon ⊆ Upper Model.
(1)
Thus, the lower and upper models are obtained as the greatest lower bound and the least upper bound, respectively. This property is illustrated by interval regression models. Interval regression models have an interval relationship between inputs and outputs that can be regarded as an indiscernibility relationship. Therefore we can obtain granular knowledge as an interval relationship. In other words, a phenomenon in an uncertain environment can be approximated by lower and upper models, called dual models. It should be noted that two models are dual if and only if the lower model can be defined from the upper model and vice versa. Using the two concepts mentioned above, we have studied interval regression [2,3,4], interval AHP (Analytic Hierarchy Process) [5,6], identification of dual possibility distributions [7,8], and interval DEA (Data Envelopment Analysis) [9]. In these studies, we have obtained the lower and upper models as rough approximations such that (1) holds. An upper model always exists for any data structure, while a solution for the lower model is not assured if the assumed model cannot express the data structure. If we cannot obtain the lower model, the cause may be the adoption of a mathematical model that does not fit the given data. If there is a lower model, we say that the given data can be roughly identified by the dual models. This is quite similar to the concept of roughly definable sets [10]. We also introduce a measure of fitness to gauge the degree of approximation based on the dual models, which is also similar to the accuracy measure [10]. In this paper, the interval regression model is used to illustrate the properties mentioned above. Lastly, we compare the properties obtained in interval regression with those of the rough set approach.
2 Interval Regression Analysis

The given j-th input vector is denoted as $x_j = (1, x_{j1}, \cdots, x_{jn})^t$ and the given j-th output is an interval denoted as $Y_j = (y_j, e_j)$, where $j = 1, \cdots, p$, $y_j$ is the center of the interval $Y_j$, and $e_j$ is its radius, called a width. An interval $Y_j$ can be rewritten as $Y_j = [y_j - e_j, y_j + e_j]$. An interval linear model is assumed to be

$Y = A_0 + A_1 x_1 + \cdots + A_n x_n = A x$ (2)

where $A = (A_0, \cdots, A_n)$ is an interval coefficient vector, $A_i$ is an interval denoted as $A_i = (a_i, c_i)$, and Y is an interval output. Using interval arithmetic, (2) can be written as

$Y(x_j) = (a_0, c_0) + (a_1, c_1)x_{j1} + \cdots + (a_n, c_n)x_{jn} = (a^t x_j, c^t |x_j|)$ (3)

where $a = (a_0, \cdots, a_n)^t$, $c = (c_0, \cdots, c_n)^t$, $|x_j| = (1, |x_{j1}|, \cdots, |x_{jn}|)^t$. It should be noted that the estimated interval output can be represented by the center $a^t x_j$ and the width $c^t |x_j|$. The given input-output data are

$(x_j, Y_j) = (1, x_{j1}, \cdots, x_{jn}, Y_j), \quad j = 1, \cdots, p$
(4)
where $Y_j = (y_j, e_j)$. Since we consider the inclusion relation between the given interval output $Y_j$ and the estimated interval $Y(x_j)$, let us define the inclusion relation of two intervals. The inclusion relation of two intervals $A_1 = (a_1, c_1)$ and $A_2 = (a_2, c_2)$ is defined as follows:

$A_1 \subseteq A_2 \iff a_1 - c_1 \geq a_2 - c_2 \ \text{and} \ a_1 + c_1 \leq a_2 + c_2$ (5)

Using this inclusion relation, let us define the lower and upper models as follows. The lower model is denoted as

$Y_*(x_j) = A_{*0} + A_{*1} x_{j1} + \cdots + A_{*n} x_{jn}$
(6)
where $A_{*i} = (a_{*i}, c_{*i})$, $i = 0, \cdots, n$. In the lower model, it is assumed that the estimated interval $Y_*(x_j)$ should be included in the given output $Y_j$, that is,

$Y_*(x_j) \subseteq Y_j.$ (7)

Using the definition of the inclusion relation (5), (7) can be rewritten as

$y_j - e_j \leq a_*^t x_j - c_*^t |x_j|, \qquad a_*^t x_j + c_*^t |x_j| \leq y_j + e_j,$ (8)

which can be regarded as the constraint conditions. Consider the optimization problem in which the sum of the widths of the estimated intervals $Y_*(x_j)$ is maximized subject to the constraint conditions (8). Therefore the objective function is written as

$\max_{a_*, c_*} J_* = \sum_{j=1}^{p} c_*^t |x_j|.$ (9)
This optimization problem becomes the following LP problem:

$\max_{a_*, c_*} J_* = \sum_{j=1}^{p} c_*^t |x_j|$ (10)

s.t. $y_j - e_j \leq a_*^t x_j - c_*^t |x_j|$, $a_*^t x_j + c_*^t |x_j| \leq y_j + e_j$, $c_* \geq 0$ $(j = 1, \cdots, p)$.

If (10) has a solution, the lower model can be obtained by solving it. The lower model is obtained as the greatest lower bound in the sense of the inclusion relation. This model corresponds to the lower approximation in rough sets. The upper model is denoted as

$Y^*(x_j) = A^*_0 + A^*_1 x_{j1} + \cdots + A^*_n x_{jn}$
(11)
where $A^*_i = (a^*_i, c^*_i)$, $i = 0, \cdots, n$. In the upper model, it is assumed that the estimated interval $Y^*(x_j)$ should include the given output $Y_j$, that is,

$Y^*(x_j) \supseteq Y_j.$ (12)

Using the definition of the inclusion relation (5), (12) can be rewritten as

$a^{*t} x_j - c^{*t} |x_j| \leq y_j - e_j, \qquad y_j + e_j \leq a^{*t} x_j + c^{*t} |x_j|,$ (13)

which can be regarded as the constraint conditions. Consider the optimization problem in which the sum of the widths of the estimated intervals $Y^*(x_j)$ is minimized subject to the constraints (13). Therefore the objective function is written as

$\min_{a^*, c^*} J^* = \sum_{j=1}^{p} c^{*t} |x_j|.$ (14)
This optimization problem becomes the following LP problem:

$\min_{a^*, c^*} J^* = \sum_{j=1}^{p} c^{*t} |x_j|$ (15)

s.t. $a^{*t} x_j - c^{*t} |x_j| \leq y_j - e_j$, $y_j + e_j \leq a^{*t} x_j + c^{*t} |x_j|$, $c^* \geq 0$ $(j = 1, \cdots, p)$.

A solution of (15) always exists, because the constraint conditions (13) admit a feasible set if a sufficiently large positive vector is taken for $c^*$. The upper model is obtained as the least upper bound in the sense of the inclusion relation. This model corresponds to the upper approximation in rough sets. Since the lower and upper models can be obtained by solving the LP problems (10) and (15), it is very easy to obtain the two models, provided the lower model (10) has a solution. For a solution $(a_{*i}, c_{*i})$ of the lower model to exist, some consistency between the given data structure and the assumed interval model is necessary. It follows from the constraint conditions (7) and (12) that

$Y_*(x_j) \subseteq Y_j \subseteq Y^*(x_j), \quad j = 1, \cdots, p$
(16)
which is illustrated in Figure 1. However, for a new sample vector $x'$ it is not guaranteed that

$Y_*(x') \subseteq Y^*(x').$ (17)

Therefore, so that relation (17) holds for any $x'$, let us consider the integrated model into which the lower and upper models are combined, as follows.
Fig. 1. Inclusion relations between the given interval output and the estimated intervals

$\min_{a_*, c_*, a^*, c^*}\ \sum_{j=1}^{p} c^{*t} |x_j| - \sum_{j=1}^{p} c_*^t |x_j|$ (18)

s.t. $Y^*(x_j) \supseteq Y_j$, $Y_*(x_j) \subseteq Y_j$,
$a_{*i} + c_{*i} \leq a^*_i + c^*_i$, $a^*_i - c^*_i \leq a_{*i} - c_{*i}$,
$c_{*i}, c^*_i \geq 0$, $i = 0, \cdots, n$.

Since $A^*_i \supseteq A_{*i}$ ($i = 0, \cdots, n$) are added to the constraint conditions in (18), relation (17) always holds for any $x'$. The given data structure can be approximated by the dual models $(Y_*(x), Y^*(x))$. In order to show the validity of the above formulations [2], assume that the given data $(x^o_j, y^o_j)$, $j = 1, \cdots, p$, satisfy the interval linear system

$Y^o(x^o_j) = A^o_0 + A^o_1 x_{j1} + \cdots + A^o_n x_{jn} = A^o x_j.$
(19)
Theorem 1. If the given data $(x^o_j, y^o_j)$, $j = 1, \cdots, p$, satisfy (19), then the interval vectors $A_*$ and $A^*$ obtained from (10) and (15), respectively, are the same as $A^o$. Thus, we have

$A_* = A^* = A^o, \qquad Y_*(x) = Y^*(x) = Y^o(x).$
(20)
This theorem means that interval regression techniques can recover the true interval coefficients if the given data satisfy (19). A solution of the lower model is not always guaranteed, because we may fail to assume a proper regression model for the given data structure. In the case of no solution for a linear system in the lower model, we can take the following polynomials:

$Y(x) = A_0 + \sum_i A_i x_i + \sum_{i,j} A_{ij} x_i x_j + \sum_{i,j,k} A_{ijk} x_i x_j x_k + \cdots.$ (21)
Since a polynomial such as (21) can represent any function, the center of the estimated interval Y(x) in the lower model can hit the center of the given output $Y_j$. Thus, one can select a polynomial as a lower model by increasing the number of terms of the polynomial (21) until a solution is found. It should be noted that (21) can be considered as a linear system with respect to its parameters. Thus, we can use the proposed models with no difficulty to obtain the parameters in (21). The existence of the lower model means that the assumed model is somewhat reliable. Thus, the measure of fitness for the j-th data point, denoted as $\varphi_Y(x_j)$, can be introduced as

$\varphi_Y(x_j) = \frac{c_*^t |x_j|}{c^{*t} |x_j|}$ (22)

which corresponds to the accuracy measure in rough sets [10]. Then the measure of fitness for all data, denoted as $\varphi_Y$, can be defined as

$\varphi_Y = \frac{\sum_j \varphi_Y(x_j)}{p}$ (23)

where $0 \leq \varphi_Y \leq 1$. The larger the value of $\varphi_Y$, the better the model fits the data. If the given data satisfy the linear system (19), the measure of fitness $\varphi_Y$ becomes 1. Note that $\varphi_Y$ is the average ratio of lower spread to upper spread over the p data.
3 Numerical Example
Let us consider the following example to illustrate interval regression. The data are shown in Table 1.

Table 1. The given data in Example 1

No. (j)    1       2       3       4       5       6       7       8
Input (x)  1       2       3       4       5       6       7       8
Output (y) [14,16] [16,19] [18,21] [19,21] [17,20] [17,19] [18,20] [20,25]

Assume that $Y(x) = A_0 + A_1 x$. We find that there is no solution for the lower model. Then, by increasing the number of terms of the polynomial, we find a solution for the lower model with the following polynomial: $Y(x) = A_0 + A_1 x + A_2 x^2 + A_3 x^3$. Thus we obtained the lower and upper models by solving the integrated model (18) as follows:

$Y_*(x) = (7.4894, 0.5356) + (9.2274, 0)x - (2.1884, 0)x^2 + (0.1598, 0.0003)x^3$
$Y^*(x) = (6.9150, 1.8850) + (9.2274, 0)x - (2.1884, 0)x^2 + (0.1598, 0.0012)x^3.$

The given intervals $Y_j$ and the estimated intervals $Y_*(x)$, $Y^*(x)$ are shown in Figure 2, where the dotted line denotes the lower model, the solid line denotes the upper model, and the vertical traces denote the given intervals.
Fig. 2. The estimated intervals in the lower and upper models and the given intervals in Numerical Example
4 Comparison between Interval Regression and Rough Sets Concept
Let us briefly review the elementary terms used in rough sets [10]. U denotes the universe of discourse, and R denotes an equivalence relation on U considered as an indiscernibility relation. Then the ordered pair A = (U, R) is called an approximation space. Equivalence classes of the relation R are called elementary sets in A, denoted as $\{E_j, j = 1, \cdots, n\}$. Let a set X ⊂ U be given. Then the upper approximation of X in A, denoted as $A^*(X)$, means the least definable set containing X, and the lower approximation of X in A, denoted as $A_*(X)$, means the greatest definable set contained in X. Thus, the inclusion relation $A_*(X) \subseteq A^*(X)$ is satisfied. An accuracy measure of a set X in the approximation space A = (U, R) is defined as

$\alpha_A(X) = \frac{Card(A_*(X))}{Card(A^*(X))}$ (24)

where $Card(A_*(X))$ is the cardinality of $A_*(X)$. This accuracy measure $\alpha_A(X)$ corresponds well to the measure of fitness $\varphi_Y(x_j)$ in interval regression analysis. When a classification $C(U) = \{X_1, \cdots, X_n\}$ is given, the accuracy of the classification C(U) is defined as

$\beta_A(U) = \frac{Card(\bigcup A_*(X_j))}{Card(\bigcup A^*(X_j))}$ (25)

which corresponds to $\varphi_Y$ in interval regression. Furthermore, the concept of adopting a polynomial (21) as an interval regression model corresponds to refining the elementary sets in rough sets. Table 2 shows the comparison of the concepts between interval regression and rough sets.
Table 2. Comparison of the concepts between interval regression and rough sets

Interval regression analysis | Rough sets
Upper estimation model: $Y^*(x)$ | Upper approximation: $A^*(X)$
Lower estimation model: $Y_*(x)$ | Lower approximation: $A_*(X)$
Spread of $Y^*$: $c^{*t}|x|$ | Cardinality of $A^*(X)$: $Card(A^*(X))$
Spread of $Y_*$: $c_*^t|x|$ | Cardinality of $A_*(X)$: $Card(A_*(X))$
Inclusion relation: $Y^*(x_j) \supseteq Y_*(x_j)$ | Inclusion relation: $A^*(X) \supseteq A_*(X)$
Measure of fitness for the j-th input: $\varphi_Y(x_j)$ | Accuracy measure of $X_j$: $\alpha_A(X_j)$
Measure of fitness for all data: $\varphi_Y$ | Accuracy measure of classification: $\beta_A(U)$
(The higher, the better.) | (The higher, the better.)
5 Concluding Remarks
It can be said that the concept of the rough set approach has been applied to mathematical models, and rough approximation has been extended to the space of mathematical models. Our approach is to represent an uncertain phenomenon by dual models that have inclusion relations. In order to analyze complex phenomena, we will need the proposed approach in the near future. This approach is quite distinctive within rough set studies, so criticism, comments, and suggestions on dual mathematical models are welcome.
References
1. Pawlak, Z.: Rough Sets, Kluwer Academic, Dordrecht (1991).
2. Tanaka, H., Guo, P.: Possibilistic Data Analysis for Operations Research, Physica-Verlag, Heidelberg (1999).
3. Tanaka, H., Lee, H.: Interval regression analysis by quadratic programming approach, IEEE Trans. on Fuzzy Systems 6 6 (1998) 473–481.
4. Tanaka, H., Lee, H.: Interval regression analysis with polynomials and its similarity to rough sets concept, Fundamenta Informaticae, IOS Press, 37 (1999) 71–87.
5. Sugihara, K., Tanaka, H.: Interval evaluations in the analytic hierarchy process by possibility analysis, An Int. J. of Computational Intelligence 17 3 (2001) 567–579.
6. Sugihara, K., Ishii, H., Tanaka, H.: On interval AHP, 4th Asian Fuzzy Systems Symposium – Proceedings of AFSS2000 – (2000) 251–254.
7. Guo, P., Tanaka, H., Zimmermann, H.J.: Upper and lower possibility distributions of fuzzy decision variables in upper level decision problems, Int. J. of Fuzzy Sets and Systems 111 (1999) 71–79.
8. Guo, P., Tanaka, H.: Possibilistic data analysis and its application to portfolio selection problems, Fuzzy Economic Review 3/2 (1998) 3–23.
9. Entani, T., Maeda, Y., Tanaka, H.: Dual models of interval DEA and its extension to interval data, European J. of Operational Research 136 (2002) 32–45.
10. Pawlak, Z.: Rough Classification, Int. J. of Man-Machine Studies 20 (1984) 469–483.
Knowledge Theory: Outline and Impact*

Y.X. Zhong

University of Posts & Telecom, Beijing 100876, China, [email protected]
Abstract. Knowledge is widely regarded as the most valuable wealth, yet no systematic theory of knowledge has existed up to the present time. An attempt is thus made in this paper to present the results of a preliminary study of Knowledge Theory. The results reported here contain three parts: the fundamentals of the theory; the main body of the theory – the mechanism of knowledge creation from information and the mechanism of intelligence creation from knowledge; and the impact that Knowledge Theory may have.
1 Introduction

Information, knowledge, and intelligence are recognized as a trinity in intelligence science, in which information is the raw resource describing problems in the real world, knowledge is the product abstracted from information processing, and intelligence is the ability to solve problems by using information and knowledge. More than 50 years ago, Claude E. Shannon published a paper titled "A Mathematical Theory of Communication" [1], which formed a statistical theory of communication and was later named information theory. This opened up a new era of information research with wide implications for many branches of modern science. In 1956, a conference on the theme of simulating intelligence by computers was held at Dartmouth College; since then, Artificial Intelligence has been considered born [2]. The establishment of intelligence theory offered mankind a great hope that it may be possible to make machines intelligent. Both information theory and intelligence theory have made progress since their appearance. Yet the absence of a knowledge theory has been a constraint limiting the further development of the trinity of intelligence science. Making an effort to establish a systematic theory of knowledge has thus become an urgent task.
2 Concepts and Definitions

Definition 2-1. Information concerning any object is something able to show the possible states of the object and the manner in which the states of the object may vary [3].

_______________________________
* The work was supported by a grant from the Natural Science Foundation of China (Grant No. 69982001).
Definition 2-2. Knowledge concerning a category of objects is something able to show the possible states of the objects and the law with which the states of the objects may vary.

Definition 2-3. For a given problem, its environment, and the goal that is sought through the problem solving, intelligence can be defined as the ability to successfully perform the following functions:
– information acquisition, purposively;
– knowledge creation, properly, from the information acquired;
– strategy formation, reasonably, based on the knowledge created and the goal;
– strategy execution, effectively, for solving the problem under the environment and achieving the goal.

It is clear from Definition 2-3 that it would be impossible for intelligence theory to work effectively without a thorough understanding of information theory and knowledge theory.
3 Representation of Knowledge

Axiom 3-1. Knowledge, as an entirety, can be decomposed into three interrelated components, namely the formal knowledge, the content knowledge, and the utility knowledge.

Definition 3-1. Assuming that an object X has N possible states, X = {x_n}, n = 1, …, N, the degree to which the state x_n may certainly stay is defined as the certainty of the state x_n and denoted by c_n, n = 1, …, N. The set of the certainties of all states of X is named the generalized distribution of certainty of X and denoted by C = {c_n}, n = 1, …, N. The formal knowledge can then be represented by use of {X, C}.

$0 \leq c_n \leq 1, \ \forall n, \quad \text{and} \quad \sum_{n=1}^{N} c_n \ (\geq, =, \text{or} \leq)\ 1$ (3.1)

Definition 3-2. Assuming that an object X has N possible states, X = {x_n}, n = 1, …, N, the degree to which a meaning of a state x_n may be valid is defined as the truth of the state x_n and denoted by t_n, n = 1, …, N. The set of the truths of all states of X is named the generalized truth distribution of X and denoted by T = {t_n}, n = 1, …, N. The content knowledge can then be represented by {X, T}.

$0 \leq t_n \leq 1, \ \forall n, \quad \text{and} \quad \sum_{n=1}^{N} t_n \ (\geq, =, \text{or} \leq)\ 1$ (3.2)

Definition 3-3. Assuming that an object X has N possible states, X = {x_n}, n = 1, …, N, the value indicated by a state x_n with respect to a subject's goal is defined as the utility of the state x_n and denoted by u_n, n = 1, …, N. The set of the utilities of all states of X is named the generalized utility distribution of X and denoted by U = {u_n}, n = 1, …, N. The utility knowledge can then be represented by use of {X, U}.

$0 \leq u_n \leq 1, \ \forall n, \quad \text{and} \quad \sum_{n=1}^{N} u_n \ (\geq, =, \text{or} \leq)\ 1$ (3.3)
Definition 3-4. The joint, $\Im_n$, of the certainty of a state x_n with its related truth is defined as the integrative truth of the state x_n, n = 1, …, N, and is expressed as

$\Im_n = \alpha c_n \bullet \beta t_n \Rightarrow c_n t_n, \ \forall n, \quad \text{and} \quad \Im = \{\Im_n\}$ (3.4)

Here $\Im$ is called the integrative truth distribution, and the arrow stands for "reducible to". Obviously,

$0 \leq \Im_n \leq 1, \ \forall n, \quad \text{and} \quad \sum_{n=1}^{N} \Im_n \ (\geq, =, \text{or} \leq)\ 1$ (3.5)

The knowledge corresponding to this is called the integrative content knowledge.

Definition 3-5. The joint, $\eta_n$, of the certainty of a state x_n with its related truth and utility is defined as the integrative utility of the state x_n, n = 1, …, N, and is expressed as

$\eta_n = \alpha c_n \bullet \beta t_n \bullet \gamma u_n \Rightarrow c_n t_n u_n, \ \forall n, \quad \text{and} \quad \eta = \{\eta_n\}$ (3.6)

Here $\eta$ is termed the integrative utility distribution; the arrow stands for "reducible to". Also,

$0 \leq \eta_n \leq 1, \ \forall n, \quad \text{and} \quad \sum_{n=1}^{N} \eta_n \ (\geq, =, \text{or} \leq)\ 1$ (3.7)

The knowledge corresponding to this is called the integrative utility knowledge.
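In the reduced forms of (3.4) and (3.6), the integrative distributions are simple componentwise products, as this minimal sketch (our own illustration, assuming NumPy) shows:

```python
import numpy as np

def integrative_truth(c, t):
    """Reduced form of Eq. (3.4): integrative truth = c_n * t_n."""
    return np.asarray(c) * np.asarray(t)

def integrative_utility(c, t, u):
    """Reduced form of Eq. (3.6): integrative utility = c_n * t_n * u_n."""
    return integrative_truth(c, t) * np.asarray(u)
```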
4 Knowledge Measurements

Definition 4-1. The quantitative measurement of knowledge is simply termed the knowledge amount. A straightforward way is to define the amount of knowledge as the amount of a standard problem that the knowledge can solve. A reasonable definition of the standard problem is the so-called alternative problem with two equally probable choices.

Definition 4-2. The problem amount contained in a standard problem is defined as a unit amount of problem. The unit can be termed alt (the first three letters of the word "alternative"). A piece of knowledge amounts to one alt if it solves one alternative standard problem. The unit of knowledge is set to be the alt too.

It is also necessary to find a general approach to the calculation of the amounts of problem and knowledge. Noticing Eq. (3.1), there are two cases to be considered: Case 1, the certainties are normal – the sum of all certainties equals 1; and Case 2, the certainties are not normal – the sum of all certainties is not necessarily equal to 1. Case 1 is studied first in what follows.

Definition 4-3. The uniform certainty distribution $C_0$ and the 0–1 type certainty distribution $C_s$ represent two extreme cases of certainty distributions and can be expressed as below:

$C_0 = \{c_n \mid c_n = \tfrac{1}{N}, \ n = 1, 2, \ldots, N\}$ (4.1)

$C_s = \{c_n \mid c_n \in \{0, 1\}, \ \forall n\}$ (4.2)
Definition 4-4. The average certainty defined on a certainty distribution C can be defined as

$M_\phi(C) = \phi^{-1}\left\{\sum_{n=1}^{N} c_n \phi(c_n)\right\}$ (4.3)

where φ in Eq. (4.3) is a continuous and monotonic function to be determined, and $\phi^{-1}$ is the inverse function of φ, which is also continuous and monotonic.
N
N
φ {∑ cn φ (cn d n )} = φ {∑ cn φ (cn )} • φ {∑ cn φ (d n )} −1
−1
n =1
−1
n =1
(4.4)
n =1
Theorem 4-1. Function φ in Eq.(4.1), satisfying definitions 4-4 and 4-5, must take the form of logarithm. This is a kernel theorem with fundamental significance in knowledge theory. The proof of the theorem is a basic mathematical problem and can easily be found in [3], [5] and [6]. Corollary 4-1. The average certainty of event X with certainty distribution C will have the form: N
M φ (C ) = ∏ (cn ) cn
(4.5)
n =1
Corollary 4-2. The value of average certainty is bounded by
1 = M φ ( C0 ) ≤ M φ ( C ) ≤ M φ ( C s ) = 1 N
1
N
and 1: (4.6)
The proof of corollaries 4-1 and 4.2 are straightforward. It would be a good idea to take the average certainty as a basic parameter for measuring the amount of knowledge. A piece of knowledge concerning an event (X, C) is said to be maximum if the average certainty defined on C will have the value M φ ( C ) = 1. On the other hand, it is said to be nothing if the average certainty have the value
M φ (C ) = 1 N . The latter case can serve as a reference of knowledge
amount calculation. Definition 4-6. The amount of knowledge of an event (X, C) can be measured by
K (C ) = log
M φ (C ) M φ (C0 )
N
= log N + ∑ cn log cn
(4.7)
n =1
K(C) can be understood as the self-knowledge amount of event X. However, one may also be concerned with the amount of knowledge a certain subject receives. In the latter case, some new concepts are demanded. Definition 4-7. The certainty distribution of an event a subject knew before observing is termed as a priori certainty distribution. While the certainty distribution of the event a subject has after his observing is termed as a posteriori certainty distribution.
64
Y.X. Zhong
For convenience in writing, the a priori certainty distribution is denoted by C and the a posteriori certainty distribution is denoted by C*. Consequently, a subject, R, is said to have gained a certain amount of knowledge about an event X if, and only if, his a posteriori average certainty about the event X became larger than his a priori average certainty through his own observation. Or, he may have lost a certain amount of knowledge. Definition 4-8. The amount of knowledge about an event (X, C) a subject, R, gains through observation can be measured by N
N
K (C , C*; R) = K (C*) − K (C ) = ∑ c log c − ∑ cn log cn n =1
* n
* n
(4.8)
n =1
It is clear from Eq.(4.8) that the amount of knowledge about an event (X, C) a subject, R, received through his observation can be calculated as long as the a priori and a posteriori certainty distributions have been given. Interestingly, the previous result of the unit amount of knowledge given in Definition 4-2 is completely in agreement with that given in Definition 4-8 and the former can directly be derived from the latter if letting N = 2, c1 = c2 = 1 2 , and
C* = Cs* as can be seen below: K (C , Cs* ; R) = 0 − log( 1 2 ) = 1
alt. (4.9) where the base of logarithm is set to 2 and the unit amount of knowledge is 1 alt. In the ideal case of observation, the a posteriori certainty distribution C* will always be of 0-1 type. In this case, if the a priori certainty distribution C is in uniform type, then from Eq.(4.8) we have
K (C0 , Cs* ; R) = log N
alt. (4.10) Now, let us consider the calculation of the amount of knowledge in case 2. Because of lacking of normality, it is not proper in case 2 to directly employ the results developed in case 1. However, a kind of new distribution of certainties, Cn , can be established for any single element
cn , n=1, …, N, of the given certainty
distribution, C= { cn }, in case 2 as can be seen below.
Cn = {cn , (1 − cn )}, ∀n
(4.11) Clearly, all the two-element distributions shown in Eq.(4.11) are normal and, therefore, all the results obtained in case 1 can now be employed to these new distributions. As the results, the average certainty defined on Cn = {cn , (1 − cn )}, ∀n , can be derived from Eqs.(4.5) and (4.11):
M φ (Cn ) = (cn ) cn (1 − cn ) (1− cn ) (4.12) and the self-knowledge amount for Cn can be obtained from Eq.(4.7) K (Cn ) = cn log cn + (1 − cn ) log(1 − cn ) + log 2 (4.13) The knowledge amount about the event ( x n , Cn ) the subject, R, gained from his observation can be obtained:
K(Cn , Cn* ; R) = cn* log cn* + (1 − cn* ) log(1 − cn* ) − [cn log cn + (1 − cn )log(1 − cn )] (4.14)
Knowledge Theory: Outline and Impact
65
For the deterministic fuzzy event, the average knowledge amount about the entire event (X, C) the subject, R, gained from observation can then be written as
1 K (C , C*; R) = N
N
∑ K (C n =1
n
, Cn* ; R)
(4.15)
Based on Eqs.(4.8) and (4.15), the calculation of formal knowledge amount can then be handled. It is worth of noticing that the calculations of content knowledge amount K (T , T *; R ) , utility knowledge amount K (U , U *; R ) , integrative content knowledge amount
K ( ℑ, ℑ*; R ) and integrative utility knowledge amount
K (η, η*; R ) can also be carried out through the procedure similar to the ones from Eq.(4.11) to Eq.(4.15) because the parameters of truth and utility are fuzzy in nature. Therefore, the following results can easily be derived:
1 N ∑ K (Tn , Tn* ; R) N n=1 K (Tn , Tn* ; R ) = K (Tn* ; R ) − K (Tn ; R ) K (T , T *; R) =
K (Tn ; R) = t n log t n + (1 − t n ) log(1 − t n ) + log 2
(4.16) (4.17) (4.18)
Similarly, we have
K (U , U *; R) = K (ℑ, ℑ*; R) =
1 N
1 N
N
∑ K (U
n
, U n* ; R)
(4.19)
n =1
N
∑ K (ℑ
n
, ℑ*n ; R)
(4.20)
n =1
and
K (η ,η*; R ) =
1 N
N
∑ K (η
n
,η n* ; R)
(4.21)
n =1
5 The Production of Knowledge: The Relation between Knowledge and Information [4] There are two approaches to knowledge production in general: induction and deduction. The former is the approach to the knowledge creation directly from information through induction while the latter is the approach to the knowledge creation from old knowledge existed through reduction. The reduction approach will play a more and more important role in knowledge creation as the existed knowledge continuously accumulated. On the other hand, however, the induction approach will remain a fundamental one for ever because there will be endless new, and never observed, problems ahead. Our stress here will be placed on induction approach because of the interest for finding the linkage between information and knowledge.
66
Y.X. Zhong
5.1 The Production Mechanism of Formal Knowledge Concepts are the basic elements of knowledge. It would be reasonable to start the exploration of the formal knowledge production mechanism from that of concept formation. Production Mechanism 5-1. The mechanism of concept formation in formal knowledge can be understood as a process of formal comparison and induction. The basic procedure may consist as follows: Observing a sample S1 of information randomly given, try to extract its static state features such as shape, size, color, texture, weight, position in space, and so on and also extract its dynamic features, i.e., the ways the states vary. Observing the second sample of information S 2 , also randomly given, extract its formal features as did in step 1 and then compare them with the previous ones. Keep S 2 if its major features are in common with that of S1 and neglect S 2 otherwise. Repeat the step 2 for N times, where N is a sufficient large positive integer. A set of information features { f k }, k = 1, 2, …, K, that are common within the N' ≤ N samples, can then be formd and is termed the Common Feature Set (CFS) of the N' samples, where N-N' samples were ignored, K>0, the number of common features, is an integer and may gradually decrease as N increases. Ignoring all the samples that do not fit { f k }, the remaining N' samples are named a basic set of the samples. When N and N' are sufficiently large and { f k } tends to be no change as N increases, CFS is then said to have been stable and a concept, abstracted from the class of information samples, is thus formed. The CFS, { f k }, is called the intension of the concept whereas all the members of the class are called the extension of the concept. Otherwise, return to step 3, till CFS is stable. This is the simplest description about the possible production mechanism of a concept. Based on the procedure described above, it is easy to establish any concepts of interest and to distinguish one concept from another. This is really the base of knowledge creation. 5.2 The Production Mechanism of Utility Knowledge It is interesting to note before going to the next point that because content knowledge is more abstract than utility knowledge in nature we can only discuss the mechanism of utility knowledge first and then move to that of content knowledge. In most cases, a subject, or a system, should have certain objective (goal) and the utility of a piece of information toward the subject/system can then be judged in terms of the contribution the information may make to the implementation of the objective. Production Mechanism 5-2. The procedure of utility knowledge production from information may need the following steps:
Knowledge Theory: Outline and Impact
67
Clearly define the general objective, G, for the concerned subject. G may further be decomposed into a number of sub-objectives, G = {Gn } n =1 , that are easy to test to see whether any sub-objective is threatened or supported by any external stimulus. Inputting a piece of information X, the subject records and stores the formal description, D(X), of X including the description about its states and the manner with which the states vary. Observing the influence that the information makes to {Gn } : whether any subobjective suffers threatening or gained supporting from the information. The former N
+
−
indicates the positive utility, un , and the latter the negative, un , for all n. An average value of utility, u , may also need to consider. Set up the descriptor {D( X ), G; u} which means that the information with the formal descriptor D(X) provides a utility u to the subject whose objectives is G. Notice that {D( X ), G; u} is just another expression of utility knowledge. For any new information received, the utility knowledge can be obtained by comparing the formal description of the new information with D(X). As long as the new information has the formal feature similar to D(X), it will be regarded to have utility u to the subject having objective G in accord with {D( X ), G; u} . Otherwise, a loop from step 1 to 4 will be needed. This mechanism of utility knowledge production is feasible either for humans or artificial systems. The key steps are the definition of the objective the subject/system has and the calculation of the utility based on the comparison and analysis. 5.3 The Production Mechanism of Content Knowledge As is mentioned above, the content knowledge can only be produced after the production of formal and utility knowledge. The production mechanism of content knowledge can be stated as below. Production Mechanism 5-3. The procedure of content knowledge production may contain the steps: (1) The production of the related formal knowledge, K F , by the production mechanism 5-1; (2) The production of the related utility knowledge, KU , by the production mechanism 5-2; (3) The establishment of a linkage between K F and KU such that K C :
K F α KU
(4) The content knowledge is " K F
α KU "
in which
α
expresses the logic
relationship between K F and KU such as "be", "have", "do", and so on. (5) Name the meaning of the content knowledge by a term.
68
Y.X. Zhong
6 The Activation of Knowledge: The Relation between Knowledge and Intelligence [4] It is clear from the definition of intelligence in section 2 that the intelligence consists a number of basic component factors: information acquisition, knowledge creation, strategy formation, and strategy execution, etc. We have discussed the issues of knowledge creation in Section 5 and here we would like to discuss the issue of the formation of strategy from knowledge. This is also termed knowledge activation. How can knowledge be practically activated into intelligent strategy? Suppose that the information about the present state of the problem has been stored in database already and that the knowledge needed for solving the problem is also stored in knowledge base and rule base. Moreover, the goal that the problem solving is seeking for has been set up. Knowledge Activation 6-1. The algorithm of knowledge activation can then be described as below. Set up a threshold ε 0 that is an allowable error of goal seeking. Establish a measure that indicates the error ε between the problem state and the goal state. Select an applicable rule (a rule whose conditions match with the present state of the problem) from rule base and apply it to the database, producing a new state of the problem in database. If the error between the new state and the goal state is larger than ε 0 , select another rule that would be able to make error decrease. As long as the error is decreased but yet still larger than ε 0 , continue the rule selection and application. When ε ≤ ε 0 , the algorithm terminates, the problem is solved The strategy is the sequence of the successful rule selections for the problem solving. Otherwise got back to step (2) The algorithm shows that the intelligent strategy for problem solving can be formed through the utilization of the knowledge that is related to the problem, the environment (the constraints of the problem solving) and the goal, including the contents previously stored in knowledge base and the rule base.
7 The Impact: A Unified Theory of Information-Knowledge-Intelligence It is a pity that there has been lack of a systematic theory of knowledge till the present time. We do have "knowledge engineering" for quite a time but it does not touch the nucleus of knowledge theory, the essential linkage among information, knowledge, and intelligence. Nevertheless, the world is more and more urgently demanding the knowledge theory as the knowledge-based economy advanced rapidly everywhere. We have presented the fundamentals of knowledge theory from section 2 though to section 4 of the paper. We have also delivered the main body of the knowledge theory in sections 5 and 6 that showed that knowledge can be refined from information
Knowledge Theory: Outline and Impact
69
through induction mechanism and intelligence can be formed from knowledge through activation mechanism. By jointly employing the two mechanisms, a kernel of the unified theory of information, knowledge, and intelligence may be established. This is the impact that knowledge theory may have. It is necessary to mention that all the results presented in the paper are of course a preliminary work on knowledge theory. Many extensive issues of knowledge theory remain open.
References [1] C. E. Shannon, The Mathematical Theory of Communication, BSTJ, vol.27, 1948, p. 379– 423, p. 632–656 [2] A. Feigenbaum et al., Handbook of Artificial Intelligence, Vol.1, William Kaufmann.Inc., 1981 [3] Y. X. Zhong, Principles of Information Science, UPT Press, Beijing, 1988(I), 1996(II) [4] Y. X. Zhong, A Framework of Knowledge Theory, Journal of China Electronics, Vol. 28, No.5, 2000 [5] J. Aczel, Lectures on Functional Equations and Their Applications, Academic Press, 1966. [6] G. H. Hardy, et al, Inequalities, Cambridge University Press, London, 1973
A Rough Set Paradigm for Unifying Rough Set Theory and Fuzzy Set Theory Lech Polkowski Polish–Japanese Institute of Information Technology, Koszykowa 86, 02008 Warsaw, Poland Department of Mathematics and Computer Science, University of Warmia and ˙ lnierska 14a, 10561 Olsztyn, Poland Mazury, Zo Lech.Polkowski, [email protected]
To Professors Zdzislaw Pawlak and Lotfi A. Zadeh Abstract. In this plenary address, we would like to discuss rough inclusions defined in Rough Mereology, a joint idea with A. Skowron, as a basis for common models for rough as well as fuzzy set theories. We would like to justify the point of view that tolerance (or, similarity) is the leading motif common to both theories and in this area paths between the two lie. Keywords. Rough set theory, fuzzy set theory, rough mereology, rough inclusion.
1
Introduction
We give here some basic insight into the two theories. 1.1
Rough Sets: Basic Ideas
Rough Set Theory begins with the idea (cf.[6],[5]) of an approximation space, understood as a universe U together with a family R of equivalence relations on U(knowledge base). Given a sub–family S ⊆ R, the equivalence relation S = S induces a partition PS of U into equivalence classes [x]S of the relation S. In terms of PS , concept approximation is possible; a concept is a subset X ⊆ U . There are two cases: (1) a concept X is S–exact: X = {[x]S : [x]S ⊆ X}; (2) otherwise, X is said to be S − −rough. In case (2), the idea of approximation comes useful (cf. [5]). Two exact sets, approximating X from below and from above, are: (low) SX = {[x]S : [x]S ⊆ X} (1) (upp) SX = {[x]S : [x]S ∩ X = ∅} Then clearly: (1) SX ⊆ X ⊆ SX; (2) SX (resp. SX) is the largest (resp. the smallest) S–exact set contained in (resp. containing) X. G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 70–77, 2003. c Springer-Verlag Berlin Heidelberg 2003
A Rough Set Paradigm for Unifying Rough Set Theory
71
Sets (concepts) with identical approximations may be identified: consider an equivalence relation ≈S defined as follows (cf.[5]) X ≈S Y ⇔ [S]X = SY ∧ SX = SY
(2)
This is clearly an equivalence relation; let ConceptsS denote the set of these classes. Then for x, y ∈ ConceptsS , we have x = y ⇔ ∀u ∈ U.φu (x) = φu (y) ∧ ψu (x) = ψu (y)
(3)
where for x = [X]≈S we have φu (x) = 1 in case [u]S ⊆ X, otherwise φu (x) = 0; similarly, ψu (x) = 1 in case [u]S ∩ X = ∅, otherwise ψu (x) = 0. The formula (3) witnesses the leibnizian indiscernibility in ConceptS : entities are distinct if and only if they are discerned by at least one of available functionals. The idea of indiscernibility is one of the most fundamental in Rough Set Theory (cf. [5]). Other fundamental notions are derived from the observation on complexity of the generation of S: one may ask whether there is some T ⊂ S such that (*) S = T ; in case the answer is positive one may search for a minimal with respect to inclusion subset U ⊂ S satisfying (*). Such a subset is said to be an S–reduct. Letus observe that U ⊂ S is an S–reduct if and only if for each R ∈ U we have that (U \ {R}) = U; in this case we say that U is independent. These ideas come afore most visibly in case of knowledge representation in the form of information systems cf.[5]. An information system is a universe U along with a set A of attributes each of which is a mapping from U into a value set V . Clearly, each attribute a ∈ A does induce a relation of a–indiscernibility IN D(a) defined as follows: xIN D(a)y ⇔ a(x) = a(y), x, y ∈ U
(4)
The family {Ind(a) : a ∈ A} is a knowledge base and for each B ⊆ A, the relation IN D(B) = {IN D(a) : a ∈ B} is defined. The notions of B– indiscernibility, B–reduct, independence are defined as in general case cf. [5]. 1.2
Fuzzy Sets: Basic Notions
A starting point for Fuzzy Set Theory is that of a fuzzy set cf. [12]. Fuzzy sets come as a generalization of the usual mathematical idea of a set: given a universe U , a set X in U is expressed by means of its characteristic function χX : χX (u) = 1 in case u ∈ X, χX (u) = 0, otherwise. A fuzzy set X is defined by allowing χX to take values in the interval [0, 1]. Thus, χX (u) ∈ [0, 1] is a measure of the degree to which u is in X. Once fuzzy sets are defined, one may define a fuzzy algebra of sets by defining operators responsible for the union, the intersection, the complement of fuzzy sets. Usually, these are defined by selecting a t–norm, a t–conorm, and a negation functions where a t–norm T (x, y) is a function allowing the representation (cf.[10], Ch. 14): T (x, y) = g(f (x) + f (y)) (5)
72
L. Polkowski
where the function f : [0, 1] → [0, +∞) in (5) is continuous decreasing on [0, 1] and g is the pseudo–inverse to f (i.e. g(u) = 0 in case u ∈ [0, f (1)], g(u) = f −1 (u) in case u ∈ [f (1), f (0)], and g(u) = 1 in case u ∈ [f (0), +∞)). A t–conorm C is induced by a t–norm T via the formula C(x, y) = 1 − T (1 − x, 1 − y). A negation n : [0, 1] → [0, 1] is a continuous decreasing function such that n(n(x)) = x. An important example of a t–norm is the Lukasiewicz product ⊗(x, y) = max{0, x+y−1} cf. [8]; we recall also the Menger product P rod(x, y) = x · y. 1.3
Direct Bridging of Rough and Fuzzy Universes
It is natural to create a fuzzy universe within a rough one: given an information system (U, A) and a set B ⊆ A of attributes, and a concept X ⊆ U , a rough membership function mB,X on U is defined as follows cf. [7]: mB,X (u) =
|X ∩ [u]B | |[u]B |
(6)
It may be noticed that (1) in case X is a B–exact set, mB,X is the characteristic function of X i.e. X is perceived as a crisp set (2) mB,X is a piece-wise constant function, constant on classes of IN D(B) (3) from mB,X , rough set approximations are reconstructed via: BX = {u ∈ U : mB,X (u) = 1} BX = {u ∈ U : mB,X (u) > 0
(7)
In this case, as witnessed by (7) the rough and fuzzy notions of necessity and possibility coincide.
2
Rough vs. Fuzzy: Logical Opposition
With each concept X ⊆ U , and a set of attributes B, two approximations BX, BX are defined; for u ∈ U , the sentence u ∈ X may acquire one of the three logical values: 1 in case u ∈ BX (certainty), 0 in case u ∈ / BX (impossibility), and ? in case u ∈ BX \ BX (don’t know). It follows that rough sets are related to 3–valued logic. 2.1
3–Valued Logic
It was proposed in [3] as a logic with truth–values 0, 1, 12 in which the implication functor Cpq and the negation functor N p were defined by means of: C(x, y) = min{1, 1 − x + y} N (x) = 1 − x These values may be seen in a table.
(8)
A Rough Set Paradigm for Unifying Rough Set Theory
73
Table 1. 3–logic of L ukasiewicz C 0 1 12 N 0 111 1 1 0 1 12 0 1 1 1 1 12 2 2
Example 1. Truth table for 3–logic The L ukasiewicz logic was completely axiomatized by M. Wajsberg (cf. [10], Ch. 4) by means of the axiom schemes: W1 Cq(Cpq) W2 C(Cpq)C((Cqr)(Cpr)) W3 C(C(C(pN p)p)p) W4 C((CN qN p)(Cpq)) Formulae W 1 − W 4 may be translated into an algebra W by first identifying formulae α ≈ β of 3–logic in case α ⇔ β is a theorem of the 3–logic, next considering classes [α]≈ of formulae, and then defining operations → and N , and constant 1 on the set of classes as follows: 1. [α]≈ → [β]≈ = [Cαβ]≈ . 2. N [α]≈ = [N α]≈ . 3. [α]≈ = 1 where α is a theorem. Then W 1 − W 4 are rendered as follows in the resulting algebra W (the Wajsberg algebra): w1 x → (y → x) = 1 w2 (x → y) → ((y → z) → (x → z)) = 1 w3 ((x → N x) → x) → x) = 1 w4 (N x → N y) → (y → x) = 1 w5 if 1 → x = 1 then x = 1 w6 x → y = 1 and y → x = 1 imply x = y.
Now, let us look at concepts in a given universe U as defined by means of attributes in A: a concept X is represented via its approximations (I = AX, C = AX) (I for interior, C for closure). Letting E = U \ C (E for exterior), we represent a concept as a pair (I, E); always, I ∩ E = ∅. In case I ∪ E = U , the represented concept is exact; otherwise it is rough and then manifestly, no {x} ⊆ D \ I is exact. In the family C of pairs (I, E) as above, one may introduce operations →, N following Becchio and Pagliani (cf. [10], Ch. 12) (to simplify, we write −X for U \ X): N (I, E) = (E, I)//(I1 , E1 ) → (I2 , E2 ) = (E1 ∪ I2 ∪ −I1 ∩ −E2 , I1 ∩ E2 )
(9)
Then (cf. [10], pp. 361 ff.): Proposition 1. The algebra (C, →, N, 0 = (∅, U ), 1 = (U, ∅)) does satisfy w1.– w6. Proof. A proof is in verifying directly that w1.–w6. are observed under the rough set interpretation. For instance, with x = (I1 , E1 ), y = (I2 , E2 ), the formula w1. yields on the left–hand side x → (y → x) = (E1 ∪ E2 ∪ I1 ∪ −I2 ∩ −E1 ∪ −I1 ∩ −I2 ∪ −I1 ∩ −E1 , ∅) = (U, ∅) = 1.
74
L. Polkowski
Summing up the above, it turns out that 3–logic describes rough sets in logical terms. The corresponding logic for fuzzy sets is clearly [0,1]–valued logic. Functors C, N are expressed by (8) and a set of axiom formulae was conjectured by L ukasiewicz [4] and proved complete by Rose and Rosser (cf. [10], Ch. 13, for references). [0,1]–valued logic enters fuzzy logic in a way envisioned by Goguen (cf. [10] for references) i.e. given a fuzzy set of axioms and fuzzy derivation rules, formulae are produced along with their degrees of truth. Thus, a fuzzy derivation rule is two–fold, in its syntactic part giving rules for new formula forming and in its semantic parts producing the truth degree of the new formula from truth degrees of its parents. As in [8], fuzzy logic is interpreted in the L ukasiewicz lattice L = ([0, 1], ∨, ∧, ⊗, −→, 0, 1) where ∨(x, y) = max{x, y}, ∧(x, y) = min{x, y}, ⊗(x, y) = max{0, x + y − 1}, x −→ y = min{1, 1 − x + y}. A fuzzy set A of axiom formulae is defined; derivation rules are fuzzy detachx,y α x ment: ( α,α⇒β , ⊗(x,y) ) and lifting: ( a⇒α , a−→x ) for a ∈ (0, 1). β The set F of formulae of fuzzy logic is the smallest set containing a set V of variables, the set {a : a ∈ (0, 1) of constants and closed under functors ∨, ∧, ⇒, † interpreted as respectively, ∨, ∧, −→, ⊗. The set A consists among others of theses of [0, 1]–logic, formulae like a ∧ b ⇒ (a ∧ b), a ∨ b ⇒ (a ∨ b), a ⇒ b ⇒ (a −→ b), α ⇒ 1, α † 1 = 1. The fuzzy set A is given via its membership function χA : F → L outlined as follows: χA (a) = a χA (a † b) = ⊗(a, b) (10) χA (a ⇒ b) = a −→ b Otherwise, χA (α) is 1 for α a thesis of [0,1]–logic, and 0 for all other cases of α. Given this, for each fuzzy set X : F → L , we find the smallest set CS (X) : F →L such that (1) A ∨ X ≤ CS (X) (2) for each derivation rule R = (R1 , R2 ): χX (R1 (x1 , ..., xk )) ≥ R2 (χX (x1 ), ..., χX (xk )) and we say that a formula α is a syntactic consequence of X in degree at least a = χCS (X) (α), in symbols X a α. Analogously a semantics is introduced: the semantics is the family E of fuzzy homomorphisms T : F → L i.e. T (α ‡ β) = T (α) ⊕ T (β) where ‡, ⊕ = ∨, ∨, ∧, ∧, †, ⊗, ⇒, −→, respectively. Given E, the semantic consequence CE (X) = {Y ∈ E : X ≤ Y} : L F →L F is defined. We say that α is a semantic consequence to X in degree at least a in case χCE (X) (α) ≥ a, in symbols X |=a α. The basic result of [8] is that CS = CE i.e. fuzzy logic is complete. It is noteworthy that ⊗ is the unique up to isomorphism functor making this logic complete (loc.cit). From the above short discussion we see that rough and fuzzy worlds are opposite ends of many–valued logical scale. Bridging them will be the task in next section.
A Rough Set Paradigm for Unifying Rough Set Theory
3
75
A Common Extension to Rough and Fuzzy: Rough Mereological Similarity
We introduce a reasoning mechanism whose distinct facets correspond to rough respectively fuzzy approaches to reasoning. We outline its main ideas. 3.1
Classical Ideas of Mereology
The basic notion is that of part relation on a universe U cf. [2]; in symbols: xπy reads x is a part of y. It should satisfy the following: p1 xπy ∧ yπz ⇒ xπz; p2 ¬(xπx) The notion of an element, x el y, is the following: x el y ⇔ xπy ∨ x = y. Thus, x el y ∧ y el x is equivalent to x = y. The relation el is an ordering of the universe U . In order to make a non–empty concept M ⊆ U into an entity, the class operator Cls is used cf. [2]. The definition of the class of M , in symbols Cls(M ), is as follows: c1 x ∈ M ⇒ x el Cls(M ) c2 x el Cls(M ) ⇒ ∃y, z.y el x ∧ y el z ∧ z ∈ M
(11)
Condition c1 in (11) includes all members of M into Cls(M ) as elements; condition c2 demands that each element of Cls(M ) has an element in common with a member of M (compare this with the definition of the union of a family of sets). We demand also that Cls(M ) be unique. From this demand it follows that cf. [2]: (Inf ) [∀y.(y el x ⇒ ∃w.w el y ∧ w el z)] ⇒ x el z (12) (Inf) is a useful rule for recognizing that x el z. 3.2
Rough Mereology
Given a mereological universe (U, π), we introduce cf. [11], on U × U × [0, 1] a relation µ(x, y, r) read x is a part of y to degree at least r. We will write rather more suggestively xµr y, calling µ a rough inclusion. We require of µ the following: r1 xµ1 x r2 xµ1 y ⇔ x el y r3 xµ1 y ⇒ ∀z, r.(zµr x ⇒ zµr y) r4 xµr y ∧ s < r ⇒ xµs y
(13)
Informally, r2 ties rough inclusions to mereological underlying universes, r3 does express monotonicity (a bigger entity cuts a bigger part of everything), r4 says that a part in degree r is a part to any lesser degree.
76
3.3
L. Polkowski
Basic Examples
In addition to the rough membership function, we consider the following rough inclusions. Gaussian linear (Gl) rough inclusion. Given (U, A), we define DISA (x, y) = {a ∈ A; a(x) = a(y)} and then yµr x if and only if e(−Σa∈DISA (x,y) wa ≥ r where r x,xµs z wa ∈ (0, ∞) is a weight attached to a. Then cf. [9], the rule yµyµ holds. rs z L ukasiewicz rough inclusion. In notation of the above, we let yµr x if and A (x,y)| r x,xµs z only if 1 − |DIS|A| ≥ r. Then cf. [9], the rule yµ yµ⊗(r,s) z holds. The reader will find in [9] a detailed discussion of the two inclusions, a part of this address. Gl and Luk satisfy the additional requirement: r y,yµs z r5. there exists a t–norm f such that the inference rule xµ xµf (r,s) z holds. We say that µ is an f –rough inclusion. 3.4
Rough World from Rough Inclusions
Assume a rough inclusion µ and a property M as above; given an entity x, we define its lower M, µ–approximation xM as follows: xM = Cls({y ∈ Y : y ∈ M ∧ yµ1 x})
(14)
Then we have as a direct consequence of (Inf) in (12): (L1) xM el x. Again applying (Inf), we obtain: (L2) xM el xM M hence xM = xM M . We will say that x is M –exact in case there exists M0 ⊆ M such that x = Cls(M0 ). Then we have: (L3) x = xM if and only if x is M –exact. Indeed, if x = xM then x = Cls({y ∈ M : yµ1 x}). The converse follows from (Inf) directly. Similarly, we introduce the upper M, µ–approximation xM : xM = Cls({y ∈ U : y ∈ M ∧ ∃r > 0.yµr x})
(15)
We say that M is an rm–covering in case the following holds: (rmcov) ∀y. y el x ⇒ ∃w.y el w ∧ w ∈ M ∧ ∃r > 0.wµr x
(16)
Then, again by (Inf): (U1) if M is an rm–covering, then x el xM . We say that M is an rm–partition in case the following holds: 1. yµr z ∧ y, z ∈ M ⇒ r = 0 2. y ∈ M ∧ M0 ⊆ M ∧ ∀z ∈ M0 .yµ0 z ⇒ yµ0 Cls(M0 ). Then: (U2) if M is an rm–covering and an rm–partition then xM = xM M . Again, using (inf) we get: (U3) for an rm–partition and rm–covering M , x = xM if and only if x = Cls(M0 ) for some non-empty M0 ⊆ M . Thus: (LU) xM = x = xM if and only if x is M –exact.
A Rough Set Paradigm for Unifying Rough Set Theory
3.5
77
Fuzzy World from Rough Inclusions
Because of yµ1 x equivalent to y el x, we may interpret yµr x as a statement of fuzzy membership; we will write µx (y) = r to stress this interpretation. Thus, rough inclusions induce globally a family of fuzzy sets {X : X = Cls(V ) ∧ ∅ = V ⊆ U } with fuzzy membership functions µX . We assume r5 additionally for µ with a t–norm f . Let us consider a relation τ on U defined as follows: xτr y ⇔ xµr y ∧ yµr x
(17)
τr is for each r a tolerance relation. Also: 1. xτ1 x 2. xτr y ⇔ yτr x 3. xτr y ∧ yτs z ⇒ xτf (r,s) z. Thus, τ is the f –fuzzy similarity cf. [13]. From this a number of results on fuzzy equivalences and partitions cf. [10] may be derived along the lines indicated.
References 1. L. Borkowski (ed.). Jan L ukasiewicz. Selected Works. North Holland – Polish Sci. Publ., Amsterdam – Warsaw, 1970. 2. S. Le´sniewski. On the foundations of mathematics. Topoi, 2: 7–52, 1982. 3. J. L ukasiewicz. Farewell lecture, Warsaw Univ., March 1918. In: [1], pp. 84–86. 4. J. L ukasiewicz and A. Tarski. Untersuchungen ueber den Aussagenkalkuls. In: [1], pp. 130–152. 5. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer, Dordrecht, 1992. 6. Z. Pawlak. Rough sets. Intern. J. Comp. Inf. Sci., 11 (1982), pp. 341–356. 7. Z. Pawlak and A. Skowron. Rough membership functions. In R. R. Yager, M. Fedrizzi, and J. Kacprzyk, editors, Advances in the Dempster-Schafer Theory of Evidence, 251–271, Wiley, New York, 1994. 8. J. Pavelka. On fuzzy logic I,II,III. Zeit. Math. Logik Grund. Math., 25, 1979, pp. 45–52, 119–134, 447–464. 9. L. Polkowski. Rough Mereology. A Survey of new results... These Proceedings. 10. L. Polkowski. Rough Sets. Mathematical Foundations. Physica, Heidelberg, 2002. 11. L. Polkowski, A. Skowron. Rough mereology: a new paradigm for approximate reasoning. International Journal of Approximate Reasoning, 15(4): 333–365, 1997. 12. L. A. Zadeh. Fuzzy sets. Information and Control, 8 (1965), pp. 338–353. 13. L. A. Zadeh. Similarity relations and fuzzy orderings. Information Sciences, 3 (1971), pp. 177–200.
Extracting Structure of Medical Diagnosis: Rough Set Approach Shusaku Tsumoto Department of Medical Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho Izumo City, Shimane 693-8501 Japan [email protected]
Abstract. One of the most important problems on rule induction methods is that they cannot extract rules, which plausibly represent experts’ decision processes. It is because rule induction methods induce probabilistic rules that discriminates between a target concept and other concepts, assuming that all the concepts are on the same level. However, medical experts assume that all the concepts of diseases are belonging to the different level of hierarchy. In this paper, the characteristics of experts’ rules are closely examined and a new approach to extract plausible rules is introduced, which consists of the following three procedures. First, the characterization of decision attributes (given classes) is extracted from databases and the concept hierarchy for given classes is calculated. Second, based on the hierarchy, rules for each hierarchical level are induced from data. Then, for each given class, rules for all the hierarchical levels are integrated into one rule. The proposed method was evaluated on a medical database, the experimental results of which show that induced rules correctly represent experts’ decision processes.
1
Introduction
One of the most important problems in data mining is that extracted rules are not easy for domain experts to interpret. One of its reasons is that conventional rule induction methods [6] cannot extract rules, which plausibly represent experts’ decision processes [8]: the description length of induced rules is too short, compared with the experts’ rules. For example, rule induction methods, including AQ15 [2] and PRIMEROSE [8], induce the following common rule for muscle contraction headache from databases on differential diagnosis of headache: [location = whole] ∧[Jolt Headache = no] ∧[Tenderness of M1 = yes] → muscle contraction headache.
This rule is shorter than the following rule given by medical experts. [Jolt Headache = no] ∧([Tenderness of M0 = yes] ∨[Tenderness of M1 = yes] ∨[Tenderness of M2 = yes]) ∧[Tenderness of B1 = no] ∧[Tenderness of B2 = no] ∧[Tenderness of B3 = no] ∧[Tenderness of C1 = no] ∧[Tenderness of C2 = no] ∧[Tenderness of C3 = no] ∧[Tenderness of C4 = no] → muscle contraction headache
where [Tenderness of B1 = no] and [Tenderness of C1 = no] are added. G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 78–88, 2003. c Springer-Verlag Berlin Heidelberg 2003
Extracting Structure of Medical Diagnosis: Rough Set Approach
79
One of the main reasons why rules are short is that these patterns are generated only by one criteria, such as high accuracy or high information gain. The comparative studies[8,9] suggest that experts should acquire rules not only by one criteria but by the usage of several measures. Those characteristics of medical experts’ rules are fully examined not by comparing between those rules for the same class, but by comparing experts’ rules with those for another class[8]. For example, the classification rule for muscle contraction headache given in Section 1 is very similar to the following classification rule for disease of cervical spine: [Jolt Headache = no] ∧([Tenderness of M0 = yes] ∨[Tenderness of M1 = yes] ∨[Tenderness of M2 = yes]) ∧([Tenderness of B1 = yes] ∨[Tenderness of B2 = yes] ∨[Tenderness of B3 = yes] ∨[Tenderness of C1 = yes] ∨[Tenderness of C2 = yes] ∨[Tenderness of C3 = yes] ∨[Tenderness of C4 = yes]) → disease of cervical spine
The differences between these two rules are attribute-value pairs, from tenderness of B1 to C4. Thus, these two rules can be simplified into the following form: a1 ∧ A2 ∧ ¬A3 → muscle contraction headache a1 ∧ A2 ∧ A3 → disease of cervical spine The first two terms and the third one represent different reasoning. The first and second term a1 and A2 are used to differentiate muscle contraction headache and disease of cervical spine from other diseases. The third term A3 is used to make a differential diagnosis between these two diseases. Thus, medical experts first select several diagnostic candidates, which are very similar to each other, from many diseases and then make a final diagnosis from those candidates. In this paper, the characteristics of experts’ rules are closely examined and a new approach to extract plausible rules is introduced, which consists of the following three procedures. First, the characterization of decision attributes (given classes) is extracted from databases and the concept hierarchy for given classes is calculated. Second, based on the hierarchy, rules for each hierarchical level are induced from data. Then, for each given class, rules for all the hierarchical levels are integrated into one rule. The proposed method was evaluated on a medical database, the experimental results of which show that induced rules correctly represent experts’ decision processes.
2 2.1
Rough Set Theory and Probabilistic Rules Rough Set Notations
In the following sections, we use the following notations introduced by GrzymalaBusse and Skowron[7], which are based on rough set theory[3]. These notations are illustrated by a small database shown in Table 1, collecting the patients who complained of headache.
80
S. Tsumoto
Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a, respectively.Then, a decision table is defined as an information system, A = (U, A ∪ {d}). For example, Table 1 is an information system with U = {1, 2, 3, 4, 5, 6} and A = {age, location, nature, prodrome, nausea, M 1} and d = class. For location ∈ A, Vlocation is defined as {occular, lateral, whole}. The atomic formulae over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va . The set F (B, V ) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For example, [location = occular] is a descriptor of B. For each f ∈ F (B, V ), fA denote the meaning of f in A, i.e., the set of all objects in U with property f , defined inductively as follows. 1. If f is of the form [a = v] then, fA = {s ∈ U |a(s) = v} 2. (f ∧ g)A = fA ∩ gA ; (f ∨ g)A = fA ∨ gA ; (¬f )A = U − fa By the use of the framework above, classification accuracy and coverage, or true positive rate is defined as follows. Definition 1. Let R and D denote a formula in F (B, V ) and a set of objects which belong to a decision d. Classification accuracy and coverage(true positive rate) for R → d is defined as: αR (D) =
|RA ∩ D| |RA ∩ D| , and κR (D) = , |RA | |D|
where |˙|, αR (D), κR (D) and P(S) denote the cardinality of a set, a classification accuracy of R as to classification of D and coverage (a true positive rate of R to D), respectively. Finally, we define partial order of equivalence as follows: Definition 2. Let Ri and Rj be the formulae in F (B, V ) and let A(Ri ) denote a set whose elements are the attribute-value pairs of the form [a, v] included in Ri . If A(Ri ) ⊆ A(Rj ), then we represent this relation as: Ri Rj . 2.2
Probabilistic Rules
According to the definitions, probabilistic rules with high accuracy and coverage are defined as: α,κ
R → d s.t. R = ∨i Ri = ∨ ∧j [aj = vk ], αRi (D) ≥ δα and κRi (D) ≥ δκ , where δα and δκ denote given thresholds for accuracy and coverage, respectively.
Extracting Structure of Medical Diagnosis: Rough Set Approach
3
81
Characterization Sets
In order to model medical reasoning, a statistical measure, coverage plays an important role in modeling, which is a conditional probability of a condition (R) under the decision D(P (R|D)). Let us define a characterization set of D, denoted by L(D) as a set, each element of which is an elementary attribute-value pair R with coverage being larger than a given threshold, δκ . That is, Definition 3. Let R denote a formula in F (B, V ). Characterization sets of a target concept (D) is defined as: Lδκ (D) = {R|κR (D) ≥ δκ } Then, three types of relations between characterization sets can be defined as follows: Independent type: Lδκ (Di ) ∩ Lδκ (Dj ) = φ, Boundary type: Lδκ (Di ) ∩ Lδκ (Dj ) = φ, and Positive type: Lδκ (Di ) ⊆ Lδκ (Dj ). All three definitions correspond to the negative region, boundary region, and positive region, respectively, if a set of the whole elementary attribute-value pairs will be taken as the universe of discourse. We consider the special case of characterization sets in which the thresholds of coverage is equal to 1.0. That is, L1.0 (D) = {Ri |κRi (D) = 1.0} Then, we have several interesting characteristics. Theorem 1. Let Ri and Rj two formulae in L1.0 (D) such that Ri Rj . Then, αRi ≤ αRj . Proof. Since κRi and κRj are 1.0, Ri ∩ D = D and Rj ∩ D = D. From the |D| A ∩D| definition of accuracy, αR (D) = |R|R = |R . Since Ri Rj , |RiA | ≥ |Rj A |. A| A| Therefore, αRi (D) =
|D| |D| ≤ = αRj (D) RiA Rj A
Thus, when we collect the formulae whose values of coverage are equal to 1.0, the sequence of conjunctive formulae corresponds to the sequence of increasing chain of accuracies. Since κR (D) = 1.0 means that the meaning of R covers all the samples of D, its complement U − RA , that is, ¬R do not cover any samples of D. Especially, when R consists of the formulae with the same attributes, it can be viewed as the generation of the coarsest partitions. Thus,
82
S. Tsumoto
procedure T otal P rocess; var inputs LD : List; /* A list of Target Concepts */ begin Calculate a set of characterization set Lc ; Calculate a set of intersection Lid ; Calculate a list of similarity measures Ls ; Calculate a list of grouping Lg ; (Fig. 2) Induce a set of rules for Lg : Lr ; (Fig. 3) Combine Rules in Lr for each Di ; end {T otal P rocess} Fig. 1. An Algorithm for Total Process
Theorem 2. Let R be a formula in L1.0 (D) such that R = ∨j [ai = vj ]. Then, R and ¬R gives the coarsest partition for ai , whose R includes D.
From the propositions 1 and 2, the next theorem holds. Theorem 3. Let A consist of {a1 , a2 , · · · , an } and Ri be a formula in L1.0 (D) such that Ri = ∨j [ai = vj ]. Then, a sequence of a conjunctive formula F (k) = ∧ki=1 Ri gives a sequence which increases the accuracy.
4
Rule Induction with Grouping
As discussed in Section 2, When the coverage of R for a target concept D is equal to 1.0, R is a necessity condition of D. That is, a proposition D → R holds and its contrapositive ¬R → ¬D holds. Thus, if R is not observed, D cannot be a candidate of a target concept. Thus, if two target concepts have a common formula R whose coverage is equal to 1.0, then ¬R supports the negation of two concepts, which means these two concepts belong to the same group. Furthermore, if two target concepts have similar formulae Ri , Rj ∈ L1.0 (D), they are very close to each other with respect to the negation of two concepts. In this case, the attribute-value pairs in the intersection of L1.0 (Di ) and L1.0 (Dj ) give a characterization set of the concept that unifies Di and Dj , Dk . Then, compared with Dk and other target concepts, classification rules for Dk can be obtained. When we have a sequence of grouping, classification rules for a given target concepts are defined as a sequence of subrules. From these ideas, a rule induction algorithm with grouping target concepts can be described as Figure 1. First, this algorithm calculates a characterization set L1.0 (Di ) for {D1 , D2 , · · · , Dk }. Second, from the list of characterization sets, it calculates the intersection between L1.0 (Di ) and L1.0 (Dj ) and stores it into Lid . Third, the procedure calculates the similarity (matching number)of the intersections and sorts Lid with respect of the similarities. Finally, the algorithm chooses one intersection (Di ∩ Dj ) with maximum similarity (highest matching number) and group Di and Dj into a concept DDi . These procedures will be continued until all the grouping is considered.
Extracting Structure of Medical Diagnosis: Rough Set Approach
83
procedure Grouping ; var inputs Lc : List; /* A list of Characterization Sets */ Lid : List; /* A list of Intersection */ Ls : List; /* A list of Similarity */ var outputs Lgr : List; /* A list of Grouping */ var k : integer; Lg , Lgr : List; begin Lg := {} ; k := n; /* n: A number of Target Concepts*/ Sort Ls with respect to similarities; Take a set of (Di , Dj ), Lmax with maximum similarity values; k:= k+1; forall (Di , Dj ) ∈ Lmax do begin Group Di and Dj into Dk ; Lc := Lc − {(Di , L1.0 (Di )}; Lc := Lc − {(Dj , L1.0 (Dj )}; Lc := Lc + {(Dk , L1.0 (Dk )}; Update Lid for DDk ; Update Ls ; Lgr := ( Grouping for Lc , Lid , and Ls ) ; Lg := Lg + {{(Dk , Di , Dj ), Lg }}; end return Lg ; end {Grouping}
Fig. 2. An Algorithm for Grouping
5
Example
Let us consider the case of Table 1 as an example for rule induction. For a similarity function, we use a matching number[1] which is defined as the cardinality of the intersection of two the sets. Also, since Table 1 has five classes, k = 6. 5.1
Grouping
From this table, the characterization set for each concept is obtained as shown in Fig 4. Then, the intersection between two target concepts are calculated. Since common and classic have the maximum matching number, these two classes are grouped into one category, D6 . Then, teh characterization of D6 is obtained as : D6 = {[loc = lateral], [nat = thr], [jolt = 1], [nau = 1], [M 1 = 0], [M 2 = 0]} from Fig 5. In the second iteration, the intersection of D1 and others is considered as shown in Fig 6. From this matrix, we have two possibilities of grouping: one is to group m.c.h. and i.m.l. That is, these two diseases are grouped into D7 : D7 = {([loc = occular] ∨ [loc = whole]), [nat = per], [prod = 0]} The other one is to group D1 and i.m.l., where D7 = {[jolt = 1], [M 1 = 0], [M 2 = 0]}.
84
S. Tsumoto
procedure RuleInduction ; var inputs Lc : List; /* A list of Characterization Sets */ Lid : List; /* A list of Intersection */ Lg : List; /* A list of grouping*/ /* {{(Dn+1 ,Di ,Dj ),{(DDn+2 ,.)...}}} */ /* n: A number of Target Concepts */ var Q, Lr : List; begin Q := Lg ; Lr := {}; if (Q = ∅) then do begin Q := Q − f irst(Q); Lr := Rule Induction (Lc , Lid , Q); end (DDk , Di , Dj ) := f irst(Q); if (Di ∈ Lc and Dj ∈ Lc ) then do begin Induce a Rule r which discriminate between Di and Dj ; r = {Ri → Di , Rj → Dj }; end else do begin Search for L1.0 (Di ) from Lc ; Search for L1.0 (Dj ) from Lc ; if (i < j) then do begin r(Di ) := ∨Rl ∈L1.0 (Dj ) ¬Rl → ¬Dj ; r(Dj ) := ∧Rl ∈L1.0 (Dj ) Rl → Dj ; end r := {r(Di ), r(Dj )}; end return Lr := {r, Lr } ; end {Rule Induction}
Fig. 3. An Algorithm for Rule Induction
In the third iteration of the former case(3a ), the intersection is calculated as Fig 7 and D2 and psycho are grouped into D3 : D3a = { [nat=per], [prod=0] } In the latter case(3b ), it is calculated as Fig 8 and m.c.h. and psycho are grouped into D8 : D8a = { [nat=per], [prod=0] }. Fig 9 and 10 depicts the two results of grouping like a dendrogram in clustering analysis[1]. 5.2
Rule Induction
First Model for Diagnosis. Figure 9 shows one candidate of the differential diagnosis. For the differential diagnosis of common. First, this model discriminate between D6 (common and classic) and D8 (m.c.h., i.m.l. and psycho). Then, common and classic within D6 are differentiated. Thus, a classification rule for common is composed of two subrules: (discrimination between D6 and D8 ) and
Extracting Structure of Medical Diagnosis: Rough Set Approach
85
Table 1. A small example of a database No. loc nat his prod jolt nau M1 M2 class 1 occular per per 0 0 0 1 1 m.c.h. 2 whole per per 0 0 0 1 1 m.c.h. 3 lateral thr par 0 1 1 0 0 common. 4 lateral thr par 1 1 1 0 0 classic. 5 occular per per 0 0 0 1 1 psycho. 6 occular per subacute 0 1 1 0 0 i.m.l. 7 occular per acute 0 1 1 0 0 psycho. 8 whole per chronic 0 0 0 0 0 i.m.l. 9 lateral thr per 0 1 1 0 0 common. 10 whole per per 0 0 0 1 1 m.c.h. Definition. loc: location, nat: nature, his:history, Definition. prod: prodrome, nau: nausea, jolt: Jolt headache, M1, M2: tenderness of M1 and M2, 1: Yes, 0: No, per: persistent, thr: throbbing, par: paroxysmal, m.c.h.: muscle contraction headache, psycho.: psychogenic pain, i.m.l.: intracranial mass lesion, common.: common migraine, and classic.: classical migraine. {([loc = occular] ∨ [loc = whole]), [nat = per], [his = per], [prod = 0], [jolt = 0], [nau = 0], [M 1 = 1], [M 2 = 1]} L1.0 (common) = {[loc = lateral], [nat = thr], ([his = per] ∨ [his = par]), [prod = 0], [jolt = 1], [nau = 1], [M 1 = 0], [M 2 = 0]} L1.0 (classic) = {[loc = lateral], [nat = thr], [his = par], [prod = 1], [jolt = 1], [nau = 1], [M 1 = 0], [M 2 = 0]} L1.0 (i.m.l.) = {([loc = occular] ∨ [loc = whole]), [nat = per], ([his = subacute] ∨[his = chronic]), [prod = 0], [jolt = 1], [M 1 = 0], [M 2 = 0]} L1.0 (psycho) = {[loc = occular], [nat = per], ([his = per] ∨ [his = acute]), [prod = 0]} L1.0 (m.c.h.) =
Fig. 4. Characterization Sets for Table 1 m.c.h.
m.c.h. common − {[prod=0]}
common
−
−
classic i.m.l.
− −
− −
classic ∅
i.m.l. psycho {([loc=occular]∨[loc=whole]), {[nat=per],[prod=0]} {[nat=per],[prod=0]} {[loc=lateral], [nat=thr],[jolt=1], {[prod=0],[jolt=1], {[prod=0]} [nau=1], [M1=0],[M2=0]} [M1=0], [M2=0] } − {[jolt=1],[M1=0],[M2=0]} { } − − {[nat=per],[prod=0]}
Fig. 5. Intersection of Two Characterization Sets (Step 2) m.c.h. D6 i.m.l. psycho − {} {([loc=occular]∨[loc=whole]), {[nat=per],[prod=0]} {[nat=per],[prod=0]} D6 − − {[jolt=1], [M1=0], [M2=0]} {} i.m.l. − − − {[nat=per],[prod=0]}
m.c.h.
Fig. 6. Intersection of Two Characterization Sets after the first Grouping (Step 3)
86
S. Tsumoto D6 D7 psycho D6 − {} {} D7 − − {[nat=per],[prod=0]}
Fig. 7. Intersection of Two Characterization Sets after the first Grouping (1) (Step 4a) m.c.h. D7
m.c.h. D7 psycho − {} {[nat=per],[prod=0] } − {} {}
Fig. 8. Intersection of Two Characterization Sets after the first Grouping (2) (Step 4b)
(discrimination within D6 ). On the other hand, a classification rule for m.c.h. is composed of three subrules: (discrimination between D6 and D8 ), (discrimination between D7 and psycho) and (discrimination within D7 ). Let us consider the first case. The first part can be obtained by the intersection in Figure 7. That is, D8 → [nat = per] ∧ [prod = 0] ¬[nat = per] ∨ ¬[prod = 0] → ¬D8 . Then, since from Figure 4, the difference set between characterization sets of common and classic is {[prod = 1]}, for a classification rule for common within D7 is: [prod = 0] → common. Combining these two parts, the classification rule for common is (¬[nat = per] ∨ ¬[prod = 0]) ∧ [prod = 0] → common. After its simplification, the rule is: ¬[nat = per] → ¬common, whose accuracy is equal to 2/3. In the same way, the rule for classic is obtained as: ¬[nat = per] ∧ [prod = 1] → classic. common classic m.c.h. i.m.l. psycho
Fig. 9. Grouping by Characterization Sets (1)
Extracting Structure of Medical Diagnosis: Rough Set Approach
87
common classic i.m.l. m.c.h.
psycho
Fig. 10. Grouping by Characterization Sets (2)
Second Model for Diagnosis. Figure 10 shows the other candidate of the differential diagnosis. For differential diagnosis, First, this model discriminate between D7 (common, classic and i.m.l.) and D8 (m.c.h. and psycho). Then, D6 and i.m.l. within D7 are differentiated. Finally, common and classic within D7 are checked. Thus, a classification rule for common is composed of two subrules: (discrimination between D7 and D8 ), (discrimination between D6 and D7 ), and (discrimination within D6 ). The first part can be obtained by the intersection in Figure 7. That is, D8 → [nat = per] ∧ [prod = 0] ¬[nat = per] ∨ ¬[prod = 0] → ¬D8 . Then, the second part can be obtained by the intersection in Figure 6. That is, D7 → [jolt = 1] ∧ [M 1 = 0] ∧ [M 2 = 0] ¬[jolt = 1] ∨ ¬[M 1 = 0] → ¬D7 . Finally, the third part can be obtained by the difference set for common and classic: {[prod = 1]}. [prod = 0] → common. Combining these three parts, the classification rule for common is (¬[nat = per] ∨ ¬[prod = 0]) ∧ ([jolt = 1] ∧ [M 1 = 0] ∧ [M 2 = 0]) ∧ [prod = 0] → common. After its simplification, the rule is: (¬[nat = per]) ∧ ([jolt = 1] ∧ [M 1 = 0] ∧ [M 2 = 0]) → common. whose accuracy is equal to 2/3. It is notable that the second part ([jolt = 1] ∧ [M 1 = 0] ∧ [M 2 = 0]) is redundant in this case, compared with the first model. However, from the viewpoint of characterization of a target concept, it is very important part.
88
6
S. Tsumoto
Conclusion
In this paper, the characteristics of experts’ rules are closely examined, whose empirical results suggest that grouping of diseases ais very important to realize automated acquisition of medical knowledge from clinical databases. Thus, we focus on the role of coverage in focusing mechanisms and propose an algorithm for grouping of diseases by using this measure. The above example shows that rule induction with this grouping generates rules, which are similar to medical experts’ rules and they suggest that our proposed method should capture medical experts’ reasoning. This research is a preliminary study on a rule induction method with grouping and it will be a basis for a future work to compare the proposed method with other rule induction methods by using real-world datasets. Acknowledgments. This work was supported by the Grant-in-Aid for Scientific Research (13131208) on Priority Areas (No.759) “Implementation of Active Mining in the Era of Information Flood” by the Ministry of Education, Science, Culture, Sports, Science and Technology of Japan.
References 1. Everitt, B. S., Cluster Analysis, 3rd Edition, John Wiley & Son, London, 1996. 2. Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N., The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains, in Proceedings of the fifth National Conference on Artificial Intelligence, 1041–1045, AAAI Press, Menlo Park, 1986. 3. Pawlak, Z., Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991. 4. Polkowski, L. and Skowron, A.: Rough mereology: a new paradigm for approximate reasoning. Intern. J. Approx. Reasoning 15, 333–365, 1996. 5. Quinlan, J.R., C4.5 – Programs for Machine Learning, Morgan Kaufmann, Palo Alto, 1993. 6. Readings in Machine Learning, (Shavlik, J. W. and Dietterich, T.G., eds.) Morgan Kaufmann, Palo Alto, 1990. 7. Skowron, A. and Grzymala-Busse, J. From rough set theory to evidence theory. In: Yager, R., Fedrizzi, M. and Kacprzyk, J.(eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236, John Wiley & Sons, New York, 1994. 8. Tsumoto, S., Automated Induction of Medical Expert System Rules from Clinical Databases based on Rough Set Theory. Information Sciences 112, 67–84, 1998. 9. Tsumoto, S. Extraction of Experts’ Decision Rules from Clinical Databases using Rough Set Model Intelligent Data Analysis, 2(3), 1998. 10. Zadeh, L.A., Toward a theory of fuzzy information granulation and its certainty in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, 111–127, 1997.
A Kind of Linearization Method in Fuzzy Control System Modeling Hongxing Li , Jiayin Wang, and Zhihong Miao Department of Mathematics, Beijing Normal University, Beijing 100875, China
Abstract. A kind of linearization method in fuzzy control system modeling is proposed, in order to deal with the nonlinear model with variable coefficients. The method can turn a nonlinear model with variable coefficients into a linear model with variable coefficients in the way that the membership functions of the fuzzy sets in fuzzy partitions of the universes are changed from triangle waves into rectangle waves. However, the linearization models are incomplete in their forms because of their lacking some items. For solving this problem, joint approximation by using linear models is introduced. The simulation results show that marginal linearization models are of higher approximation precision than their original nonlinear models.
1
Introduction
A kind of modeling method based on fuzzy inference (MMFI) on the plants of control systems is proposed in [1]. The key way is that for a control system a fuzzy logic system describing the plant is obtained by acting fuzzy inference mechanism on the plant, then the fuzzy logic system is turned into a kind of nonlinear differential equation or a system of nonlinear differential equations with variable coefficients[1] based on the interpolation mechanism[2−10] of fuzzy logic systems. By the way, the differential equation is just linear when its order is one. As pointed out in [1], this has actually solved the problem for the plant in a fuzzy control system to be modelled. From the results in [1], we can learn that for a fuzzy system, when its input variables are more than two, the model as obtained by the method in [1] is all of same form nonlinear differential equation or a system of nonlinear differential equations with variable coefficients, called HX equation[1] . Here “nonlinear” may not bring too much trouble for solving such kind of equation. In fact, we can easily find out the solutions of the equations and draw the solution curves and phase plane curves under given initial values by using Matlab5.3 (see the simulation experiments in [1]). However this kind of “nonlinear” may make it difficult to consider some qualitative and quantitative analysis to a system, such as the stability, controllability, and observability. To
Supported by the National Natural Science Foundation of China (Grant No. 60174013) and the Research Fund for Doctoral Program of Higher Education (Grant No. 20020027013) To whom correspondence should be addressed. [email protected]
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 89–98, 2003. c Springer-Verlag Berlin Heidelberg 2003
90
H. Li, J. Wang, and Z. Miao
solve this problem, in this paper, we propose a marginal linearization method aiming at approximately turning HX equations into some kind of linear differential equations or systems of linear differential equations with variable coefficients.
2 The Input-Output Models of Fuzzy Control Systems

First of all, we introduce some useful concepts and notation, taking second order systems as examples. Let $Y=[a_1,b_1]$, $\dot Y=[a_2,b_2]$ and $\ddot Y=[a_3,b_3]$ respectively be the universes of $y(t)$, $\dot y(t)$ and $\ddot y(t)$. Suppose $A=\{A_i\}_{1\le i\le p}$, $B=\{B_j\}_{1\le j\le q}$ and $C=\{C_{ij}\}_{1\le i\le p,\,1\le j\le q}$ to be the fuzzy partitions [1-7] of the corresponding universes $Y$, $\dot Y$ and $\ddot Y$ (i.e. groups of base elements), where $y_i$, $\dot y_j$ and $\ddot y_{ij}$ are respectively the peak points [1-6] of $A_i$, $B_j$ and $C_{ij}$, with the condition $a_1 \le y_1 < y_2 < \cdots < y_p \le b_1$ and $a_2 \le \dot y_1 < \dot y_2 < \cdots < \dot y_q \le b_2$. $A$, $B$ and $C$ can be regarded as linguistic variables, so that a group of fuzzy inference rules is formed as follows:

If $y(t)$ is $A_i$ and $\dot y(t)$ is $B_j$ then $\ddot y(t)$ is $C_{ij}$,   (1)
where $i=1,2,\cdots,p$, $j=1,2,\cdots,q$. By using the conclusions in [2], the fuzzy logic system based on (1) can be represented as a binary piecewise interpolation function:

$$\ddot y(t) = F(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q} A_i(y(t))\, B_j(\dot y(t))\, \ddot y_{ij}. \qquad (2)$$
Usually, $A_i$, $B_j$ and $C_{ij}$ are taken as "triangle wave" membership functions:

$$A_i(y(t)) = \begin{cases} \dfrac{y(t)-y_{i-1}}{y_i-y_{i-1}}, & y_{i-1}\le y(t)\le y_i,\\[4pt] \dfrac{y(t)-y_{i+1}}{y_i-y_{i+1}}, & y_i\le y(t)\le y_{i+1},\\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (3)$$

$$B_j(\dot y(t)) = \begin{cases} \dfrac{\dot y(t)-\dot y_{j-1}}{\dot y_j-\dot y_{j-1}}, & \dot y_{j-1}\le \dot y(t)\le \dot y_j,\\[4pt] \dfrac{\dot y(t)-\dot y_{j+1}}{\dot y_j-\dot y_{j+1}}, & \dot y_j\le \dot y(t)\le \dot y_{j+1},\\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $i=1,2,\cdots,p$ and we stipulate that $y_0=y_1$ and $y_{p+1}=y_p$; $j=1,2,\cdots,q$ and we also stipulate that $\dot y_0=\dot y_1$ and $\dot y_{q+1}=\dot y_q$. Noticing that (2) depends only on the peak points of $C_{ij}$, we do not need to consider the membership functions of $C_{ij}$.

Theorem 1. Under the above assumptions, the input-output model of the free motion of the second order system based on (1) can be represented as a nonlinear differential equation with variable coefficients as follows (see [1]):

$$\ddot y(t) = F(y(t),\dot y(t)) = a(y(t),\dot y(t))\,y(t) + b(y(t),\dot y(t))\,\dot y(t) + c(y(t),\dot y(t))\,y(t)\dot y(t) + d(y(t),\dot y(t)), \qquad (5)$$
where the variable coefficients $a(y(t),\dot y(t)),\cdots,d(y(t),\dot y(t))$ depend on the time-space structure with the conditions

$$a(y(t),\dot y(t))=\sum_{i=1}^{p-1}\sum_{j=1}^{q-1} a^{(i,j)}, \quad b(y(t),\dot y(t))=\sum_{i=1}^{p-1}\sum_{j=1}^{q-1} b^{(i,j)}, \quad c(y(t),\dot y(t))=\sum_{i=1}^{p-1}\sum_{j=1}^{q-1} c^{(i,j)}, \quad d(y(t),\dot y(t))=\sum_{i=1}^{p-1}\sum_{j=1}^{q-1} d^{(i,j)},$$

where $a^{(i,j)},\cdots,d^{(i,j)}$ are the local coefficients on the $(i,j)$-th piece, defined as follows: when $(y(t),\dot y(t))\notin [y_i,y_{i+1}]\times[\dot y_j,\dot y_{j+1}]$, $a^{(i,j)}=b^{(i,j)}=c^{(i,j)}=d^{(i,j)}=0$; and when $(y(t),\dot y(t))\in [y_i,y_{i+1}]\times[\dot y_j,\dot y_{j+1}]$, they are respectively defined by

$$a^{(i,j)}=\frac{\dot y_j(\ddot y_{i,j+1}-\ddot y_{i+1,j+1})+\dot y_{j+1}(\ddot y_{i+1,j}-\ddot y_{i,j})}{(y_i-y_{i+1})(\dot y_j-\dot y_{j+1})}, \qquad (6)$$

$$b^{(i,j)}=\frac{y_i(\ddot y_{i+1,j}-\ddot y_{i+1,j+1})+y_{i+1}(\ddot y_{i,j+1}-\ddot y_{i,j})}{(y_i-y_{i+1})(\dot y_j-\dot y_{j+1})}, \qquad (7)$$

$$c^{(i,j)}=\frac{\ddot y_{i,j}-\ddot y_{i,j+1}-\ddot y_{i+1,j}+\ddot y_{i+1,j+1}}{(y_i-y_{i+1})(\dot y_j-\dot y_{j+1})}, \qquad (8)$$

$$d^{(i,j)}=\frac{y_{i+1}(\dot y_{j+1}\ddot y_{i,j}-\dot y_j\ddot y_{i,j+1})}{(y_i-y_{i+1})(\dot y_j-\dot y_{j+1})}+\frac{y_i(\dot y_j\ddot y_{i+1,j+1}-\dot y_{j+1}\ddot y_{i+1,j})}{(y_i-y_{i+1})(\dot y_j-\dot y_{j+1})}. \qquad (9)$$
Note 1. The nonlinear differential equation with variable coefficients (5) is called a (second order) HX equation. When $(y(t),\dot y(t))\in [y_i,y_{i+1}]\times[\dot y_j,\dot y_{j+1}]$ (i.e. on the $(i,j)$-th piece), the HX equation degenerates into a local HX equation:

$$\ddot y(t) = a^{(i,j)} y(t) + b^{(i,j)} \dot y(t) + c^{(i,j)} y(t)\dot y(t) + d^{(i,j)}, \qquad (10)$$

which is a nonlinear differential equation with constant coefficients. This means that an HX equation is formed by $(p-1)\times(q-1)$ local HX equations. So, in order to solve an HX equation, we should solve every local HX equation piecewise.
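To make the piecewise structure of (5)-(10) concrete, the following Python sketch (illustrative only, not from [1]; all function names are ours) computes the local coefficients (6)-(9) from the grids of peak points and evaluates the right-hand side of the corresponding local HX equation:

```python
import numpy as np

def local_hx_coeffs(y, ydot, yddot, i, j):
    """Coefficients (6)-(9) of the local HX equation on piece (i, j).

    y, ydot : 1-D arrays of peak points y_i and ydot_j.
    yddot   : 2-D array with yddot[i, j] the peak point of C_ij.
    """
    denom = (y[i] - y[i + 1]) * (ydot[j] - ydot[j + 1])
    a = (ydot[j] * (yddot[i, j + 1] - yddot[i + 1, j + 1])
         + ydot[j + 1] * (yddot[i + 1, j] - yddot[i, j])) / denom
    b = (y[i] * (yddot[i + 1, j] - yddot[i + 1, j + 1])
         + y[i + 1] * (yddot[i, j + 1] - yddot[i, j])) / denom
    c = (yddot[i, j] - yddot[i, j + 1]
         - yddot[i + 1, j] + yddot[i + 1, j + 1]) / denom
    d = (y[i + 1] * (ydot[j + 1] * yddot[i, j] - ydot[j] * yddot[i, j + 1])
         + y[i] * (ydot[j] * yddot[i + 1, j + 1] - ydot[j + 1] * yddot[i + 1, j])) / denom
    return a, b, c, d

def hx_rhs(yv, ydv, y, ydot, yddot):
    """F(y, ydot) of (5): locate the piece (i, j) and apply the local model (10)."""
    i = min(max(np.searchsorted(y, yv) - 1, 0), len(y) - 2)
    j = min(max(np.searchsorted(ydot, ydv) - 1, 0), len(ydot) - 2)
    a, b, c, d = local_hx_coeffs(y, ydot, yddot, i, j)
    return a * yv + b * ydv + c * yv * ydv + d
```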
3 Marginal Linearization Method on Input-Output Models
In order to obviate the nonlinearity of the previous model, we propose here a method called marginal linearization, which approximately transforms the nonlinear equation of the form (5) into a kind of linear equation. To this end, the membership functions of $A_i$ are first changed from "triangle waves" to "rectangle waves":

$$A_i(y(t)) = \begin{cases} 1, & y_{i-\frac12}\le y(t) < y_{i+\frac12},\\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$

where $i=1,2,\cdots,p$, and we stipulate that $y_{1-\frac12}=y_1$ and $y_{p+\frac12}=y_p$.
In the sense of signal processing, this change amounts to turning a series of triangle waves into a series of rectangle waves. In the set-theoretic sense, it is equivalent to replacing the membership functions of fuzzy sets by the characteristic functions of crisp sets. From the viewpoint of interpolation, some linear (first degree) interpolation base functions are changed into zero-degree interpolation base functions. In whichever sense, the essence is that a simplification is being made.

Theorem 2. Under the previous assumptions and the condition (11), the input-output model of the second order system based on (1) can be represented as a second order differential equation with variable coefficients:

$$\ddot y(t) + P_1(y(t),\dot y(t))\,\dot y(t) = Q_1(y(t),\dot y(t)). \qquad (12)$$
Proof. When $(y(t),\dot y(t))\in [y_{i-\frac12},y_{i+\frac12}]\times[\dot y_j,\dot y_{j+1}]$, considering the structures of $A_i$ and $B_j$, we have

$$\ddot y(t) = F(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q} A_i(y(t))B_j(\dot y(t))\ddot y_{ij} = B_j(\dot y(t))\ddot y_{ij} + B_{j+1}(\dot y(t))\ddot y_{i,j+1}$$
$$= \frac{\dot y(t)-\dot y_{j+1}}{\dot y_j-\dot y_{j+1}}\ddot y_{ij} + \frac{\dot y(t)-\dot y_j}{\dot y_{j+1}-\dot y_j}\ddot y_{i,j+1} = -\frac{\ddot y_{i,j+1}-\ddot y_{ij}}{\dot y_j-\dot y_{j+1}}\dot y(t) + \frac{\dot y_j\ddot y_{i,j+1}-\dot y_{j+1}\ddot y_{ij}}{\dot y_j-\dot y_{j+1}}. \qquad (13)$$

We define local coefficients $P_1^{(i,j)}$ and $Q_1^{(i,j)}$ as follows:

$$P_1^{(i,j)} = \begin{cases} \dfrac{\ddot y_{i,j+1}-\ddot y_{ij}}{\dot y_j-\dot y_{j+1}}, & (y(t),\dot y(t))\in [y_{i-\frac12},y_{i+\frac12}]\times[\dot y_j,\dot y_{j+1}],\\ 0, & \text{otherwise}, \end{cases} \qquad (14)$$

$$Q_1^{(i,j)} = \begin{cases} \dfrac{\dot y_j\ddot y_{i,j+1}-\dot y_{j+1}\ddot y_{ij}}{\dot y_j-\dot y_{j+1}}, & (y(t),\dot y(t))\in [y_{i-\frac12},y_{i+\frac12}]\times[\dot y_j,\dot y_{j+1}],\\ 0, & \text{otherwise}. \end{cases} \qquad (15)$$

Thus, (13) can be written as $\ddot y(t) = -P_1^{(i,j)}\dot y(t) + Q_1^{(i,j)}$, which means that we get a differential equation with constant coefficients on the local $(i,j)$-th piece:

$$\ddot y(t) + P_1^{(i,j)}\dot y(t) = Q_1^{(i,j)}. \qquad (16)$$
Now we respectively take

$$P_1(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q-1} P_1^{(i,j)} \quad\text{and}\quad Q_1(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q-1} Q_1^{(i,j)};$$

then on the whole, i.e. for all $(y(t),\dot y(t))\in Y\times\dot Y$, we have

$$\ddot y(t) = F(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q-1}\left(-P_1^{(i,j)}\dot y(t) + Q_1^{(i,j)}\right) = -P_1(y(t),\dot y(t))\,\dot y(t) + Q_1(y(t),\dot y(t)).$$

This is (12), which we wanted to prove.
In the preceding derivation, the nonlinear equation (5) was transformed into the linear equation (12) by changing the shape of the membership functions of the fuzzy sets on the "edge" $Y$. In the same way, we can get another kind of linear equation by changing the fuzzy sets on the edge $\dot Y$. For that, the membership functions of $B_j$ are changed into rectangle waves:

$$B_j(\dot y(t)) = \begin{cases} 1, & \dot y_{j-\frac12}\le \dot y(t) < \dot y_{j+\frac12},\\ 0, & \text{otherwise}, \end{cases} \qquad (17)$$

where $j=1,2,\cdots,q$ and we stipulate that $\dot y_{1-\frac12}=\dot y_1$ and $\dot y_{q+\frac12}=\dot y_q$.

Theorem 3. Under the previous assumptions and the condition (17), the input-output model of the second order system based on (1) can be represented as a second order differential equation with variable coefficients:

$$\ddot y(t) + P_2(y(t),\dot y(t))\,y(t) = Q_2(y(t),\dot y(t)), \qquad (18)$$

where

$$P_2(y(t),\dot y(t)) = \sum_{i=1}^{p-1}\sum_{j=1}^{q} P_2^{(i,j)} \quad\text{and}\quad Q_2(y(t),\dot y(t)) = \sum_{i=1}^{p-1}\sum_{j=1}^{q} Q_2^{(i,j)},$$

in which $P_2^{(i,j)}$ and $Q_2^{(i,j)}$ are the local coefficients on the $(i,j)$-th piece, defined by

$$P_2^{(i,j)} = \begin{cases} \dfrac{\ddot y_{i+1,j}-\ddot y_{ij}}{y_i-y_{i+1}}, & (y(t),\dot y(t))\in [y_i,y_{i+1}]\times[\dot y_{j-\frac12},\dot y_{j+\frac12}],\\ 0, & \text{otherwise}, \end{cases} \qquad (19)$$

$$Q_2^{(i,j)} = \begin{cases} \dfrac{y_i\ddot y_{i+1,j}-y_{i+1}\ddot y_{ij}}{y_i-y_{i+1}}, & (y(t),\dot y(t))\in [y_i,y_{i+1}]\times[\dot y_{j-\frac12},\dot y_{j+\frac12}],\\ 0, & \text{otherwise}. \end{cases} \qquad (20)$$
The proof of the theorem is omitted, for it is the same as the one for Theorem 2.

Note 2. On the local $(i,j)$-th piece, (18) degenerates into a second order differential equation with constant coefficients:

$$\ddot y(t) + P_2^{(i,j)} y(t) = Q_2^{(i,j)}. \qquad (21)$$

Therefore, in order to process (18), we only need to consider (21) piecewise.
Therefore, in order to process (18), we only need to consider (21) piecewisely. Note 3. Equations (12) and (18) are all with “lacking term” phenomenon with respect to so-called a standard linear differential equation, i.e. the former lacks the term y(t) and the latter lacks the term y(t). ˙ Especially for the local equations (16) and (21), this lacking term phenomenon may easily cause us some misconception. For example, for the stability of a system, we may regard the
linear systems (16) and (21) as unstable. However, this is generally wrong. In fact, the Routh criterion applies only to a whole linear system, not to some local parts of it. Thus we cannot directly apply the Routh criterion to the linear systems (16) and (21), which are both local. If a local system such as (16), (21) or (10) is viewed as a whole system, then it is easy to see that it is generally stable, because its variables are bounded. It is noteworthy that there is no obvious necessary relationship between a whole system and all of its parts.
4 The Joint Equation on Edge Linearization Model
The "joint equation" here means bringing (12) together with (18) so as to form a whole equation by some method. We have this idea for the following reasons. (i) Equations (12) and (18) respectively represent the approximate linear models obtained by edge linearization of the nonlinear model (5). These two linear models each emphasize particular aspects, which creates a certain imbalance between $y(t)$ and $\dot y(t)$. Clearly, this kind of imbalance can be compensated if the two linear models are joined. (ii) Just as stated in Note 3 in Sect. 3, (12) and (18) both exhibit the "lacking term" phenomenon. When the inference rules degenerate into only one rule, i.e. $p=q=1$, the local equations extend to a whole equation. Then this lacking term phenomenon makes the Routh criterion applicable, and we obtain the conclusion that the system is unstable.

A jointing method that readily comes to mind is the mean value superposition of models (12) and (18), i.e.

$$\ddot y(t) + \frac12 P_1(y(t),\dot y(t))\,\dot y(t) + \frac12 P_2(y(t),\dot y(t))\,y(t) = \frac12\left[Q_1(y(t),\dot y(t)) + Q_2(y(t),\dot y(t))\right]. \qquad (22)$$

Denoting $b_1(y(t),\dot y(t)) = \frac12 P_1(y(t),\dot y(t))$, $b_2(y(t),\dot y(t)) = \frac12 P_2(y(t),\dot y(t))$ and

$$b_*(y(t),\dot y(t)) = \frac12\left[Q_1(y(t),\dot y(t)) + Q_2(y(t),\dot y(t))\right],$$

(22) can be written as

$$\ddot y(t) + b_1(y(t),\dot y(t))\,\dot y(t) + b_2(y(t),\dot y(t))\,y(t) = b_*(y(t),\dot y(t)). \qquad (23)$$
Now we consider the local expressions of the variable coefficients $b_1(y(t),\dot y(t))$, $b_2(y(t),\dot y(t))$ and $b_*(y(t),\dot y(t))$. First of all, we formally assume that

$$b_n(y(t),\dot y(t)) = \sum_{i=1}^{p}\sum_{j=1}^{q} b_n^{(i,j)}, \quad (n=1,2,*), \qquad (24)$$
where $b_n^{(i,j)}$ is the local coefficient on the $(i,j)$-th piece. Note that here the $(i,j)$-th piece means that $(y(t),\dot y(t))\in [y_{i-\frac12},y_{i+\frac12}]\times[\dot y_{j-\frac12},\dot y_{j+\frac12}]$. Obviously the local equation of (23) is represented as follows:

$$\ddot y(t) + b_1^{(i,j)}\dot y(t) + b_2^{(i,j)} y(t) = b_*^{(i,j)}. \qquad (25)$$
Because the domains of definition of the local equations (12) and (18) are different, and in order to give the coefficients of (25) uniform and clear expressions, the local domain of definition $[y_{i-\frac12},y_{i+\frac12}]\times[\dot y_{j-\frac12},\dot y_{j+\frac12}]$ of (25) must be partitioned into four parts again, i.e. the $(i,j)$-th piece is partitioned into four smaller pieces, denoted respectively by $(i,j)_1$, $(i,j)_2$, $(i,j)_3$ and $(i,j)_4$:

$(i,j)_1$: $[y_{i-\frac12},y_i]\times[\dot y_{j-\frac12},\dot y_j]$,  $(i,j)_2$: $[y_{i-\frac12},y_i]\times[\dot y_j,\dot y_{j+\frac12}]$,
$(i,j)_3$: $[y_i,y_{i+\frac12}]\times[\dot y_{j-\frac12},\dot y_j]$,  $(i,j)_4$: $[y_i,y_{i+\frac12}]\times[\dot y_j,\dot y_{j+\frac12}]$.

In this way, the local equation is divided into four local sub-equations as follows:

$$\ddot y(t) + b_1^{(i,j)_m}\dot y(t) + b_2^{(i,j)_m} y(t) = b_*^{(i,j)_m}, \qquad (26)$$

where $m=1,2,3,4$, and the relationship between the local coefficients and their local sub-coefficients is

$$b_n^{(i,j)} = \sum_{m=1}^{4} b_n^{(i,j)_m}, \quad (n=1,2,*). \qquad (27)$$
By a lengthy deduction, the expressions of the local sub-coefficients $b_n^{(i,j)_m}$ are the following: when $(y(t),\dot y(t))$ is not on the sub-piece $(i,j)_m$, $b_n^{(i,j)_m}=0$ $(n=1,2,*,\ m=1,2,3,4)$; when $(y(t),\dot y(t))$ is on the sub-piece $(i,j)_m$, the $b_n^{(i,j)_m}$ are determined by the following expressions:

$$b_1^{(i,j)_1} = b_1^{(i,j)_3} = \tfrac12 P_1^{(i,j-1)}, \qquad b_1^{(i,j)_2} = b_1^{(i,j)_4} = \tfrac12 P_1^{(i,j)},$$
$$b_2^{(i,j)_1} = b_2^{(i,j)_2} = \tfrac12 P_2^{(i-1,j)}, \qquad b_2^{(i,j)_3} = b_2^{(i,j)_4} = \tfrac12 P_2^{(i,j)},$$
$$b_*^{(i,j)_1} = \tfrac12\bigl(Q_1^{(i,j-1)} + Q_2^{(i-1,j)}\bigr), \qquad b_*^{(i,j)_2} = \tfrac12\bigl(Q_1^{(i,j)} + Q_2^{(i-1,j)}\bigr),$$
$$b_*^{(i,j)_3} = \tfrac12\bigl(Q_1^{(i,j-1)} + Q_2^{(i,j)}\bigr), \qquad b_*^{(i,j)_4} = \tfrac12\bigl(Q_1^{(i,j)} + Q_2^{(i,j)}\bigr).$$
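These sub-coefficients can be assembled mechanically. The sketch below (illustrative; it assumes interior pieces where the indices i-1 and j-1 exist, with the boundary stipulations handled separately) returns the coefficient triple of (26) on each of the four sub-pieces:

```python
def joint_subcoeffs(P1, P2, Q1, Q2, i, j):
    """Sub-coefficients of (26) on the four sub-pieces of piece (i, j).

    P1[i][j], Q1[i][j] and P2[i][j], Q2[i][j] are the local coefficients of
    (14)-(15) and (19)-(20); returns {m: (b1, b2, b_star)} for m = 1..4.
    """
    return {
        1: (0.5 * P1[i][j - 1], 0.5 * P2[i - 1][j], 0.5 * (Q1[i][j - 1] + Q2[i - 1][j])),
        2: (0.5 * P1[i][j],     0.5 * P2[i - 1][j], 0.5 * (Q1[i][j]     + Q2[i - 1][j])),
        3: (0.5 * P1[i][j - 1], 0.5 * P2[i][j],     0.5 * (Q1[i][j - 1] + Q2[i][j])),
        4: (0.5 * P1[i][j],     0.5 * P2[i][j],     0.5 * (Q1[i][j]     + Q2[i][j])),
    }
```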
5 Simulation of Edge Linearization Method on Input-Output Models
Given a system, we regard, for example, the Van der Pol equation as the real model of the system:

$$\ddot y(t) - \mu(1-y^2(t))\,\dot y(t) + y(t) = 0, \qquad (28)$$
where we put $\mu = 1$. The aim and operating steps of the simulation were introduced in [1]. Here we only give the simulations of three kinds of edge linearization methods. We take $T = 20$ as the simulation time and assume that $y(0)=2$ and $\dot y(0)=0$.

Example 1. The simulation results for the edge linearization model (12). Case 1. Taking $p=7$ and $q=8$, the simulation results are shown in Fig. 1. We can see that the simulation curves of the edge linearization model (12) approximate the curves of the real model well.
Fig. 1. The simulation curves of the linearization model (12) when p = 7 and q = 8. Curves 1 and 2 respectively represent the simulation curves of the linearization model and the real model. (a) and (b) respectively show the simulation results of $y$ and $\dot y$, and (c) is the simulation result of the phase plane.
Case 2. When $p$ and $q$ are doubled, i.e. $p=14$ and $q=16$, the simulation curves of the edge linearization model (12) approximate the curves of the real model very well; in other words, the simulation curves almost coincide with the curves of the real model. We omit these simulation figures for reasons of space.

Example 2. The simulation results for the edge linearization model (18). Case 1. Taking $p=7$ and $q=8$, the simulation results are shown in Fig. 2. Clearly, the edge linearization model (18) does not approximate the real model as well as model (12) does, because of the lacking term phenomenon, i.e. the missing term $\dot y$ in model (18). Case 2. When $p$ and $q$ are doubled, i.e. $p=14$ and $q=16$, the simulation curves of the edge linearization model (18) approximate the curves of the real model very well, almost coinciding with them. We omit these simulation figures for reasons of space.

Example 3. The simulation results for the joint model (23) of the edge linearization models. Case 1. Taking $p=7$ and $q=8$, the simulation results are shown in Fig. 3. Case 2. In order to improve the approximation of the joint model (23) to the real model, $p$ and $q$ are increased to $p=12$ and $q=14$. The simulation curves of the joint model (23) then approximate the curves of the real model very well, almost coinciding with them. These simulation figures are omitted for reasons of space.
Fig. 2. The simulation curves of the linearization model (18) when p = 7 and q = 8.
Fig. 3. The simulation curves of the joint model (23) when p = 7 and q = 8
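For readers wishing to reproduce the comparison, the following sketch outlines one way to do it (a sketch under stated assumptions: the paper used Matlab 5.3, here scipy is substituted; the universes [-2.5, 2.5] and [-3, 3] and the uniform peak-point grids are our own choices; lin_rhs implements the edge-linearized model (12), with the peak points $\ddot y_{ij}$ sampled from the plant (28)):

```python
import numpy as np
from scipy.integrate import solve_ivp

mu = 1.0
f = lambda y, yd: mu * (1 - y**2) * yd - y        # real model (28)

# assumed universes and uniform peak-point grids (p = 7, q = 8)
p, q = 7, 8
ypk  = np.linspace(-2.5, 2.5, p)                  # peak points y_i on Y
ydpk = np.linspace(-3.0, 3.0, q)                  # peak points ydot_j on Ydot
ydd  = f(ypk[:, None], ydpk[None, :])             # yddot_ij sampled from the plant

def lin_rhs(y, yd):
    """Edge-linearized model (12): rectangle waves (11) on Y, triangles on Ydot."""
    i = np.argmin(np.abs(ypk - y))                # rectangle wave picks the nearest peak
    j = min(max(np.searchsorted(ydpk, yd) - 1, 0), q - 2)
    dyd = ydpk[j] - ydpk[j + 1]
    p1 = (ydd[i, j + 1] - ydd[i, j]) / dyd                          # (14)
    q1 = (ydpk[j] * ydd[i, j + 1] - ydpk[j + 1] * ydd[i, j]) / dyd  # (15)
    return -p1 * yd + q1                                            # (16)

t = np.linspace(0, 20, 2001)                      # T = 20, y(0) = 2, ydot(0) = 0
real = solve_ivp(lambda t, s: [s[1], f(s[0], s[1])],       (0, 20), [2.0, 0.0], t_eval=t)
lin  = solve_ivp(lambda t, s: [s[1], lin_rhs(s[0], s[1])], (0, 20), [2.0, 0.0], t_eval=t)
```

Plotting real.y against lin.y (and the corresponding phase-plane curves) reproduces the qualitative comparison of Figs. 1-3.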
6 Results and Conclusions
As we know, fuzzy control can be applied to control problems in fuzzy environments that can hardly be modeled by typical methods (see [11] and [12]). Usually we apply fuzzy inference to real systems; some algebraic models are then formed by these fuzzy inference systems, so that fuzzy controllers can be designed directly for the real systems, often with good control effect. However, a mature control theory is always based on mathematical models of the plants in real systems. Therefore, the fact that the plants in fuzzy control systems can hardly be modeled has been a bottleneck for the development of fuzzy control theory. In [1], a modeling method for fuzzy control systems based on fuzzy inference was proposed, which largely solves this bottleneck problem. But the mathematical models obtained by the methods in [1] are mostly nonlinear differential equations with variable coefficients. So, how to approximately transform these nonlinear models into linear ones is undoubtedly a very important problem.

The edge linearization method proposed in this paper solves the above problem. The key of the method is to change the membership functions on some universes (also called edges) from triangle wave shape to rectangle wave shape, so that the base variables of those universes (such as the base variable $y(t)$ in universe $Y$ or $\dot y(t)$ in universe $\dot Y$) disappear from the original nonlinear equation. This achieves the aim of linearization. For a second order system, we can implement the linearization by using this method on just one of the universes. For a third order system, we can implement the linearization by using this method on any two of the universes $Y$, $\dot Y$ and
$\ddot Y$. And so on: for an $n$-th order system, we can implement the linearization by using this method on any $n-1$ of the $n$ edges $Y$, $\dot Y$, $\ddot Y$, $\cdots$, $Y^{(n-1)}$. In order to avoid the lacking term phenomenon in some cases, we propose the concept of a joint equation on edge linearization models, which offers some convenience for qualitative or quantitative analysis. The simulation results show that the edge linearization model indeed approximates the original nonlinear equation well.
References
1. Li, H.X., Wang, J.Y., Miao, Z.H.: Modeling on fuzzy control systems. Science in China, Ser. A 45 (2002) 1506-1517
2. Li, H.X.: Interpolation mechanism of fuzzy control. Science in China, Ser. E 41 (1998) 312-320
3. Li, H.X.: Adaptive fuzzy controllers based on variable universe. Science in China, Ser. E 42 (1999) 10-20
4. Li, H.X.: Relationship between fuzzy controllers and PID controllers. Science in China, Ser. E 42 (1999) 215-224
5. Li, H.X.: Fuzzy logic systems are equivalent to feedforward neural networks. Science in China, Ser. E 43 (2000) 42-54
6. Li, H.X.: To see the success of fuzzy logic from mathematical essence of fuzzy control. Fuzzy Systems and Mathematics (in Chinese) 9 (1995) 1-14
7. Wang, G.J.: On the foundation of fuzzy reasoning. Lecture in Fuzzy Mathematics and Computer Science, Omaha: Creighton University 4 (1997) 1-24
8. Li, H.X., Chen, P., Huang, H.P.: Fuzzy Neural Intelligent Systems. Florida: CRC Press (2001)
9. Li, H.X., Yen, V.C.: Fuzzy Sets and Fuzzy Decision-Making. Florida: CRC Press (1995)
10. Koo, T.J.: Stable model reference adaptive fuzzy control of a class of nonlinear systems. IEEE Transactions on Fuzzy Systems 9 (2001) 624-636
11. Sun, Z.Q.: Theorem and Technology of Intelligence Control. Beijing: Qinghua Publishing House (1997) 16-123
12. Zhang, N.Y.: Structure analysis of typical fuzzy controllers. Fuzzy Systems and Mathematics (in Chinese) 11 (1997) 10-21
A Common Framework for Rough Sets, Databases, and Bayesian Networks

S.K.M. Wong and D. Wu
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada, S4S 0A2
{wong, danwu}@cs.uregina.ca
1 Introduction
It has been pointed out that there exists an intriguing relationship between propositional modal logic and rough sets [8, 2]. In this paper, we use first order modal logic (FOML) to formulate a common framework for rough sets, databases, and Bayesian networks. The relational view of the semantics of first order modal logic provides a unified interpretation of many related concepts.
2 First Order Modal Logic and Its Relational Representation
We first briefly describe the language of first order modal logic (FOML). Consider a system with n agents. We use T to denote a set of relation symbols, function symbols, and constant symbols. Each relation symbol or function symbol has an associated arity, which corresponds to the number of arguments it can take. We use K to denote a set {K1, . . . , Kn} of n modal operators, each corresponding to an agent. We refer to the set T ∪ K of relation symbols, function symbols, constant symbols, and modal operators as the vocabulary of the language of FOML. We assume an infinite supply of variables, which we usually write as a, b, c, u, . . ., possibly with subscripts. Constant symbols and variables are called terms. They are used to describe the individuals in the domain. We can form more complicated terms by using function symbols. In other words, variables, constant symbols and terms are all used in the language of FOML to denote an individual in the domain. More formally, the set of terms is defined inductively by starting with variable symbols and constant symbols, and closing off under the application of function symbols. That is to say, if f is a function symbol of arity k, and if v1, . . . , vk are terms, then f(v1, . . . , vk) is a term. Terms are used to define formulas. An atomic formula is either of the form ϕ(v1, . . . , vk), where ϕ is a relation symbol of arity k and v1, . . . , vk are terms, or of the form v1 = v2. If ψ and φ are formulas, then so are ¬ψ and ψ ∧ φ. If ψ is a formula, then so is Ki ψ. In addition, we can form formulas by using quantifiers: if ψ is a formula and u is a variable, then ∃u ψ is also a formula. The formula ψ ∨ φ
is an abbreviation for ¬(¬ψ ∧ ¬φ), and the formula ∀u ψ is an abbreviation for ¬∃u ¬ψ. The semantics of FOML uses relational Kripke structures [1]. A relational Kripke structure for n agents over the vocabulary T ∪ K is a tuple M = (S, π, K1, . . . , Kn), where S is a set of states (possible worlds), π associates with each state s ∈ S a normal interpretation π(s) for first order logic, and each Ki is a binary equivalence relation on S. We assume a common domain D, i.e., the domain is the same at every state. A valuation t on M is a function that assigns to each variable a member of D. Under the relational Kripke structure M, we define truth of formulas in a straightforward way. For a state s ∈ S, a valuation t on M, and a formula ϕ, we write (M, t, s) |= ϕ to mean that the formula ϕ is true at state s of M under valuation t. In the case of Ki ϕ, we define (M, t, s) |= Ki ϕ if for every s′ such that (s, s′) ∈ Ki, (M, t, s′) |= ϕ. In the following, we demonstrate how we can conveniently use relations to represent the semantics of a formula in FOML.

Definition 1. Consider a formula ϕ in FOML and a relational Kripke structure M. Let S_t^ϕ = {s ∈ S | (M, t, s) |= ϕ}, where t is a valuation and s is a state of M. We call S_t^ϕ the target states of formula ϕ for the valuation t.

The above definition indicates that S_t^ϕ denotes all the states at which ϕ is true under a fixed valuation t. Consider a formula ϕ(u) = ϕ(u1, . . . , un) with n variables, where u = (u1, u2, . . . , un). We can represent the formula ϕ(u) by the relation r(ϕ) in Fig. 1.
r(ϕ) =

  u1       u2       ...  un       | W
  t1(u1)   t1(u2)   ...  t1(un)   | S_{t1}^ϕ
  t2(u1)   t2(u2)   ...  t2(un)   | S_{t2}^ϕ
  ...      ...      ...  ...      | ...
  tm(u1)   tm(u2)   ...  tm(un)   | S_{tm}^ϕ
Fig. 1. A relational representation of the formula ϕ(u).
Note that the attributes of the relation r(ϕ) are the variables u1, . . . , un in the formula ϕ(u). An additional column with attribute name W is used to denote the target states S_{ti}^ϕ. Here we have established the connection between a formula ϕ(u) and its relational representation r(ϕ). The relation r(ϕ) shown in Fig. 1 is called the meta relation of the formula ϕ(u).
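As a small illustration (not part of the paper), a meta relation can be held directly as tuples of valuation values paired with their target-state sets; all names and values below are hypothetical:

```python
# meta relation r(phi): one row per valuation t, plus the target states S_t^phi
# (states s1..s4; formula phi(u1, u2) with two free variables)
meta_relation = [
    # (t(u1), t(u2), W = S_t^phi)
    ("a", "b", {"s1", "s2"}),
    ("a", "c", {"s2"}),
    ("b", "c", set()),   # phi is false at every state under this valuation
]

# the database relation of Sect. 4 is the projection that drops column W
relation = [(u1, u2) for (u1, u2, _) in meta_relation]
```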
3 Rough Sets
Consider a formula Kϕ(u), where u = (u1, . . . , un) and K ∈ K. We can represent this formula by a meta relation as shown in Fig. 2, in which S_t^{Kϕ} = {s ∈ S | (M, t, s) |= Kϕ}.
r(Kϕ) =

  u1       u2       ...  un       | W
  t1(u1)   t1(u2)   ...  t1(un)   | S_{t1}^{Kϕ}
  t2(u1)   t2(u2)   ...  t2(un)   | S_{t2}^{Kϕ}
  ...      ...      ...  ...      | ...
  tm(u1)   tm(u2)   ...  tm(un)   | S_{tm}^{Kϕ}
Fig. 2. A relational representation of the formula Kϕ(u).
We call S_t^{Kϕ} the lower bound of S_t^ϕ for a fixed valuation t. Let S^ϕ = ∪_i S_{ti}^ϕ and S^{Kϕ} = ∪_i S_{ti}^{Kϕ}. We refer to S^{Kϕ} as the lower bound of S^ϕ. However, one may alternatively define the lower bound of S^ϕ using the following definition.

Definition 2. (M, s) |= Ki ϕ(u) if for every s′ with (s, s′) ∈ Ki, there exists a valuation t such that (M, t, s′) |= ϕ(u).

Let Ŝ^{Kϕ} = {s ∈ S | (M, s) |= Kϕ(u)}. We want to point out that Ŝ^{Kϕ} is actually equal to the conventional lower bound of S^ϕ derived from propositional modal logic. In fact, S^{Kϕ} ⊆ Ŝ^{Kϕ} ⊆ S^ϕ. We say Ŝ^{Kϕ} is a more refined lower bound of S^ϕ. Therefore, our approach provides a granular view of S^ϕ in terms of different lower bounds.
4 Databases
In Fig.1, r(ϕ) is the relational representation of the formula ϕ(u). If we omit the column W in r(ϕ), then the table shown in Fig. 3 becomes a standard relation r(ϕ). This omission corresponds to ignoring the usage of possible worlds in our knowledge system. Traditionally, the relational database model is based on first order logic. Thus, in our approach database relations can be viewed as a special representation of the FOML formulas.
r(ϕ) =

  u1       u2       ...  un
  t1(u1)   t1(u2)   ...  t1(un)
  t2(u1)   t2(u2)   ...  t2(un)
  ...      ...      ...  ...
  tm(u1)   tm(u2)   ...  tm(un)
Fig. 3. The relation r(ϕ) obtained from meta relation r(ϕ) in Fig.1 by omitting the column W .
5 Bayesian Networks

In this section, we augment FOML by introducing numeric operators Φ̄i, so that if ϕ(u) is a formula, then Φ̄i(ϕ(u)) is a numeric term. The interpretation of the numeric operator Φ̄i, denoted Φi, is a function from 2^S to R+. We have shown that the formula ϕ(u) can be represented by a meta relation (see Fig. 1). Similarly, the numeric term Φ̄i(ϕ(u)) can be represented as a meta relation as shown in Fig. 4.

Φi^ϕ(u) = r(ϕ(u), Φ̄i) =

  u1       u2       ...  un       | W
  t1(u1)   t1(u2)   ...  t1(un)   | Φi(S_{t1}^ϕ)
  t2(u1)   t2(u2)   ...  t2(un)   | Φi(S_{t2}^ϕ)
  ...      ...      ...  ...      | ...
  tm(u1)   tm(u2)   ...  tm(un)   | Φi(S_{tm}^ϕ)
Fig. 4. A relational representation of the term Φ¯i (ϕ(u)).
If we interpret a numeric operator, say Φ̄0, as a probability operator, then the function Φ0^ϕ(u) in Fig. 4 becomes a probability distribution. Consider the following formula ϕ(a, b, c), which can be expressed as:

ϕ(a, b, c) ← ϕ1(a) ∧ ϕ2(b, a) ∧ ϕ3(c, a).

If we adopt the following interpretations:

Φ1^{ϕ1}(a) ⟺ p(a),  Φ2^{ϕ2}(b, a) ⟺ p(b|a),  Φ3^{ϕ3}(c, a) ⟺ p(c|a),  Φ0^ϕ(a, b, c) ⟺ p(a, b, c),

where p(a, b, c) is a joint probability distribution, then the formula ϕ(a, b, c) can be interpreted as:

Φ0^ϕ(a, b, c) = Φ1^{ϕ1}(a) · Φ2^{ϕ2}(b, a) · Φ3^{ϕ3}(c, a).
That is, p(a, b, c) = p(a) · p(b|a) · p(c|a). The above expression is in fact a Bayesian factorization of the joint probability distribution p(a, b, c) in terms of the distributions p(a), p(b|a), and p(c|a). In other words, such a factorization represents a Bayesian network [3], whose graphical structure is depicted by the directed acyclic graph shown in Fig. 5.
Fig. 5. The graphical representation of the Bayesian network: a directed acyclic graph with arcs a → b and a → c.
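A small numeric sketch (all probabilities invented for illustration) shows the factorization at work: the conditional tables, multiplied pointwise, recover a proper joint distribution.

```python
from itertools import product

p_a  = {0: 0.6, 1: 0.4}                                       # p(a)
p_ba = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # p(b|a), keyed (b, a)
p_ca = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1}   # p(c|a), keyed (c, a)

# Bayesian factorization: p(a, b, c) = p(a) * p(b|a) * p(c|a)
p_abc = {(a, b, c): p_a[a] * p_ba[(b, a)] * p_ca[(c, a)]
         for a, b, c in product((0, 1), repeat=3)}

assert abs(sum(p_abc.values()) - 1.0) < 1e-12   # a proper joint distribution
```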
6 Conclusion
Within the framework of first order modal logic, we have shown that the relational representation of formulas encompasses rough sets, databases, and Bayesian networks. Therefore, FOML can serve as a common framework for these three apparently different but related knowledge systems [7, 6, 4, 5].
References
[1] R. Fagin, J. Halpern, Y. Moses, and M. Vardi. Reasoning About Knowledge. MIT Press, Cambridge, Massachusetts, 1996.
[2] E. Orlowska. Logical aspects of learning concepts. International Journal of Approximate Reasoning, 2:349-364, 1988.
[3] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco, California, 1988.
[4] S.K.M. Wong. A logical approach for modeling uncertainty. In 6th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, volume 1, pages 129-135, 1996.
[5] S.K.M. Wong. An extended relational data model for probabilistic reasoning. Journal of Intelligent Information Systems, 9:181-202, 1997.
[6] S.K.M. Wong, C.J. Butz, and Y. Xiang. A method for implementing a probabilistic model as a relational database. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 556-564. Morgan Kaufmann Publishers, 1995.
[7] S.K.M. Wong, Y. Xiang, and Xiaopin Nie. Representation of Bayesian networks as relational databases. In Fifth International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, pages 159-165, 1994.
[8] Y.Y. Yao and T.Y. Lin. Generalization of rough sets using modal logic. Intelligent Automation and Soft Computing, an International Journal, 2(2):103-120, 1996.
Rough Sets, EM Algorithm, MST and Multispectral Image Segmentation

Sankar K. Pal and Pabitra Mitra
Machine Intelligence Unit, Indian Statistical Institute, Calcutta 700 108, India
{sankar, pabitra_r}@isical.ac.in
Segmentation is a process of partitioning an image space into some nonoverlapping, meaningful, homogeneous regions. The success of an image analysis system depends on the quality of segmentation. Two broad approaches to segmentation of remotely sensed images are gray level thresholding and pixel classification [1]. The multispectral nature of most remote sensing images makes pixel classification the natural choice for segmentation.

A general method of statistical multispectral image segmentation is to represent the probability density function of the data as a mixture model, which asserts that the data is a combination of k individual component densities (commonly Gaussians), corresponding to k clusters. The task is to identify, given the data, a set of k populations in it, and provide a model (density distribution) for each of the populations. The EM algorithm is an effective and popular technique for estimating the mixture model parameters.

Rough set theory [2] provides an effective means for analysis of data by synthesizing or constructing approximations (upper and lower) of set concepts from the acquired data. An important use of rough set theory and granular computing has been in generating logical rules for classification and association [3]. These logical rules correspond to different important regions of the feature space, which represent data clusters.

In this article we exploit the above characteristics of rough set theoretic logical rules to obtain an initial approximation of the Gaussian mixture model parameters. The crude mixture model, after refinement through EM, leads to accurate clusters. Here, rough set theory offers a fast and robust (noise insensitive) solution to the initialization, besides reducing the local minima problem of iterative refinement clustering. Also, the problem of choosing the number of mixtures is circumvented, since the number of Gaussian components to be used is automatically decided by rough set theory.

The problem of modelling non-convex clusters is addressed by constructing a minimal spanning tree (MST) with each Gaussian component as a node and the Mahalanobis distance between them as edge weights. Since MST clustering is performed on the Gaussian models rather than the individual data points, and
Fig. 1. Block diagram of the proposed clustering algorithm
Experiments were performed on two four band IRS-1A satellite images. Comparison is made both in terms of a cluster quality index [1] and computational time, in order to demonstrate the effect of the individual components.
References 1. S. K. Pal, A. Ghosh, and B. Uma Shankar, “Segmentation of remotely sensed images with fuzzy thresholding, and quantitative evaluation,” International Journal of Remote Sensing, vol. 21(11), pp. 2269–2300, 2000. 2. Z. Pawlak, Rough Sets, Theoretical Aspects of Reasoning about Data, Kluwer Academic, Dordrecht, 1991. 3. A. Skowron and C. Rauszer, “The discernibility matrices and functions in information systems,” in Intelligent Decision Support, Handbook of Applications and Advances of the Rough Sets Theory, R. Slowi´ nski, Ed., pp. 331–362. Kluwer Academic, Dordrecht, 1992.
Rough Mereology: A Survey of New Developments with Applications to Granular Computing, Spatial Reasoning and Computing with Words

Lech Polkowski
Polish-Japanese Institute of Information Technology, Koszykowa 86, 02008 Warsaw, Poland
{Lech.Polkowski, polkow}@pjwstk.edu.pl
Abstract. In this paper, we present a survey of new developments in rough mereology, i.e. an approximate calculus of parts, an approach to reasoning under uncertainty based on the notion of an approximate part (part to a degree), along with pointers to its main applications. The paradigms of Granular Computing (GC), Computing with Words (CWW) and Spatial Reasoning (SR) are particularly suited to a unified treatment by means of Rough Mereology (RM).

Keywords: rough mereology, computing with words, granular computing, spatial reasoning, rough sets
1 Rough Sets: First Notions

Rough set theory approaches the problem of inexact concepts (cf. [6]) with representation of knowledge in the form of an information system A = (U, A), where U is the set of objects and A is the set of attributes, each a ∈ A being a map a : U → V_a. Definable concepts X ⊆ U are then expressed as unions of equivalence classes [x]_{IND_A} of the indiscernibility relation IND_A = {(x, y) : ∀a ∈ A. a(x) = a(y)}. For an arbitrary concept X ⊆ U, two approximations are formed, viz. $\underline{A}X$ = {x ∈ U : [x]_{IND_A} ⊆ X} (the lower approximation) and $\overline{A}X$ = {x ∈ U : [x]_{IND_A} ∩ X ≠ ∅} (the upper approximation).
2 Classical Mereology Theory

Our presentation of mereology is based on [4], [5] in the first place. First we resort to ontology by assuming that we have a family of concepts divided into
two categories: the class AT of atomic (individual) concepts and the class CMP of non-atomic (complex) concepts (names). The predicate AT(x) takes value 1 when x is an atomic concept, and 0 otherwise. The formula xεY means that an individual x answers to a complex name Y, i.e. AT(x), CMP(Y) hold by default. However, we point to the fact that in the formula xεY the object Y may be an individual as well; then x = Y holds.

2.1 Mereology Axioms

(A1) xpty ⟹ AT(x) ∧ AT(y); this means that the functor pt of part is defined for individual concepts only.
(A2) xpty ∧ yptz ⟹ xptz; this means that the functor pt is transitive, i.e. a part of a part is a part.
(A3) non(xptx); this means that the functor pt is non-reflexive (or, equivalently, if xpty then non(yptx)).

We define an element as follows: xely ⟺ xpty ∨ x = y. In terms of the predicate el, an important Inference Rule may be stated [4], [11]:

(IR) (The Inference Rule) [∀z.(zelx ⇒ ∃w, q. welz ∧ welq ∧ qely)] ⇒ xely.

The remaining axioms of mereology are related to the class functor that converts distributive classes (complex concepts) into individual concepts: it may be used to represent "United States" as an individual comprising all US states. The class operator Cls is a principal tool in applications of rough mereology to problems of distributed systems, knowledge granulation, and computing with words [11], [12], [13]. Here we see a formal advantage of mereology: we have to deal only with objects, not with their families. For a non-empty concept Y, the class of Y, Cls(Y), is defined as follows:

x = Cls(Y) ⟺ ∃z.zεY ∧ ∀z.(zεY ⟹ zelx) ∧ ∀z.(zelx ⟹ ∃u, w.(uεY ∧ welu ∧ welz)).

The class functor is subject to the following postulates.
(A4) xεCls(Y) ∧ zεCls(Y) ⟹ x = z; this means that Cls(Y) is a unique individual concept, for any (nonempty) Y.
(A5) ∃z.zεY ⟺ ∃z.zεCls(Y); meaning that Cls(Y) exists (i.e. is a nonempty individual name) if and only if Y is a nonempty name.
3 Rough Mereology: A Calculus of Approximate Parts

Rough mereology extends mereology by considering the functor µ_r of a part to a degree r, for r ∈ [0, 1], cf. [11], [12], [13]. The following is a list of basic postulates of rough mereology. We assume the predicate of part pt already defined, so we work in a fixed mereological context. We introduce a family µ_r, where r ∈ [0, 1], called a rough inclusion, which satisfies
(RM1) xµ₁y ⟺ xely
(RM2) xµ₁y ⟹ ∀z.(zµ_r x ⟹ zµ_r y)
(RM3) x = y ∧ xµ_r z ⟹ yµ_r z
(RM4) xµ_r y ∧ s ≤ r ⟹ xµ_s y   (1)

The postulate (RM1) relates rough inclusion to mereology: xµ₁y is equivalent to x being an element of y. In this way, the given exact mereological structure is embedded into the rough mereological structure. It also follows that µ_r is defined on individual objects only. The postulate (RM2) expresses monotonicity of µ with respect to the relation el. By (RM3), µ_r is a congruence with respect to identity. (RM4) sets the meaning of µ_r: it means a degree of at least r. The function µ is called a rough inclusion; the term was introduced in [11].

Example 1. A generalized rough membership function Xµ_rY ⟺ |X ∩ Y|/|X| ≥ r when X is nonempty, and 1 otherwise (cf. [8], [11]), where X, Y are (either exact or rough) subsets (concepts) of the universe U of an information system (U, A), is an example of a rough inclusion on concepts (= sets of objects) regarded as individual elements of the mereological universe.

It is evident that we cannot in general say more about the properties of µ_r; in particular, we lack in general the transitivity property. The class operator Cls may be recalled now. We make use of it in defining the notion of a granule in the rough mereological universe of individual objects.

3.1 Rough Mereological Knowledge Granulation

For given r < 1 and x, we let g_r(x) denote the class Cls(Ψ_r), where Ψ_r(y) ⟺ yµ_r x. The class g_r(x) collects in a single class-concept all individuals satisfying the class definition with the concept Ψ_r. From (RM1)-(RM4), the following properties may be deduced.

Proposition 1.
1. xµ_r y ⟹ xelg_r(y)
2. xµ_r y ∧ yelz ⟹ xelg_r(z)
3. ∀z.[zely ⟹ ∃w, q. welz ∧ welq ∧ qµ_r x] ⟹ yelg_r(x)
4. yelg_r(x) ∧ zely ⟹ zelg_r(x)
5. s ≤ r ⟹ g_r(x)elg_s(x)   (2)

Proof. 1 follows by the definition of g_r and the class definition. 2 is implied by (RM2). For 3, use the inference rule (IR). 4 follows by transitivity of the relation el of being an element. 5 is a consequence of (RM4), (IR), and the class definition.

The class g_r(x) may be regarded as a neighborhood (cluster) of (about) x of radius r. Let us observe that g₁(x) = x: it is the class of elements of x, hence x itself.
3.2 Rough Mereology in Information Systems: A Case Study of the Linear Gaussian Rough Inclusion and of the Łukasiewicz Rough Inclusion
We will single out some propositions for a rough inclusion in an information system A = (U, A), as case studies. We define, for x, y ∈ U, the set

DIS(x, y) = {a ∈ A : a(x) ≠ a(y)}.   (3)

The linear gaussian rough inclusion. With the help of DIS(x, y), we define a linear gaussian rough inclusion (LGRI, for short) µ_r^A by letting

xµ_r^A y ⟺ exp(−Σ_{a∈DIS(x,y)} w_a) ≥ r,   (4)

where w_a ∈ (0, ∞) is a weight associated with the attribute a, for each a ∈ A. It remains to verify that (4) is indeed a rough inclusion. We consider as individual objects the classes of the indiscernibility relation IND_A, and we define the notion of an element as follows:

xely ⟺ DIS(x, y) = ∅.   (5)

This notion of an element is then identical with = and corresponds to the empty part relation.

Proposition 2. µ_r^A satisfies (RM1)-(RM4) with the notion of element as in (5).

Proof. For (RM1), we have xµ₁^A y if and only if DIS(x, y) = ∅, if and only if xely. For (RM2), clearly DIS(x, y) = ∅ implies DIS(x, z) = DIS(y, z), and the same argument justifies (RM3). (RM4) follows by definition (4).

Properties of linear gaussian inclusions are collected in the next proposition. We denote by g_r^A the neighborhood induced by µ_r^A.

Proposition 3.
1. xIND_A y ⟹ xµ₁^A y
2. ∃η(r, s) = r · s. xµ_r^A y ∧ yµ_s^A z ⟹ xµ_{η(r,s)}^A z   (6)

Proof. 1 follows directly from the proof of Proposition 2. To prove 2: indeed, from

DIS(x, z) ⊆ DIS(x, y) ∪ DIS(y, z)   (7)

we get exp(−Σ_{a∈DIS(x,z)} w_a) ≥ exp(−Σ_{a∈DIS(x,y)} w_a) · exp(−Σ_{a∈DIS(y,z)} w_a). It follows from Proposition 3.2 that the LGRI µ_r^A satisfies the following transitivity law:

from xµ_r^A y and yµ_s^A z, infer xµ_{r·s}^A z.   (8)
The reader will recognize the Menger t-norm Prod(x, y) = x · y as the function realizing the transitivity scheme (8). A look at (4) shows that µ_r^A is constant on indiscernibility classes of IND_A. In the sequel, we will tacitly use µ_r^A on objects as well as on indiscernibility classes.

Proposition 4.
(i) xelg_r(y) ⟺ xµ_r^A y
(ii) xelg_r(y) ⟹ ∀t ∈ [0, 1]. g_t(x) el g_{t·r}(y)   (9)

Proof. For (i), assume that xelg_r(y); then, for every velx, by the class definition, we find w, q such that welv, welq, and qµ_r^A y. Thus wµ₁^A q, hence wµ_{1·r}^A y, and finally vµ_r^A y, xµ_r^A y by (RM2). The proof of (ii) follows along the same lines with the use of the inference rule (IR).

The Łukasiewicz rough inclusion (ŁRI). We begin with the information system A = (U, A) and the sets DIS(x, y) defined as above in (3). As in Sect. 3.2, we exploit these sets, but in a different way. For x, y ∈ U, we let

xµ_r^Ł y ⟺ 1 − |DIS(x, y)|/|A| ≥ r,   (10)

calling µ_r^Ł the Łukasiewicz rough inclusion. We recall the Łukasiewicz tensor product ⊗(x, y) defined via the formula

⊗(x, y) = max{0, x + y − 1}.   (11)

The following is the counterpart to Proposition 3.2, stating a transitivity property of the ŁRI.

Proposition 5. Transitivity of the Łukasiewicz rough inclusion is expressed by the following scheme:

from xµ_r^Ł y and yµ_s^Ł z, infer xµ_{⊗(r,s)}^Ł z.   (12)

Proof. Assume that xµ_r^Ł y, yµ_s^Ł z, xµ_t^Ł z; we need an estimate of t. By (7), we have 1 − t ≤ 1 − r + 1 − s, hence t ≥ r + s − 1. As obviously t ≥ 0, we have finally t ≥ max{0, r + s − 1} = ⊗(r, s).

Thus, the Łukasiewicz rough inclusion corresponds to the Łukasiewicz product (t-norm). As with the LGRI, one checks here that Proposition 4(i) holds, viz. xelg_r(y) if and only if xµ_r^Ł y.
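Both case-study inclusions are driven entirely by the sets DIS(x, y), so they are easy to compute. A minimal sketch (illustrative, with an information system stored as Python dicts mapping attribute names to values):

```python
import math

def dis(x, y, attrs):
    """DIS(x, y) of (3): the attributes on which objects x and y differ."""
    return {a for a in attrs if x[a] != y[a]}

def mu_lgri(x, y, attrs, w):
    """Degree of the linear gaussian rough inclusion (4), with weights w[a]."""
    return math.exp(-sum(w[a] for a in dis(x, y, attrs)))

def mu_luk(x, y, attrs):
    """Degree of the Lukasiewicz rough inclusion (10)."""
    return 1 - len(dis(x, y, attrs)) / len(attrs)

# transitivity (8) / (12): mu(x, z) is bounded below by the t-norm of the degrees
prod = lambda r, s: r * s                    # Menger t-norm, for the LGRI
luk  = lambda r, s: max(0.0, r + s - 1.0)    # Lukasiewicz t-norm, for the LRI
```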
4 Rough Mereological Granular Computing

We define an intelligent unit modeled on a classical perceptron [1].
4.1 Rough Mereological Perceptron
We exhibit the structure of a rough mereological perceptron (RMP). It consists of an intelligent unit int-ag, denoted ia. The input to ia is a finite set of connections Link_{ia,in} = {link₁, ..., link_m}; each link_j has as its source an information system A_j = (U_j, A_j) endowed with a linear gaussian rough inclusion µ_r^j. The output of ia is a connection link_{ia,out} to an information system A_{ia} = (U_{ia}, A_{ia}) equipped with the linear gaussian rough inclusion µ^{ia}. The operation (function) realized in the RMP is denoted O_{ia}; thus, for every tuple ⟨x₁, x₂, ..., x_m⟩, where x_i ∈ U_i, the object x = O_{ia}(x₁, x₂, ..., x_m) ∈ U_{ia}. In each U_j, as well as in U_{ia}, finite sets T_j, T_{ia} are selected with the properties that (i) for each t ∈ T_{ia} there exist t₁ ∈ T₁, ..., t_m ∈ T_m with t = O_{ia}(t₁, ..., t_m); (ii) for each choice of t_i ∈ T_i, i = 1, 2, ..., m, there is t ∈ T_{ia} with t = O_{ia}(t₁, ..., t_m). In case t = O_{ia}(t₁, ..., t_m), we say that ⟨t₁, ..., t_m, t⟩ is an admissible set of references. The set of all admissible reference sets is denoted Σ. The operation of an RMP may be expressed in terms of the functor ω_{ia} defined as follows:

ω_{ia} : Σ × [0, 1]^m → [0, 1]; for σ ∈ Σ with σ = ⟨t₁, ..., t_m, t⟩, ω_{ia}(σ, r₁, ..., r_m) ≥ r if and only if xµ_r^{ia}t whenever x_i µ_{r_i}^i t_i for i = 1, 2, ..., m, where x = O_{ia}(x₁, ..., x_m).   (13)

4.2 Granular Computations

The functor ω_{ia} factors through a granule operator, viz. (13) may be expressed as follows:

ω_{ia}^g(σ, g_{r₁}(t₁), ..., g_{r_m}(t_m)) = g_{ω_{ia}(σ, r₁, ..., r_m)}(t),   (14)

which also defines the factored functor ω^g acting on granules. Let us observe that the action of an RMP may as well be described as that of a granular controller, viz. the functor ω^g may be described via a decision algorithm consisting of rules of the form

if g_{r₁}(t₁) and g_{r₂}(t₂) and ... and g_{r_m}(t_m) then g_r(t),   (15)

with r = ω_{ia}(σ, r₁, ..., r_m), where σ = ⟨t₁, ..., t_m, t⟩. It is worth noticing that the functor ω_{ia} is defined from the given information systems A_j, A_{ia}, and is not subject to an arbitrary choice. Composition of RMPs involves a composition of the corresponding functors ω, viz. given RMP₁, ..., RMP_k, RMP, with the links to RMP being the outputs from RMP₁, ..., RMP_k, each RMP_j having inputs Link_j = {link₁^j, ..., link_{k_j}^j}, m = Σ_{j=1}^k k_j, the composition IA = RMP ∘ (RMP₁, RMP₂, ..., RMP_k) of m inputs satisfies the formula

ω_{IA} = ω_{RMP} ∘ (ω_{RMP₁}, ..., ω_{RMP_k}),   (16)
under the tacit condition that admissible sets of references are composed as well. Thus RMPs may be connected in networks subject to standard procedures, e.g. learning by backpropagation, etc.

Computing with Words. The paradigm of computing with words (Zadeh) assumes that the syntax of the computing mechanism is that of natural language, while the semantics is given in a formal computing mechanism involving numbers. Let us observe how this idea may be implemented in an RMP. Assume there is given a set N of noun phrases {n₁, n₂, ..., n_m, n} corresponding to the information system universes U₁, ..., U_m. A set ADJ of adjective phrases is also given, and to each σ ∈ Σ a set adj₁, ..., adj_m, adj is assigned. Then the decision rule (15) may be expressed in the form

if n₁ is adj₁ and ... and n_m is adj_m then n is adj.   (17)

The semantics of (17) is expressed in the form of (15). The reader will observe that (17) is similar in form to the decision rules of a fuzzy controller, while the semantics is distinct. Composition of RMPs as above is reflected in compositions of rules of the form (17), with semantics expressed by the composed functors ω.
5 Spatial Reasoning

Spatial reasoning is usually based on a functor of connection C [2], [10], which satisfies the following conditions:

xCx;  xCy ⟹ yCx;  [∀z.(zCx ⟺ zCy)] ⟹ x = y.   (18)

In terms of C, other spatial relations are defined, like being a tangential/non-tangential part or being an interior [10]. Spatial reasoning is related to spatial objects, hence we depart here from the setting of the LGRI and ŁRI and consider an example in which objects are located in Euclidean spaces.

Example 2. The universe of objects consists of squares of the form

[k + i/2^s, k + (i+1)/2^s] × [l + j/2^s, l + (j+1)/2^s]

with k, l ∈ Z, i, j = 1, 2, ..., 2^s − 1, s = 1, 2, .... We define a rough inclusion µ as follows: xµ_r y ⟺ |x ∩ y|/|x| ≥ r, where |x| is the area of x. Then we define a connection C_u, with u ∈ [0, 1], as follows:

xC_u y ⟺ ∃z. ∃r, s ≥ u. zµ_r x ∧ zµ_s y.   (19)

Then one verifies directly that C_u with u > 0 is a connection.
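Example 2 can be checked mechanically. The sketch below (illustrative; it treats x, y, z simply as axis-aligned rectangles, whereas in Example 2 the witness z would itself range over the dyadic squares of the universe) computes the overlap-based rough inclusion and the derived connection C_u:

```python
def area(r):
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def inter(r, s):
    return (max(r[0], s[0]), max(r[1], s[1]), min(r[2], s[2]), min(r[3], s[3]))

def mu(z, x):
    """Degree of the rough inclusion of Example 2: |z ∩ x| / |z|."""
    return area(inter(z, x)) / area(z)

def connected(x, y, z, u):
    """z witnesses x C_u y of (19) when z mu_r x and z mu_s y with r, s >= u."""
    return mu(z, x) >= u and mu(z, y) >= u

# two unit squares sharing an edge: a straddling witness gives C_u for u <= 1/2
x, y = (0, 0, 1, 1), (1, 0, 2, 1)
z = (0.75, 0.25, 1.25, 0.75)   # hypothetical witness straddling the shared edge
assert connected(x, y, z, 0.5)
```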
Assume now an RMP as above, endowed with connections C_j, C_{ia} for j = 1, 2, ..., k. Then (cf. [10]):

Proposition 6. If x_j C_{j,r_j} t_j for j = 1, 2, ..., k, then x C_{ia, inf{ω(σ, s₁, ..., s_k) : s_j ≥ r_j, j = 1, 2, ..., k}} t, with σ = {t₁, ..., t_k, t}.

By means of this proposition, one may consider networks of RMPs for constructing more and more complex sets from simple primitive objects like the squares in Example 2.
References
1. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon, Oxford, 1997.
2. A. G. Cohn. Calculi for qualitative spatial reasoning. In J. Calmet, J. A. Campbell, J. Pfalzgraf (eds.), LNAI 1138: 124-143. Springer, Berlin.
3. J. Srzednicki, S. J. Surma, D. Barnett, and V. F. Rickey, editors. Collected Works of Stanislaw Leśniewski. Kluwer, Dordrecht, 1992.
4. S. Leśniewski. Grundzüge eines neuen Systems der Grundlagen der Mathematik. Fundamenta Mathematicae, 14: 1-81, 1929.
5. S. Leśniewski. On the foundations of mathematics. Topoi, 2: 7-52, 1982.
6. Z. Pawlak. Rough sets, algebraic and topological approach. International Journal of Computer and Information Sciences, 11: 341-366, 1982.
7. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht, 1992.
8. Z. Pawlak and A. Skowron. Rough membership functions. In R. R. Yager, M. Fedrizzi, and J. Kacprzyk, editors, Advances in the Dempster-Shafer Theory of Evidence, 251-271, Wiley, New York, 1994.
9. L. Polkowski. Rough Sets. Mathematical Foundations. Physica, Heidelberg, 2002.
10. L. Polkowski. On connection synthesis via rough mereology. Fundamenta Informaticae, 46(1/2): 83-96, 2001.
11. L. Polkowski, A. Skowron. Rough mereology: a new paradigm for approximate reasoning. International Journal of Approximate Reasoning, 15(4): 333-365, 1997.
12. L. Polkowski, A. Skowron. Adaptive decision-making by systems of cooperative intelligent agents organized on rough mereological principles. Intelligent Automation and Soft Computing, An International Journal, 2(2): 123-132, 1996.
13. L. Polkowski, A. Skowron. Grammar systems for distributed synthesis of approximate solutions extracted from experience. In G. Paun and A. Salomaa, editors, Grammatical Models of Multi-Agent Systems, 316-333, Gordon and Breach, Amsterdam, 1999.
A New Rough Sets Model Based on Database Systems

Xiaohua Tony Hu¹, T.Y. Lin², and Jianchao Han³
¹ College of Information Science and Technology, Drexel University, Philadelphia, PA 19104
² Dept. of Computer Science, San Jose State University, San Jose, CA 94403
³ Dept. of Computer Science, California State University, Dominguez Hills, CA 90747

Abstract. In this paper we present a new rough sets model based on database systems. We borrow the main ideas of the original rough sets theory and redefine them based on database theory, to take advantage of very efficient set-oriented database operations. We present a new set of algorithms to calculate the core and reducts based on our new database-based rough set model. Almost all the operations used in generating the core and reducts in our model can be performed using database set operations such as Count and Projection. Because our new rough sets model is designed around database set operations, it is, compared with the traditional rough set models, very efficient and scalable.
1 Introduction

Rough sets theory was first introduced by Pawlak in the 1980s [9] and has since been applied in many areas such as machine learning, knowledge discovery, and expert systems [6]. Many rough sets models have been developed by the rough set community in the last decade, such as Ziarko's VPRS [10] and Hu's GRS [2], to name a few. These rough set models focus on extending the limitations of the original rough sets model, such as handling statistical distributions or noisy data; not much attention has been paid to designing new rough sets models that generate the core and reducts efficiently, so as to be efficient and scalable on large data sets. Based on our own experience applying VPRS and GRS to large data sets in data mining applications, we found that one of the strong drawbacks of rough set models is the inefficiency of rough set methods in computing the core and reducts and identifying the dispensable attributes, which limits their suitability for data mining applications. Further investigation of the problem reveals that rough set models do not integrate with relational database systems, and many computationally intensive operations are performed on flat files rather than utilizing high-performance database set operations. In consideration of this, and influenced by [5], we borrow the main ideas of rough sets theory and redefine them using database set operations, to take advantage of these very efficient set-oriented operations. Almost all the operations used in generating reducts, the core, etc. in our method can be performed using database operations such as Cardinality and Projection. Compared with the traditional rough set approach, our method is very efficient and scalable. The rest of the paper is organized as follows: we give an overview of rough set theory with some
examples in Section 2. In Section 3, we discuss how to redefine the main concepts and methods of rough set based on database set operations. In Section 4, we describe rough set based feature selection. We conclude with some discussions in Section 5.
2 Overview of Rough Sets Theory

We assume that our dataset is stored in a relational table of the form Table(condition-attributes, decision-attributes). C is used to denote the condition attributes and D the decision attributes, with C ∩ D = Φ; t_j denotes the j-th tuple. Rough sets theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation, and the boundary, as shown in Figure 1. The lower approximation contains all the objects that can be classified with certainty based on the data collected. The upper approximation contains all the objects that can probably be classified, while the boundary is the difference between the upper approximation and the lower approximation. Below we give the formal definitions. Suppose T = {C, D} is a database table. We say two tuples t_i and t_j are in the same equivalence class induced by an attribute set S (S a subset of C or D) if t_i(S) = t_j(S); that is, tuples in the same equivalence class have the same values on all the attributes in S. Let [D] = {D_1, ..., D_k} denote the equivalence classes induced by D, and for A ⊆ C let [A] = {A_1, ..., A_m} denote the equivalence classes induced by A (each D_j, A_i is called an equivalence class or elementary set).

Definition 1: For a set D_j, the lower approximation Lower_{[A]/D_j} of D_j under A ⊆ C is the union of all those equivalence classes A_i, each of which is contained in D_j: Lower_{[A]/D_j} = ∪{A_i | A_i ⊆ D_j, i = 1, ..., m}. Any object t_i ∈ Lower_{[A]/D_j} can be classified with certainty into D_j. Lower_{[A]/[D]} = ∪{Lower_{[A]/D_j} | D_j ∈ [D], j = 1, ..., k}.

Definition 2: For a set D_j, the upper approximation Upper_{[A]/D_j} of D_j under A ⊆ C is the union of those equivalence classes A_i, each of which has a non-empty intersection with D_j: Upper_{[A]/D_j} = ∪{A_i | A_i ∩ D_j ≠ Φ, i = 1, ..., m}. Any object t_i ∈ Upper_{[A]/D_j} can probably be classified into D_j. Upper_{[A]/[D]} = ∪{Upper_{[A]/D_j} | D_j ∈ [D], j = 1, ..., k}.

Definition 3: The boundary Boundary_{[A]/[D]} = Upper_{[A]/[D]} − Lower_{[A]/[D]}.

Example 1: Suppose we have a collection of 8 cars (t1 to t8) with information about Door, Size, Cylinder and Mileage. Door, Size and Cylinder are the condition attributes and Mileage is the decision attribute. (The Tuple_id is just for explanation purposes.)
Definition 3: The boundary Boundary[A]/[D] = Upper[A]/[D] − Lower[A]/[D] Example 1: Suppose we have a collection of 8 cars (t1 to t8) with information about the Door, Size, Cylinder and Mileage. Door, Size and Cylinder are the condition attributes and Mileage is the decision attribute. (the Tupel_id is just for explanation purpose) Table 1. 8 Cars with {Door, Size, Cylinder, Mileage} Tuple_id t1 t2 t3 t4 t5 t6 t7 t8
Door 2 4 4 2 4 4 4 2
Size compact sub compact compact compact compact sub sub
Cylinder 4 6 4 6 4 4 6 6
Mileage high low high low low high low low
[Mileage] = {[Mileage=high], [Mileage=low]}
[Mileage=low] = {t2, t4, t5, t7, t8}
[Mileage=high] = {t1, t3, t6}
[Door Size Cylinder] = {{t1}, {t2, t7}, {t3, t5, t6}, {t4}, {t8}}
Lower_{[Door Size Cylinder]/[Mileage]} = {t1, t2, t4, t7, t8}
Upper_{[Door Size Cylinder]/[Mileage]} = {t1, t2, t3, t4, t5, t6, t7, t8}
Boundary_{[Door Size Cylinder]/[Mileage]} = {t3, t5, t6}
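The computation above is mechanical; a minimal sketch (illustrative, encoding Table 1 directly) groups tuples by attribute values and applies Definitions 1-3:

```python
from collections import defaultdict

# Table 1: tuple_id -> (Door, Size, Cylinder, Mileage)
rows = {
    "t1": ("2", "compact", "4", "high"), "t2": ("4", "sub", "6", "low"),
    "t3": ("4", "compact", "4", "high"), "t4": ("2", "compact", "6", "low"),
    "t5": ("4", "compact", "4", "low"),  "t6": ("4", "compact", "4", "high"),
    "t7": ("4", "sub", "6", "low"),      "t8": ("2", "sub", "6", "low"),
}
COL = {"Door": 0, "Size": 1, "Cylinder": 2, "Mileage": 3}

def classes(attrs):
    """Equivalence classes induced by the attribute set attrs."""
    groups = defaultdict(set)
    for tid, row in rows.items():
        groups[tuple(row[COL[a]] for a in attrs)].add(tid)
    return list(groups.values())

cond = classes(["Door", "Size", "Cylinder"])
for dec in classes(["Mileage"]):
    lower = {t for c in cond if c <= dec for t in c}   # Definition 1
    upper = {t for c in cond if c & dec for t in c}    # Definition 2
    print(sorted(dec), sorted(lower), sorted(upper))
```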
The 5 cars t1, t2, t4, t7, t8 belong to the lower approximation Lower_{[Door Size Cylinder]/[Mileage]}. This means that, relying on the information about Door, Size and Cylinder, the data collected is not complete: it is only good enough to build a classification model for these 5 cars. In order to classify t3, t5, t6 (which belong to the boundary region), more information about the cars needs to be collected. Suppose we add the Weight of each car; the new data is presented in Table 2.
Table 2. 8 cars with {Weight, Door, Size, Cylinder, Mileage}

Tuple_id  Weight  Door  Size     Cylinder  Mileage
t1        low     2     compact  4         high
t2        low     4     sub      6         low
t3        med     4     compact  4         high
t4        high    2     compact  6         low
t5        high    4     compact  4         low
t6        low     4     compact  4         high
t7        high    4     sub      6         low
t8        low     2     sub      6         low
Based on the new data, we get the lower approximation, upper approximation and boundary below:

[Door Weight Size Cylinder] = {{t1}, {t2}, {t3}, {t4}, {t5}, {t6}, {t7}, {t8}}
Lower_{[Door Weight Size Cylinder]/[Mileage]} = {t1, t2, t3, t4, t5, t6, t7, t8}
Upper_{[Door Weight Size Cylinder]/[Mileage]} = {t1, t2, t3, t4, t5, t6, t7, t8}
Boundary_{[Door Weight Size Cylinder]/[Mileage]} = Φ

After the Weight information is added, a classification model for all 8 cars can be built. One of the nice features of rough sets theory is that it can tell whether the data is complete or not, based on the data itself. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, rough sets theory can also determine whether there is redundant information in the data and find the minimum data needed for the classification model. This property of rough sets theory is very important for applications where domain knowledge is very limited or data collection is very expensive or laborious, because it ensures that the data collected is just good enough to build a good classification model, without sacrificing the accuracy of the model or wasting time and effort gathering extra information about the objects. Furthermore, rough sets theory classifies all the attributes into three categories: core attributes, reduct attributes and dispensable attributes. Core attributes carry information essential for correctly classifying the data set and should be retained; dispensable attributes are the redundant ones and should be eliminated; while
reduct attributes lie in between: depending on the combination of attributes, a reduct attribute is unnecessary in some combinations and essential in others.

Definition 4: An attribute Cj ∈ C is a dispensable attribute in C with respect to D if Lower[C]/[D] = Lower[C−Cj]/[D]. In Table 2, Lower[Door Weight Size Cylinder]/[Mileage] = Lower[Weight Size Cylinder]/[Mileage], so Door is a dispensable attribute in C with respect to Mileage.

Definition 5: An attribute Cj ∈ C is a core attribute in C with respect to D if Lower[C]/[D] ≠ Lower[C−Cj]/[D]. Since Lower[Door Weight Size Cylinder]/[Mileage] ≠ Lower[Door Size Cylinder]/[Mileage], Weight is a core attribute in C with respect to Mileage.

Definition 6: An attribute Cj ∈ C is a reduct attribute if Cj is part of a reduct.
3 The New Rough Sets Model Based on Database Systems

There are two major limitations of rough sets theory that restrict its suitability in practice:

(1) Rough sets theory uses the strict set-inclusion definition of the lower approximation, which does not consider the statistical distribution or noise of the data in an equivalence class. This drawback of the original rough set model has limited its applications in domains where data tends to be noisy or dirty. Some new models have been proposed to overcome this problem, such as Ziarko's Variable Precision Rough Set Model (VPRS) [10] and our previous work on the Generalized Rough Set Model (GRS Model) [2]. A detailed discussion of these models is beyond the scope of this paper; interested readers are referred to [2,6,7,10].

(2) The other drawback of rough sets theory is its computational inefficiency, which limits its suitability for large data sets. In order to find the reducts, core and dispensable attributes, rough sets methods need to construct all the equivalence classes based on the values of the condition and decision attributes. This is a very time-consuming process that does not scale to the large data sets common in data mining applications.

Our investigation of the inefficiency problem finds that the rough set model is not integrated with relational database systems: many of the basic operations of these computations are performed on flat files rather than utilizing high-performance database set operations. In consideration of this, and influenced by [5], we borrow the main ideas of rough sets theory and redefine them in database terms so as to exploit the very efficient set-oriented database operations. Almost all the operations in the rough sets computation used in our method can be performed using database operations such as Count and Projection. (In this paper, we use Card to denote the Count operation and Π for the Projection operation.) Below we first give our new definitions of core, dispensable and reduct attributes based on database operations, and then present our rough set based feature selection algorithm.

Definition 7: An attribute Cj is a core attribute if it satisfies the condition Card(Π(C−Cj+D)) ≠ Card(Π(C−Cj)).

In Table 2, Card(Π(Door, Size, Cylinder, Mileage)) = 6 and Card(Π(Door, Size, Cylinder)) = 5, so Weight is a core attribute in C with respect to Mileage. We can check whether an attribute Cj is a core attribute using a few SQL operations. We only need to take two projections of the table: one on the attribute set C−Cj+D, and the
other on C−Cj. If the cardinalities of the two projections are the same, then no information is lost in removing attribute Cj; otherwise, Cj is a core attribute. More formally, in database terms, the cardinalities of the two projections differ iff there exist at least two tuples tl and tk such that tl.q = tk.q for every q ∈ C−Cj, while tl.Cj ≠ tk.Cj and tl.D ≠ tk.D. In this case, the projection on C−Cj has one fewer row than the projection on C−Cj+D, because tl and tk, being identical on C−Cj, are combined in this projection; in the projection on C−Cj+D, tl and tk are still distinguishable. So eliminating attribute Cj loses the ability to distinguish tuples tl and tk. Intuitively, this means that some classification information is lost after Cj is eliminated. For example, in Table 2, t5 and t6 have the same values on all the condition attributes except Weight; the two tuples belong to different classes because they differ on Weight. If Weight is eliminated, t5 and t6 become indistinguishable. So Weight is a core attribute for the table.

All the core attributes are an indispensable part of every reduct, so it is very important to have an efficient way to find all the core attributes in order to obtain a reduct, a minimum subset of the entire attribute set. In traditional rough set models, a popular method for finding the core attributes is to construct a decision matrix first and then search all the entries of the matrix for those containing only one attribute; such an attribute is a core attribute [1]. This method is very inefficient, and it is not realistic to construct a decision matrix for millions of tuples, which is a typical situation in data mining applications. We propose a new algorithm based on database operations to get the core attributes of a decision table. Compared with the original rough set approach, our algorithm is efficient and scalable.

Algorithm 1: Core Attributes Algorithm
Input: a decision table T(C, D)
Output: Core, the core attributes of table T
Set Core = ∅
For each attribute Ci ∈ C {
    If Card(Π(C−Ci+D)) ≠ Card(Π(C−Ci)) Then Core = Core ∪ {Ci}
}

Definition 8: An attribute Cj ∈ C is a dispensable attribute with respect to D if the classification of each tuple is not affected by dropping Cj; in database terms, Card(Π(C−Cj+D)) = Card(Π(C−Cj)). This definition means that an attribute is dispensable if each tuple can be classified in the same way whether or not the attribute is present. We can check whether attribute Cj is dispensable with the same two projections as above: one on C−Cj+D and the other on C−Cj. If the cardinalities of the two projections are equal, no information is lost in removing Cj; otherwise, Cj is relevant and should be reinstated. For example, in Table 2, Card(Π(Weight, Size, Cylinder, Mileage)) = 6 and Card(Π(Weight, Size, Cylinder)) = 6, so Door is a dispensable attribute in C with respect to Mileage.

Definition 9: The degree of dependency K(REDU, D) between an attribute set REDU ⊆ C and attribute D in decision table T(C, D) is K(REDU, D) = Card(Π(REDU+D)) / Card(Π(C+D)).
The value K(REDU, D) is the proportion of tuples in the decision table that can be classified by REDU. It characterizes the ability to predict the class D and its complement ¬D from the tuples in the decision table.

Definition 10: A subset of attributes RED (RED ⊆ C) is a reduct of C with respect to D if it is a minimum subset of attributes with the same classification power as the entire set of condition attributes, i.e.:
(1) K(RED, D) = K(C, D), and
(2) K(R', D) ≠ K(RED, D) for every R' ⊂ RED.

For example, Table 2 has two reducts: {Weight, Size} and {Weight, Cylinder}. (In the next section we present the algorithm to find a reduct.)

Definition 11: The merit value of an attribute Cj in C is defined as Merit(Cj, C, D) = 1 − Card(Π(C−Cj+D)) / Card(Π(C+D)). Merit(Cj, C, D) reflects the degree of contribution made by attribute Cj alone to the dependency between C and D. For example, in Table 2, Card(Π(Door, Size, Cylinder, Mileage)) = 6 and Card(Π(Door, Weight, Size, Cylinder, Mileage)) = 8, so Merit(Weight, {Door, Weight, Size, Cylinder}, Mileage) = 1 − 6/8 = 0.25.
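Definitions 7 and 11 amount to counting distinct rows of projections, which a few set comprehensions can mimic. The sketch below is illustrative (the table and column names follow Table 2; the helper names are my own, not the paper's):

```python
# Table 2 as a list of rows: (Weight, Door, Size, Cylinder, Mileage)
COLS = ["Weight", "Door", "Size", "Cylinder", "Mileage"]
ROWS = [
    ("low", 2, "compact", 4, "high"), ("low", 4, "sub", 6, "low"),
    ("med", 4, "compact", 4, "high"), ("high", 2, "compact", 6, "low"),
    ("high", 4, "compact", 4, "low"), ("low", 4, "compact", 4, "high"),
    ("high", 4, "sub", 6, "low"),     ("low", 2, "sub", 6, "low"),
]
C, D = ["Weight", "Door", "Size", "Cylinder"], ["Mileage"]

def card_pi(attrs):
    """Card(Pi(attrs)): number of distinct rows in the projection on attrs."""
    idx = [COLS.index(a) for a in attrs]
    return len({tuple(r[i] for i in idx) for r in ROWS})

def is_core(cj):
    rest = [a for a in C if a != cj]
    return card_pi(rest + D) != card_pi(rest)       # Definition 7

def merit(cj):
    rest = [a for a in C if a != cj]
    return 1 - card_pi(rest + D) / card_pi(C + D)   # Definition 11

print([a for a in C if is_core(a)])   # ['Weight']
print(merit("Weight"))                # 0.25
```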
4 Rough Set Based Feature Selection

All feature selection algorithms fall into two categories: (1) the filter approach and (2) the wrapper approach. In the filter approach, feature selection is performed as a preprocessing step to induction. Some well-known filter feature selection algorithms are RELIEF [4] and PRESET [8]. The filter approach is ineffective in dealing with the second kind of unnecessary feature, redundancy. In the wrapper approach [3], feature selection is "wrapped around" an induction algorithm, so that the bias of the operators that define the search and that of the induction algorithm interact. Though the wrapper approach suffers less from feature interaction, its running time makes it infeasible in practice, especially when there are many features, because it keeps running the induction algorithm on different attribute subsets until a desirable subset is identified.

We intend to keep the algorithm bias as small as possible and to find a subset of attributes that generates good results under a whole suite of data mining algorithms; that is, we focus on algorithm-independent feature selection. Our goal is clear: a reasonably fast algorithm that finds a relevant subset of attributes and effectively eliminates the two kinds of unnecessary attributes. With these considerations in mind, we propose a rough set based filter feature selection algorithm. It has several advantages over existing methods: (1) it is effective and efficient in eliminating irrelevant and redundant features with strong interaction, (2) it is feasible for applications with hundreds or thousands of features, and (3) it is tolerant of inconsistencies in the data.

A decision table may have more than one reduct, and any of them can be used to replace the original table. Finding all the reducts of a decision table is NP-hard [1]. Fortunately, in many real applications it is usually not necessary to find all of them; one is sufficient. A natural question is which reduct is best when there is more than one. The selection depends on the optimality criterion associated with the attributes. If it is possible to assign a cost function to attributes, then the
selection can naturally be based on the combined minimum cost criterion. In the absence of an attribute cost function, the only source of information for selecting the reduct is the contents of the table [10]. In this paper we adopt the criterion that the best reduct is the one with the minimal number of attributes; if there are two or more reducts with the same number of attributes, the reduct with the least number of combinations of values of its attributes is selected.

In our algorithm we first rank the attributes by merit, then adopt backward elimination to remove redundant attributes until a reduct is generated.

Algorithm 2: Compute a minimal attribute subset (reduct)
Input: a decision table T(C, D)
Output: a minimum attribute subset (REDUCT)
1. Run Algorithm 1 to get the core attributes CO of the table
2. REDU = CO
3. AR = C − REDU
4. Compute the merit values of all attributes in AR
5. Sort the attributes in AR by merit value in decreasing order
6. Choose the attribute Cj with the biggest merit value (if several attributes have the same merit value, choose the one with the least number of value combinations with the attributes in REDU)
7. REDU = REDU ∪ {Cj}, AR = AR − {Cj}
8. If K(REDU, D) = 1 then terminate, otherwise go back to Step 4

Many algorithms have been developed to find a reduct, but they all suffer from performance problems because they are not integrated with relational database systems and all the related computations are performed on flat files [6,7]. In our algorithm, all the calculations, such as the core and the merit values, use database set operations. Based on this algorithm we get the reduct {Weight, Size}; a sketch verifying the two reducts of Table 2 appears after Table 4 below. For each reduct we can derive a reduct table from the original table. For example, the reduct table (Table 3) for the reduct {Weight, Size} is created by projecting Table 2 onto the attributes Weight, Size and Mileage; it can still support a correct classification model. {Weight, Size} is a minimum subset and cannot be reduced further without sacrificing the accuracy of the classification model. Suppose we create another table (Table 4) from Table 3 by removing Size: it can no longer distinguish tuples t1, t6 from tuples t2, t8, because these tuples have the same Weight value but belong to different classes, which were distinguishable in the reduct table.

Table 3. Reduct table for {Weight, Size}

Tuple_id  Weight  Size     Mileage
t1, t6    low     compact  high
t2, t8    low     sub      low
t3        med     compact  high
t4, t5    high    compact  low
t7        high    sub      low
Table 4. Reduct table for {Weight}

Tuple_id    Weight  Mileage
t1, t6      low     high
t2, t8      low     low
t3          med     high
t4, t5, t7  high    low
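For a table this small, Definition 10 can also be verified exhaustively. The sketch below reuses card_pi, C and D from the earlier fragment, and reads "same classification power" as the projection test Card(Π(R+D)) = Card(Π(R)) for a consistent table; this reading is an assumption of the sketch, not the paper's literal K:

```python
from itertools import combinations

def determines(attrs):
    """attrs determine D iff appending D adds no new distinct projected rows."""
    return card_pi(list(attrs) + D) == card_pi(list(attrs))

def all_reducts():
    reducts = []
    for r in range(1, len(C) + 1):              # smallest subsets first
        for subset in combinations(C, r):
            # minimal: no previously found reduct may be a proper subset
            if determines(subset) and not any(q < set(subset) for q in reducts):
                reducts.append(set(subset))
    return reducts

print(all_reducts())   # [{'Weight', 'Size'}, {'Weight', 'Cylinder'}]
```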
5 Conclusion

In this paper we present a new rough set model based on database operations. Most rough set models are not integrated with database systems: many computation-intensive operations, such as generating the core, reducts and induced rules, are performed on flat files, which limits their applicability to the large data sets found in data mining applications. We borrow the main ideas of rough sets theory and redefine them in database terms to take advantage of the very efficient set-oriented database operations, and we present a new set of algorithms to calculate the core and reducts under this model. Our feature selection algorithm identifies a reduct efficiently and reduces the data set significantly without losing essential information. Almost all the operations used in generating the core, reducts, etc. can be performed with database set operations such as Count and Projection. Since our new rough set model is designed around database set operations, our method is very efficient and scalable compared with the traditional rough set based data mining approach.
References

[1] Cercone, N., Ziarko, W., Hu, X., Rule Discovery from Databases: A Decision Matrix Approach, in: Ras, Z., Zemankova, M. (eds), Methodologies for Intelligent Systems, 1996
[2] Hu, X., Cercone, N., Han, J., Ziarko, W., GRS: A Generalized Rough Sets Model, in: Lin, T.Y., Yao, Y.Y., Zadeh, L. (eds), Data Mining, Rough Sets and Granular Computing, Physica-Verlag
[3] John, G., Kohavi, R., Pfleger, K., Irrelevant Features and the Subset Selection Problem, in: Proc. ML-94, 1994
[4] Kira, K., Rendell, L.A., The Feature Selection Problem: Traditional Methods and a New Algorithm, in: Proc. AAAI-92
[5] Kumar, A., A New Technique for Data Reduction in a Database System for Knowledge Discovery Applications, Journal of Intelligent Systems, 10(3)
[6] Lin, T.Y., Yao, Y.Y., Zadeh, L. (eds), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, 2002
[7] Lin, T.Y., Cao, H., Searching Decision Rules in Very Large Databases Using Rough Set Theory, in: Ziarko, W., Yao, Y. (eds), Rough Sets and Current Trends in Computing
[8] Modrzejewski, M., Feature Selection Using Rough Sets Theory, in: Proc. ECML-93
[9] Pawlak, Z., Rough Sets, International Journal of Information and Computer Science, 11(5), 1982
[10] Ziarko, W., Variable Precision Rough Set Model, Journal of Computer and System Sciences, Vol. 46, No. 1, 1993
A Rough Set and Rule Tree Based Incremental Knowledge Acquisition Algorithm

Zheng Zheng, Guoyin Wang, and Yu Wu
Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, P.R. China
[email protected]
Abstract. As a special way in which human brains learn new knowledge, incremental learning is an important topic in AI. It is a goal of many AI researchers to find an algorithm that can learn new knowledge quickly, based on original knowledge learned before, such that the knowledge it acquires is effective in real use. In this paper, we develop a rough set and rule tree based incremental knowledge acquisition algorithm. It can learn from a domain data set incrementally. Our simulation results show that our algorithm can learn more quickly than classical rough set based knowledge acquisition algorithms, and that the performance of the knowledge it learns can be the same as or even better than that of classical rough set based knowledge acquisition algorithms. The simulation results also show that our algorithm outperforms ID4 in many aspects.
1 Introduction

As a special kind of intelligent system, the human brain has a superb ability to learn and discover knowledge. It can learn new knowledge incrementally and repeatedly, and this way of learning is sometimes essential. For example, when we learn new knowledge at a university, we need not learn again the knowledge we already acquired in elementary school and high school; we can update our knowledge structure according to the new knowledge. Based on this understanding, AI researchers have worked hard to simulate this special way of learning. Schlimmer and Fisher developed a decision tree based incremental learning algorithm, ID4 [1, 2]. Utgoff designed the ID5R algorithm [2], an improvement of ID4. G.Y. Wang developed several parallel neural network architectures (PNN's) [3, 4, 5], Z.T. Liu presented an incremental arithmetic for the smallest reduction of attributes [6], and Z.H. Wang developed an incremental algorithm of rule extraction based on concept lattice [7].

In recent years, many rough set based algorithms for the smallest or smaller attribute reductions and for knowledge acquisition have been developed. They are almost all based on static data. However, real databases are always dynamic, so many researchers suggest that knowledge acquisition in databases should be incremental. Incremental arithmetic for the smallest reduction of attributes [6] and incremental
algorithms of rule extraction based on concept lattice [7] have been developed, but there are few incremental rough set based knowledge acquisition algorithms. On the basis of these earlier results, we develop a rough set and rule tree based incremental knowledge acquisition algorithm (RRIA) in this paper. Simulation results show that our algorithm learns more quickly than classical rough set based knowledge acquisition algorithms, and that the performance of the knowledge it learns can be the same as or even better than theirs. We also compare our algorithm with ID4; the results show that the rule quality and the recognition rate of our algorithm are both better than those of ID4.
2 Basic Concepts of Decision Table

For the convenience of description, we introduce some basic notions first.

Definition 1. A decision table is defined as S = <U, R, V, f>, where U is a finite set of objects and R = C ∪ D is a finite set of attributes; C is the condition attribute set and D is the decision attribute set. With every attribute a ∈ R, a set of its values Va is associated. Each attribute determines an information function f : U × R → V with f_x(a) = a(x).

Definition 2. Let S = <U, R, V, f> denote a decision table and let B ⊆ C. A rule set generated from S is defined as F = {f_1^{B,d}, f_2^{B,d}, …, f_r^{B,d}}, where

f_i^{B,d} = { ∧ a → d | a ∈ C and d ∈ D }  (i = 1, …, r)

and r is the number of rules in F. In f_i^{B,d}, if some attributes are reduced, the values of the reduced attributes are taken to be "*", which is different from any possible value of these attributes. For example, in a decision system with 5 condition attributes (a1, …, a5), (a1=1) ∧ (a3=2) ∧ (a4=3) → d=4 is a rule. In this paper, we write it as (a1=1) ∧ (a2=*) ∧ (a3=2) ∧ (a4=3) ∧ (a5=*) → d=4, according to Def. 2.
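As a small illustration of this rule format (the dictionary encoding here is an assumption for the sketch, not the authors' representation), matching a rule against an object simply treats "*" as a wildcard:

```python
# The example rule of Def. 2: "*" marks reduced condition attributes
rule = {"a1": 1, "a2": "*", "a3": 2, "a4": 3, "a5": "*"}
decision = 4   # d = 4

def matches(rule, obj):
    """An object satisfies the rule iff it agrees on every non-reduced attribute."""
    return all(v == "*" or obj[a] == v for a, v in rule.items())

print(matches(rule, {"a1": 1, "a2": 9, "a3": 2, "a4": 3, "a5": 0}))  # True
```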
3 Related Incremental Algorithms

In order to compare our algorithm with other related algorithms, we first introduce and discuss some decision table based incremental algorithms.

3.1 ID4 Algorithm [1, 2]

Based on the concept of a decision tree, Schlimmer and Fisher designed the incremental learning algorithm ID4. Decision trees are tree structures over rule sets; they allow faster searching and matching than an ordinary rule set. Compared with an ordinary rule set, a decision tree gives each object a unique matching path, so using a decision tree avoids conflicts between rules and accelerates searching and matching. However, decision trees also have drawbacks. After recursive partitioning, some data subsets are too small to express some knowledge or concept. Besides, the constructed tree can suffer from problems such as overlap, fragmentation and replication [12]. The results of our experiments also show that a rule set expressed as a decision tree
has a lower recognition rate than most rough set based knowledge acquisition algorithms. So, we should find a method that has the merits of decision trees while avoiding their drawbacks.

3.2 ID5R Algorithm [2]

In 1988, to improve the learning ability of ID4, Utgoff developed the ID5 decision algorithm, later updated to ID5R. The decision trees generated by ID5R are the same as those generated by ID3. Each node of a decision tree generated by ID5R must store a great deal of information, so the space complexity of ID5R is higher than that of ID4. Besides, when the nodes of the decision tree are elevated too many times, the time complexity becomes too high.

3.3 Incremental Arithmetic for the Smallest Reduction of Attributes [6]

Based on the known smallest reduction of attributes, this algorithm searches for the new smallest reduction when attributes are added to the data set.

3.4 Incremental Algorithm of Rule Extraction Based on Concept Lattice [7]

Concept lattices are an effective tool for data analysis and rule generation. Based on a concept lattice, it is easy to establish dependency and causality models, and a concept lattice can clearly describe generalization and specialization relations. This algorithm is based on the incremental learning idea, but it has some problems in knowledge acquisition: first, its complexity is too high; second, constructing the concept lattice and its Hasse diagram takes much time and space; third, a rule set cannot be extracted from a concept lattice directly.
4 Rough Set and Rule Tree Based Incremental Knowledge Acquisition Algorithm (RRIA)

Based on the above discussion of incremental learning algorithms and our understanding of them, we develop a rough set and rule tree based incremental knowledge acquisition algorithm in this section. We first introduce the concept of a rule tree and then present our incremental learning algorithm; finally, we analyze the algorithm's complexity and performance. For the convenience of description, let OTD denote the original training data set, ITD the incremental training data set, ORS the original rule set, and ORT the original rule tree.

4.1 Rule Tree

4.1.1 Basic Concept of Rule Tree

Definition 3. A rule tree is defined as follows:
(1) A rule tree is composed of one root node, some leaf nodes and some middle nodes.
(2) The root node represents the whole rule set.
(3) Each path from the root node to a leaf node represents a rule.
(4) Each middle node represents an attribute test. Each possible value of an attribute in the rule set is represented by a branch, and each branch generates a new child node. If an attribute is reduced in some rules, then a special branch is needed to represent it, and the value of the attribute in these rules is taken to be "*", which is different from any possible value of the attribute (cf. Definition 2).

4.1.2 Algorithms for Building a Rule Tree

Algorithm 1: CreateRuleTree(ORS)
Input: ORS; Output: ORT.
Step 1. Arrange the condition attributes in ascending order of the number of their values in ORS; each attribute then becomes the discernibility attribute of one layer of the tree, from top to bottom.
Step 2. For each rule R of ORS { AddRule(R) }

Algorithm 2: AddRule(R)
Input: a rule tree and a rule R; Output: a new rule tree updated by rule R.
Step 1. CN ← root node of the rule tree;
Step 2. For i = 1 to m /* m is the number of layers in the rule tree */
{ If there is a branch of CN representing the i-th discernibility attribute value of rule R, then CN ← node I; /* node I is the node generated by the branch */
  else { create a branch of CN to represent the i-th discernibility attribute value; CN ← node J /* node J is the node generated by the branch */ } }

We place the attribute with the fewest attribute values as the discernibility attribute of the highest layer of the rule tree. Under this arrangement the number of nodes in each layer is the least, so the number of nodes searched is the least and the searching and matching process is sped up.

4.2 Rough Set and Rule Tree Based Incremental Knowledge Acquisition Algorithm (RRIA)

Algorithm 3: Rough set and rule tree based incremental knowledge acquisition algorithm
Input: OTD, ITD = {object1, object2, …, objectt}; Output: a rule tree.
Step 1. Use a rough set based algorithm to generate ORS from OTD;
Step 2. ORT = CreateRuleTree(ORS);
Step 3. For i = 1 to t
(1) CN ← root node of ORT;
(2) Create array Path[]; /* Path records the current searching path in MatchRule(); its initial value is NULL */
(3) MatchRules = MatchRule(CN, objecti, Path);
(4) R = SelectRule(MatchRules);
(5) If R exists and the decision value of R differs from objecti's, then UpdateTree(objecti, R); /* R conflicts with objecti */
(6) If R doesn't exist, then AddRule(objecti).
Here, algorithm AddRule is described in Section 4.1, and the algorithms MatchRule, SelectRule and UpdateTree are illustrated below.

4.3 Child Algorithms of RRIA

In the incremental learning process, we first need to match the object to be learned incrementally against the rules of ORT. The matching algorithm is as follows.

Algorithm 4: MatchRule(CN, Obj, Path)
Input: the current matched node CN, the current object Obj to be learned incrementally, the current searching path Path, and ORT;
Output: MatchRules. /* records the rules of ORT matched by Obj */
Step 1. matchedrulenum = 0;
Step 2. For I = 1 to number of child nodes of CN
{ If the branch generating child node I represents the current discernibility attribute value of Obj or represents a possibly reduced attribute, then
  { If node I is a leaf node, then { MatchRules[matchedrulenum++] ← Path; Path = NULL }
    else { CN ← node I; Path[layerof(I)] ← value(I);
    /* value(I) is the discernibility attribute value represented by the branch generating node I, and layerof(I) is the layer of node I in ORT; "*" represents a possibly reduced attribute value (cf. Definition 2) */
    MatchRule(node I, Obj, Path) } }
  else Path = NULL } /* neither of these two kinds of child nodes exists */

This algorithm is recursive. The maximal number of nodes of ORT to be checked is m×b, so the complexity of this algorithm is O(mb). There may be more than one rule in MatchRules. There are several methods to select the best rule from MatchRules, such as the high-confidence-first principle [8], the majority principle [8] and the minority principle [8]. Here we design a new principle, the majority principle of incremental learning data, which is more suitable for our incremental learning algorithm.

Algorithm 5: SelectRule(MatchRules)
Input: the output of algorithm MatchRule, i.e. MatchRules; Output: the final matched rule.
During the incremental learning process, we consider the incremental training data more important than the original training data. The importance degree of a rule is increased by 1 each time it matches an unseen object in the incremental learning process. We choose the rule with the highest importance degree as the final matched rule.

If we cannot find a rule matched completely to the object to be learned, but we can find a rule R matching its condition attribute values, that is, the object to be learned conflicts with rule R, then we should update the rule tree. The algorithm is as follows. Suppose {C1, C2, …, Cp} is the result of arranging the condition attributes reduced in R in ascending order of the number of their values.

Algorithm 6: UpdateTree(Obj, R)
Input: the current object Obj to be learned incrementally, rule R, ORT, OTD and ITD; Output: a new rule tree.
Step 1. Check all objects of OTD and the objects already learned incrementally from ITD, and suppose object1, object2, …, objectq are all the objects matching rule R;
Step 2. For i = 1 to q { dis_num[i] = 0; For j = 1 to p { mark[i][j] = 0 } }
Step 3. For i = 1 to q
  For j = 1 to p
    If Obj and objecti have different values on attribute Cj, then { mark[i][j] = 1; dis_num[i]++ }
Step 4. For i = 1 to p { delattr[i] = Ci }
Step 5. For i = 1 to q
{ If dis_num[i] = 0, then { delattr = NULL; go to Step 7 }
  else If dis_num[i] = 1, then { j = 1; While (mark[i][j++] ≠ 1); delattr = delattr − {Cj} } }
Step 6. For i = 1 to q
{ discerned = FALSE;
  If dis_num[i] > 1, then { j = 1; While (discerned = FALSE) and (j ≤ p) … } }
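The rule-tree machinery of Algorithms 1, 2 and 4 can be condensed into a short sketch. The code below is one illustrative Python reading (the class, the leaf encoding, and returning matched decisions instead of full paths are simplifying assumptions, not the authors' implementation):

```python
class RuleTree:
    """One layer per condition attribute; '*' branches stand for reduced attributes."""

    def __init__(self, rules, cond_attrs):
        # Algorithm 1: fewer distinct values -> higher layer, keeping the tree narrow
        self.attrs = sorted(cond_attrs,
                            key=lambda a: len({r[a] for r in rules if r[a] != "*"}))
        self.root = {}
        for r in rules:
            self.add_rule(r)

    def add_rule(self, rule):
        # Algorithm 2: descend one layer per attribute, creating branches as needed
        node = self.root
        for a in self.attrs:
            node = node.setdefault(rule[a], {})
        node["decision"] = rule["d"]

    def match(self, obj, node=None, depth=0):
        # Algorithm 4: follow both the exact-value branch and the '*' branch
        node = self.root if node is None else node
        if depth == len(self.attrs):
            return [node["decision"]]
        hits = []
        for branch in (obj[self.attrs[depth]], "*"):
            if branch in node:
                hits += self.match(obj, node[branch], depth + 1)
        return hits

rules = [{"a1": 1, "a2": "*", "d": "yes"}, {"a1": 2, "a2": 3, "d": "no"}]
tree = RuleTree(rules, ["a1", "a2"])
print(tree.match({"a1": 1, "a2": 3}))   # ['yes']
```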
5 Experiment Results

In order to test the validity and ability of the RRIA algorithm, we implemented all the algorithms in VC6.0. We compare RRIA with a classical rough set based knowledge acquisition algorithm and with the ID4 algorithm. Some classical data sets from UCI and other data sets used by many researchers are used in our experiments.

Experiment 1: Comparison among RRIA, the classical rough set based knowledge acquisition algorithm, and the ID4 algorithm.
Step 1. Choose β% of the objects of the total data set as OTD. Use the classical rough set based knowledge acquisition algorithm to generate ORS;
Step 2. Build ORT from ORS with Algorithm 1;
Step 3. Choose α% of the objects of the total data set as ITD, and incrementally learn ITD with RRIA;
Step 4. Test the recognition rate of the rule set generated in Step 3 on the total data set;
Step 5. Generate a rule set from OTD+ITD with the classical rough set based knowledge acquisition algorithm used in Step 1, and test its recognition rate on the total data set (OTD+ITD denotes the union of OTD and ITD);
Step 6. Use the ID3 algorithm to build a decision tree from OTD;
Step 7. Incrementally learn ITD with the ID4 algorithm;
Step 8. Test the recognition rate of the rule set generated in Step 7 on the total data set.

The experiment results are shown in Table 1. A1 is the discernibility matrix based knowledge acquisition algorithm [8] and A2 is the general knowledge acquisition algorithm [8]. For the convenience of comparison, let n be the number of objects of a data set; α% of each data set is selected as its ITD and β% as its OTD; t is the running time in ms (of the incremental learning process of RRIA, of A1 or A2, and of ID4 respectively); c% is the correct recognition rate of the generated rules; l is the average length of the generated rules; and a is the number of generated rules.

From Experiment 1 we find that the RRIA algorithm is faster than the classical rough set based knowledge acquisition algorithm, while there is little difference between their recognition rates; in some cases, the recognition rate of RRIA is even better. The results also show that the recognition rate of our algorithm is better than that of ID4. Besides, the average length of the rules generated by RRIA is shorter than ID4's, and the number of rules generated by RRIA is smaller than ID4's.

Table 1. Comparison among the algorithms

                                    RRIA                 Classical RS alg.      ID4
Data set       Alg.  β   α    t    c     l(a)        t     c     l(a)        t    c     l(a)
AUTO MPG       A1    40  10   <1   64.8  2.07(165)   170   62.8  1.01(156)   <1   56.0  2.71(164)
AUTO MPG       A1    33  17   <1   65.1  1.94(159)   170   62.8  1.03(156)   10   60.6  2.51(162)
AUTO MPG       A1    25  25   10   61.6  3.78(166)   170   62.8  1.03(156)   10   51.0  4.57(166)
BREAST CANCER  A1    40  10   10   87.7  1.33(192)   2013  88.1  1.04(233)   10   52.5  2.77(322)
BREAST CANCER  A1    33  17   10   88.7  2.21(179)   2013  88.1  1.04(233)   20   52.1  3.88(335)
BREAST CANCER  A1    25  25   30   67.0  6.02(276)   2013  88.1  1.04(233)   30   50.1  6.74(342)
HUMIDITY       A2    40  10   50   84.2  3.11(170)   471   83.6  1.79(161)   10   69.5  4.27(172)
HUMIDITY       A1    33  17   60   84.1  3.88(174)   471   83.6  1.79(161)   10   65.6  4.93(177)
HUMIDITY       A2    25  25   111  83.7  4.60(195)   471   83.6  1.79(161)   <1   65.1  7.21(188)
6 Conclusion

Incremental learning is an important topic in AI. Many rough set based algorithms for the smallest or smaller attribute reductions and for rule sets over static data have been developed in past years. Unfortunately, databases in real life are always dynamic, so many researchers suggest that knowledge acquisition in databases should be incremental. Incremental learning algorithms can be used, with higher speed, in the fields where classical algorithms apply, and also in some other special fields. Incremental arithmetic for the smallest reduction of attributes and incremental algorithms of rule extraction based on concept lattice have already been studied, but there are few incremental rough set based knowledge acquisition algorithms. In this paper, after studying some incremental learning algorithms, we develop the RRIA algorithm. Through analysis and simulation tests, we find that our algorithm is faster than classical rough set based knowledge acquisition algorithms, while the performance of the knowledge it learns can be the same or even better. We also compare RRIA with the ID4 algorithm; the results show that several capabilities of RRIA, such as rule length, number of rules and recognition rate, are also better than ID4's.

Acknowledgements. This paper is partially supported by the National Natural Science Foundation of P.R. China (No. 69803014), the National Climb Program of P.R. China, the Foundation for University Key Teachers of the State Education Ministry of P.R. China (No. GG-520-10617-1001), the Scientific Research Foundation for Returned Overseas Chinese Scholars of the State Education Ministry of P.R. China, and the Application Science Foundation of Chongqing.
References

1. Schlimmer, J. C., Fisher, D., A Case Study of Incremental Concept Induction, in: Proceedings of the Fifth National Conf. on Artificial Intelligence, Los Altos, 1986.
2. Lu, R. Y., Artificial Intelligence, Scientific Press, 2002.
3. Wang, G. Y., Shi, H. B., Deng, W., A Parallel Neural Network Architecture Based on NARA Model and Sieving Method, Chinese Journal of Computers, 19(9), 1996.
4. Wang, G. Y., Nie, N., PMSN: A Parallel Multi-Sieving Neural Network Architecture, Journal of Computer Research and Development, 36(Suppl.), pp. 21–25, 1999.
5. Wang, G. Y., Shi, H. B., Parallel Neural Network Architectures and Their Applications, in: Proceedings of Int. Conf. on Neural Networks, Perth, Australia, pp. 1234–1239, 1995.
6. Liu, Z. T., An Incremental Arithmetic for the Smallest Reduction of Attributes, Chinese Journal of Electronics, 27(11), pp. 96–98, 1999.
7. Wang, Z. H., Liu, Z. T., General and Incremental Algorithms of Rule Extraction Based on Concept Lattice, Chinese Journal of Computers, 22(1), pp. 66–70, 1999.
8. Wang, G. Y., Rough Set Theory and Knowledge Acquisition, Xi'an: Xi'an Jiaotong University Press, 2001.
9. Su, J., Gao, J., Metainformation Based Rough Set Incremental Rule Extraction Algorithm, Pattern Recognition and Artificial Intelligence, 14(4), pp. 428–433, 2001.
Comparison of Conventional and Rough K-Means Clustering Pawan Lingras, Rui Yan, and Chad West Department of Math and Computer Science, Saint Mary’s University Halifax, Nova Scotia, Canada, B3H 3C3.
Abstract. This paper compares the results of clustering obtained using a modified K-means algorithm with the conventional clustering process. The modifications to the K-means algorithm are based on the properties of rough sets. The resulting clusters are represented as interval sets. The paper describes results of experiments used to create conventional and interval set representations of clusters of web users on three educational web sites. The experiments use secondary data consisting of access logs from the World Wide Web. This type of analysis is called web usage mining, which involves applications of data mining techniques to discover usage patterns from the web data. Analysis shows the advantages of the interval set representation of clusters over conventional crisp clusters. Keywords: Rough Sets, Clustering, K-means, Web Usage mining, Unsupervised Learning.
1 Introduction
Clustering analysis is one of the important functions in data mining, which groups users or data items with similar characteristics. Joshi and Krishnapuram [1] argued that the clustering operation in many applications involves modeling an unknown number of overlapping sets. They proposed the use of fuzzy clustering [2, 3, 4] for grouping web documents. Lingras [5] described how a rough set theoretic clustering scheme could be represented using a rough set genome. The resulting genetic algorithms (GAs) were used to evolve groupings of highway sections represented as interval or rough sets. Lingras [6] applied the unsupervised rough set clustering based on GAs for grouping web users of a first year university course. He hypothesized that there are three types of visitors: studious, crammers, and workers. Studious visitors download notes from the site regularly. Crammers download most of the notes before an exam. Workers come to the site to finish assigned work such as lab and class assignments. Generally, the boundaries of these clusters will not be precise. Preliminary experimentation by Lingras [6] illustrated the feasibility of rough set clustering for developing user profiles on the web. However, the clustering process based on GAs seemed computationally expensive for scaling to a larger data set. Lingras and West [7] provided a theoretical and experimental analysis of a modified K-means clustering based on the properties of rough sets. The modified K-means approach is suitable for large data sets. This paper compares the clusters obtained
from conventional and modified K-means algorithms. The three web sites used for the experimentation catered to two first year courses and one second year course.
2 Review of K-Means

Let us assume that the objects are represented by m-dimensional vectors. The objective is to assign these n objects to k clusters. Each of the clusters is also represented by an m-dimensional vector, which is the centroid vector for that cluster. The process begins by randomly choosing k objects as the centroids of the k clusters. The objects are assigned to one of the k clusters based on the minimum value of the distance d(x, v) between the object vector v and the cluster vector x. The distance d(x, v) can be the standard Euclidean distance. After the assignment of all the objects to various clusters, the new centroid vectors of the clusters are calculated as:

x_j = ( Σ_{v ∈ x} v_j ) / |x| , where 1 ≤ j ≤ m.   (1)

Here the sum runs over the objects v assigned to cluster x, and |x| is the cardinality of cluster x. The process stops when the centroids of the
clusters stabilize, i.e. the centroid vectors from the previous iteration are identical to those generated in the current iteration.
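For reference, this conventional procedure can be written in a few lines. The following is a minimal NumPy sketch of eq. (1) with the initialization and stopping rule described above:

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """Conventional K-means: n objects with m features, centroids per eq. (1)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each object to its nearest centroid (Euclidean distance)
        labels = np.argmin(np.linalg.norm(data[:, None] - centroids, axis=2), axis=1)
        new = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):   # centroids have stabilized
            return new, labels
        centroids = new
    return centroids, labels
```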
3 Modified K-Means Based on Rough Set Theory Rough sets were proposed using equivalence relations. However, it is possible to define a pair of upper and lower bounds
(A(X), Ā(X)), or a rough set, for every set
X ⊆ U as long as the properties specified by Pawlak [8] are satisfied. Yao et al. [10] described various generalizations of rough sets by relaxing the assumptions of an underlying equivalence relation. Skowron and Stepaniuk [9] discussed a similar generalization of rough set theory. If one adopts a more restrictive view of rough set theory, the rough sets developed in this paper may have to be looked upon as interval sets. Incorporating rough sets into K-means clustering requires the addition of the concept of lower and upper bounds. Calculations of the centroids of clusters need to be modified to include the effects of lower as well as upper bounds. The modified centroid calculations for rough sets are given by:
x_j = w_lower × ( Σ_{v ∈ A(x)} v_j / |A(x)| ) + w_upper × ( Σ_{v ∈ Ā(x)−A(x)} v_j / |Ā(x) − A(x)| ),  if Ā(x) − A(x) ≠ ∅
x_j = w_lower × ( Σ_{v ∈ A(x)} v_j / |A(x)| ),  otherwise   (2)

where 1 ≤ j ≤ m, and A(x) and Ā(x) denote the lower and upper bounds of cluster x. The parameters w_lower and w_upper correspond to the relative importance of the lower and upper bounds. It can be shown that eq. 2 is a generalization of eq. 1. If the upper bound of each cluster were equal to its lower bound, the clusters would be conventional clusters; the boundary region Ā(x) − A(x) would be empty, the second term in the equation would be ignored, and eq. 2 would reduce to the conventional K-means calculation given by eq. 1.

Following rough mereology [9], rough sets are used as patterns for classification. Relevant patterns are discovered by tuning the parameters in such a way that the lower approximation, the boundary region, and the complement of the upper approximation are relevant, i.e. they are sufficiently included in (or close to) the target concepts.

The next step in the modification of the K-means algorithm for rough sets is to design criteria that determine whether an object belongs to the upper or lower bound of a cluster. For each object vector v, let d(v, x_i) be the distance between v and the centroid of cluster x_i. The differences d(v, x_i) − d(v, x_j), 1 ≤ i, j ≤ k, are used to determine the membership of v as follows:

1. If d(v, x_i) is the minimum for 1 ≤ i ≤ k, and d(v, x_i) − d(v, x_j) ≤ threshold for any pair (i, j), then v ∈ Ā(x_i) and v ∈ Ā(x_j). Furthermore, v is not part of any lower bound.
2. Otherwise, v ∈ A(x_i) such that d(v, x_i) is the minimum for 1 ≤ i ≤ k. In addition, by the properties of rough sets, v ∈ Ā(x_i).
The rough K-means algorithm described above depends on three parameters: w_lower, w_upper, and threshold. Experimentation with various values of the parameters is necessary to develop a reasonable rough set clustering. This paper describes the design and results of such experiments.
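A sketch of one iteration of the modified algorithm is given below. This is an illustrative reading of eq. (2) and the two criteria, in which the membership test is implemented as "some other centroid lies within threshold of the nearest one"; the parameter values are placeholders, not the paper's:

```python
import numpy as np

def rough_kmeans_iteration(data, centroids, w_lower=0.7, w_upper=0.3, threshold=0.1):
    """One pass: split objects into lower bounds A(x_i) and boundary regions,
    then update the centroids by eq. (2)."""
    k = len(centroids)
    lower = [[] for _ in range(k)]      # A(x_i)
    boundary = [[] for _ in range(k)]   # upper bound minus lower bound
    for v in data:
        d = np.linalg.norm(centroids - v, axis=1)
        i = int(np.argmin(d))
        # criterion 1: other centroids within `threshold` of the nearest one
        near = [j for j in range(k) if j != i and d[j] - d[i] <= threshold]
        if near:                        # v joins several upper bounds, no lower bound
            for j in [i] + near:
                boundary[j].append(v)
        else:                           # criterion 2: lower bound (hence upper bound)
            lower[i].append(v)
    new_centroids = np.empty_like(centroids, dtype=float)
    for i in range(k):
        lo = np.mean(lower[i], axis=0) if lower[i] else centroids[i]
        if boundary[i]:
            new_centroids[i] = w_lower * lo + w_upper * np.mean(boundary[i], axis=0)
        else:
            # the "otherwise" branch of eq. (2) as printed; with w_lower = 1
            # this coincides with the conventional update of eq. (1)
            new_centroids[i] = w_lower * lo
    return new_centroids, lower, boundary
```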
4 Study Data and Design of the Experiment
The study data was obtained from the web access logs of three courses. These courses represent a sequence of required courses in the computing science programme at Saint Mary's University. Two courses were for first year students; the third course was for second year students. More details about the courses can be found in [11].
Lingras et al. [6, 7, 11] showed that visits from students attending these courses could fall into one of the following three categories:
1. Studious: these visitors download the current set of notes. Since they download a limited, current set of notes, they probably study class-notes on a regular basis.
2. Crammers: these visitors download a large set of notes. This indicates that they have stayed away from the class-notes for a long period of time and are planning for pre-test cramming.
3. Workers: these visitors are mostly working on class or lab assignments or accessing the discussion board.

Table 1. Centroids of interval clusters

Course  Cluster Name  Campus Access  Day/Night Time  Lab Day  Hits  Document Requests
First   Studious      0.67           0.75            0.43     3.19  3.21
First   Crammers      0.61           0.71            0.33     4.29  9.53
First   Workers       0.67           0.75            0.49     1.00  0.86
Second  Studious      0.14           0.69            0.05     0.63  0.58
Second  Crammers      0.60           0.72            0.32     2.97  4.31
Second  Workers       0.97           0.91            0.88     0.64  0.47
Third   Studious      0.70           0.74            0.48     4.18  4.12
Third   Crammers      0.55           0.72            0.42     5.54  11.18
Third   Workers       0.62           0.75            0.51     1.54  1.11
The web logs were preprocessed to create an appropriate representation of each user corresponding to a visit. Lingras et al. [6, 7, 11] decided to use the following attributes for representing each visitor:
1. On campus/off campus access.
2. Day time/night time access: 8 a.m. to 8 p.m. was considered to be the daytime.
3. Access during lab/class days or non-lab/class days: all the labs and classes were held on Tuesdays and Thursdays. Visitors on these days are more likely to be workers.
4. Number of hits.
5. Number of class-notes downloads.

The first three attributes had binary values of 0 or 1. The last two values were normalized. The distribution of the number of hits and the number of class-notes downloads was analyzed to determine appropriate weight factors, and different weighting schemes were studied. The numbers of hits were set to be in the range [0, 10]. Since the class-notes were the focus of the clustering, the last variable was assigned higher importance, with values ranging from 0 to 20.
Table 2. Average vectors for the lower bound
Course  Cluster Name  Campus Access  Day/Night Time  Lab Day  Hits  Document Requests
First   Studious      0.67           0.75            0.43     3.27  3.30
First   Crammers      0.61           0.72            0.33     4.31  9.67
First   Workers       0.67           0.75            0.49     0.97  0.83
Second  Studious      0              0.68            0        0.58  0.55
Second  Crammers      0.60           0.72            0.32     3.03  4.48
Second  Workers       1              0.95            1        0.59  0.43
Third   Studious      0.70           0.74            0.48     4.18  4.12
Third   Crammers      0.55           0.72            0.42     5.54  11.18
Third   Workers       0.62           0.75            0.51     1.50  1.11

5 Results and Discussion
Tables 1–5 show the centroid and/or average vectors for the three data sets. It was possible to classify the three clusters as studious, workers, and crammers from the results obtained using the conventional as well as the modified K-means algorithm. The crammers had the highest number of hits and class-notes in every data set. The average numbers of notes downloaded by crammers varied from one set to another.

Table 3. Average vectors for the upper bound

Course  Cluster Name  Campus Access  Day/Night Time  Lab Day  Hits  Document Requests
First   Studious      0.67           0.75            0.43     2.94  2.97
First   Crammers      0.61           0.72            0.32     4.23  9.10
First   Workers       0.67           0.75            0.48     1.10  0.95
Second  Studious      0.57           0.71            0.18     0.80  0.66
Second  Crammers      0.63           0.71            0.34     2.81  3.78
Second  Workers       0.87           0.80            0.51     0.79  0.61
Third   Studious      0.69           0.74            0.48     3.97  3.67
Third   Crammers      0.56           0.71            0.43     5.46  10.70
Third   Workers       0.62           0.75            0.51     1.66  1.20
The studious visitors downloaded the second highest number of notes. The distinction between workers and studious visitors for the second course was also based on other attributes. For example, in the second data set, the workers were more prone to come on lab days, from on-campus locations, and during the daytime. It is also interesting to note that the crammers had higher ratios of document requests to hits. The workers, on the other hand, had the lowest ratios of document requests to hits. The lower bounds seemed to provide more distinctive vectors than any other representations of
clusters. It is interesting to note that the centroid vectors of the conventional clusters seemed to lie between the boundary regions and the lower bounds of the interval clusters. Table 6 shows the cardinalities of the lower bounds, upper bounds, boundary regions, and conventional clusters. The cardinalities of lower bounds, upper bounds, and conventional clusters were comparable. For the first two data sets, the cardinality of a conventional cluster was in most cases more than that of the lower bound and less than that of the upper bound. The actual numbers in each cluster vary based on the characteristics of each course.

Table 4. Average vectors for the boundary region

Course  Cluster Name  Campus Access  Day/Night Time  Lab Day  Hits  Document Requests
First   Studious      0.67           0.75            0.43     2.30  2.33
First   Crammers      0.62           0.81            0.28     3.82  6.32
First   Workers       0.67           0.75            0.44     2.16  1.96
Second  Studious      0.80           0.72            0.25     0.89  0.71
Second  Crammers      0.72           0.69            0.37     2.28  2.11
Second  Workers       0.80           0.72            0.25     0.89  0.71
Third   Studious      0.67           0.73            0.50     3.50  2.67
Third   Crammers      0.60           0.70            0.50     4.88  7.43
Third   Workers       0.68           0.74            0.50     3.33  2.01
Table 5. Average vectors for the conventional clustering

Course  Cluster Name  Campus Access  Day/Night Time  Lab Day  Hits  Document Requests
First   Studious      0.67           0.76            0.44     2.97  2.78
First   Crammers      0.62           0.72            0.32     4.06  8.57
First   Workers       0.67           0.74            0.49     0.98  0.85
Second  Studious      0.56           0.00            0.27     0.76  0.66
Second  Crammers      0.62           0.72            0.33     2.90  4.06
Second  Workers       0.73           1.00            0.44     0.71  0.56
Third   Studious      0.69           0.75            0.50     3.87  3.15
Third   Crammers      0.60           0.71            0.44     5.30  10.20
Third   Workers       0.62           0.74            0.50     1.41  1.10
For example, the first term course had significantly more workers than studious visitors, while the second term course had more studious visitors than workers. The increase in the percentage of studious visitors in the second term seems to be a natural progression.
Table 6. Cardinalities of the interval sets and conventional clusters
Course  Cluster Name  Lower bound  Upper bound  Boundary  Conventional
First   Studious      1333         2023         690       1814
First   Crammers      281          339          58        406
First   Workers       5316         5948         632       5399
Second  Studious      1177         4184         3007      1286
Second  Crammers      318          451          133       391
Second  Workers       1540         4544         3004      4371
Third   Studious      208          301          93        318
Third   Crammers      67           77           10        89
Third   Workers       906          989          83        867

It should be noted that the progression from workers to studious visitors was more obvious with the interval clusters than with the conventional clusters. Interestingly, the second year course had a significantly larger number of workers than studious visitors. This seems counter-intuitive; however, it can be explained by the structure of the web sites. Unlike the two first year courses, the second year course did not post the class-notes on the web: the notes downloaded by these students were usually sample programs that were essential during their laboratory work. Crammers constituted less than 10% of the visits. The experiments used exactly the same setup for all three web sites. The characteristics of the first two sites were similar, while the third web site was somewhat different in terms of site contents, course size, and types of students. The results discussed in this section show many similarities between the interval set clusterings for the three sites, and the differences between the results can be easily explained based on further analysis of the web sites. It is interesting to see that the rough set adaptation of K-means clustering captured the subtle differences between the web sites better than the conventional K-means algorithm.
6 Summary and Conclusions
This paper compares experimental results from conventional and rough set based K-means algorithms. Web visitors of three courses were used in the experiments. It was expected that the visitors would be classified as studious, crammers, or workers. Since some of the visitors may not precisely belong to one of the classes, the clusters were represented as interval sets. The experiments produced meaningful clusterings of web visitors using both the conventional and the rough set based approach. The study of the variables used for clustering made it possible to clearly identify the three clusters as studious, workers, and crammers. There were many similarities and a few differences between the characteristics of the conventional and interval clusters for the three web sites. The interval set representation of clusters made it easier to identify these subtle differences between the three courses than the conventional K-means approach. The classes considered in this study are imprecise. Therefore, the use of
rough sets seems to provide good results. Fuzzy clustering is another alternative for representing these classes of web users. A comparison of conventional and rough K-means with fuzzy C-means clustering is currently underway; results of these additional comparisons will be reported in future publications.
Acknowledgment. The authors would like to thank NSERC for their financial support.
References

1. A. Joshi and R. Krishnapuram, Robust Fuzzy Clustering Methods to Support Web Mining, in: Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD '98, 1998, 15/1–15/8.
2. R. Hathaway and J. Bezdek, Switching Regression Models and Fuzzy Clustering, IEEE Transactions on Fuzzy Systems, vol. 1, no. 3, 1993, 195–204.
3. R. Krishnapuram, H. Frigui, and O. Nasraoui, Fuzzy and Possibilistic Shell Clustering Algorithms and Their Application to Boundary Detection and Surface Approximation, Parts I and II, IEEE Transactions on Fuzzy Systems, vol. 3, no. 1, 1995, 29–60.
4. R. Krishnapuram and J. Keller, A Possibilistic Approach to Clustering, IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, 1993, 98–110.
5. P. Lingras, Unsupervised Rough Set Classification Using GAs, Journal of Intelligent Information Systems, vol. 16, no. 3, 2001, 215–228.
6. P. Lingras, Rough Set Clustering for Web Mining, in: Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, 2002.
7. P. Lingras and C. West, Interval Set Clustering of Web Users with Rough K-means, submitted to Journal of Intelligent Information Systems, 2002.
8. Z. Pawlak, Rough Sets, International Journal of Information and Computer Sciences, vol. 11, 1982, 145–172.
9. A. Skowron and J. Stepaniuk, Information Granules in Distributed Environment, in: S. Ohsuga, N. Zhong, A. Skowron (eds), New Directions in Rough Sets, Data Mining, and Granular-Soft Computing, Lecture Notes in Artificial Intelligence 1711, Springer-Verlag, Tokyo, 1999, 357–365.
10. Y. Y. Yao, X. Li, T. Y. Lin and Q. Liu, Representation and Classification of Rough Set Models, in: Proceedings of the Third International Workshop on Rough Sets and Soft Computing, 1994, 630–637.
11. P. Lingras, M. Hogo and M. Snorek, Interval Set Clustering of Web Users Using Modified Kohonen Self-Organization Maps Based on the Properties of Rough Sets, submitted to Web Intelligence and Agent Systems: An International Journal, 2002.
An Application of Rough Sets to Monk's Problems Solving

Duoqian Miao1 and Lishan Hou2
1 Dpt. of Computer Science, Tongji University; Tongji Branch, National High Performance Computing Center, Shanghai 200092, P. R. China, [email protected]
2 Dpt. of Mathematics, Shanxi University, Taiyuan 030006, P. R. China, [email protected]
Abstract. In this paper, the main techniques of inductive machine learning are connected, from a theoretical point of view, to the knowledge reduction theory based on rough sets. The Monk's problems are then solved again employing rough sets. As far as accuracy and conciseness are concerned, the learning algorithms based on rough sets show remarkable superiority over the previous methods.
1 Introduction
During the 2nd European Summer School on Machine Learning, held at the Corsendonk Priory in Belgium in 1991, many machine learning algorithms popular at that time were discussed extensively. Which algorithm would be optimal? And did there exist some relations among the algorithms? As a consequence of these questions, researchers who attended that conference created three problems (the Monk's problems) [2,4]. They differ in data scale, target concept, and the presence or absence of noise in the training examples. However, the comparison of the various inductive learning techniques was limited to learning results, learning efficiency and so on; the inherent connections among the methods were not explained from a theoretical angle. In this paper, several representative machine learning algorithms are compared with rough sets, and the Monk's problems are analyzed and dealt with again. The comparison, together with further conclusions on the accuracy and conciseness of the rules, is instructive.
2 Analyzing Theoretically
In 1982, Prof. Pawlak put forward the idea that rough sets could be used to study the representation, learning and induction of uncertain, incomplete and imprecise information. For the sake of illustration, let us first review some basic points of rough sets.

Definition 1. Let U be a universe and P, Q be equivalence relations on U. Then POS_P(Q) = ∪ P(X), X ∈ U/Q, where P(X) denotes the lower approximation of X with respect to P, is called the positive region of Q with respect to P in U.
Definition 2. DT = <U, C ∪ D, V, f> is called a decision table, in which C and D are the condition attribute set and the decision attribute set respectively, C ∩ D = ∅, U is the universe, V = ∪ Va (a ∈ C ∪ D), where Va is the value set of a, and f is the information function, f : U × (C ∪ D) → V with f_x(a) = a(x) for every x ∈ U and a ∈ C ∪ D.

Knowledge reduction is one of the most important features of rough sets compared with previous methods. It is especially necessary when processing large-scale data; otherwise the cost brought about by superfluous knowledge is very large in both time and space.

2.1 Relations to ID-Family [6,7]
Different knowledge representations lead to different knowledge reduction algorithms. At present there exist several popular reduction algorithms such as X. H. Hu algorithm, algorithms based on Pawlak’s attribute importance, algorithms based on discernibility matrix and its improvement, algorithms based on information entropy and so on. X. H. Hu algorithm is: Input DT = U, C ∪ D, V, f Output {B ⊆ C|P OSB (D) = P OSC (D)} 1) Compute relative core Co. 2) B ⇐ Co. 3) For any c ∈ C − B, compute Sig (c, B, D). If Sig (c , B, D) = max {Sig (c, B.D)}, then B ⇐ B ∪ {c }. c∈C−B
4) If P OSB (D) = P OSC (D), output B and end; else go to 3). in which: card P OSB∪{c} (D) − card (P OSB (D)) Sig (c, B, D) = card (P OSC (D)) The difference between X.H. Hu algorithm and algorithm based on Pawlak’s attribute importance is the definition of the function Sig (c, B, D). In the latter, card P OSB∪{c} (D) − card (P OSB (D)) Sig (c, B, D) = cardU By computing simply, we can prove that the orders of attribute importance generated from the two algorithms are same. As a result, the two ideas are equivalent. ID3 is one of the most important models in inductive machine learning. It bases mainly on the partition of space, and this partition is limited to the structure of decision tree. According a given strategy, chose an attribute a and part U into Ta1 , · · · , Tan , in which the superscript k of T will be referred to as a label of a in Va . Successively, take Tak as new universe (example sets) and redo the process mentioned above until all the examples belong to the same decision classification. Thus, we get a decision tree.
140
D. Miao and L. Hou
According to rough sets, U can be parted by every attribute, U/a, U/b, · · · a, b · · · ∈ C, and X. H. Hu algorithm just bases on this partition. For the sake of illustration, let us take attribute a as an example. U/a is correspon 1 n k U ding to T a , · ·· , Ta of the decision tree and Ta ∈ /a, k = 1, · · · , n, in which n = card U/a . Therefore, creating a decision is equivalent to partitioning the universe by attributes substantively [5]. So we can say, ID3 is specialization of rough sets under some restrictions, and equivalent of existing heuristic reduction algorithms. On the basis of ID3, some new algorithms suiting to different requests have been developed through using some strategies. For instance, ID3 with windowing, ID5R that is an incremental decision tree learning algorithm, IDL that uses some heuristics to stimulate a bi-directional search for a tree [3]. 2.2
Relations to AQ-Family [4,10]
Discernibility matrix is an important mutation of rough sets. It transformed the reduction problem of a database into the simplification problem of a matrix due to introducing algebra theory. Skowron has proved that the theory of Discernibility matrix is equivalent to rough sets[9]. Let DT = U, C ∪ D, V, f be a decision table, whose corresponding Discernibility matrix is defined as follows: M (DT ) = (cij )n×n ,
n = card (U ) ,
in which, cij =
∅ [xi ]IN D(D) = [xj ]IN D(D) . {a ∈ C|a (xi ) = a (xj )} otherwise
In such a way, study to decision table is shifted to discernibility matrix. For every attribute, higher frequency in matrix, more examples are distinguished by it, more importance. The detailed description of algorithm is omitted here. Comparisons with AQ algorithms are in order. In fact, selecting seed example in AQ is equivalent to selecting a line from the discernibility matrix. According to the constitution of discernibility matrix, if some element of selected column is empty, its corresponding example belongs to positive examples for a consistent table; removing positive examples in AQ is equivalent to removing corresponding rows of discernibility matrix. The evaluate function is due to the frequency of the attribute in the line. It can be proved that it equals to AQ algorithms reducing directly attribute value without dealing with attributes. In the same way, as a basic method, AQ-algorithm lead to many effective algorithms appropriate to specific types of problems by adding hypothesis-driven or combining genetic algorithm[4]. Information entropy is another representation of knowledge. Its idea is almost same with X. H. Hu’s, but the use of probability leads a higher efficiency.
An Application of Rough Sets to Monk’s Problems Solving
141
Table 1. Rules by RS to Monk-1 U 1 2 3 4
3
a1 1 * 2 3
a2 1 * 2 3
a5 * 1 * *
D 1 1 1 1
Rule if a1=1 if a5=1 if a1=2 if a1=3
and a2=1 then d=1 then d=1 and a2=2 then d=1 and a2=3 then d=1
Comparisons and Analysis
Monk’s problems were created specially to test the capability of learning algorithms. And we will make experiments on Monk’s problems with the algorithms mentioned above so as to compare and analyze. 3.1
Monk-1 Problem
In Monk-1, there are 124 training examples, 62 positive and 62 negative and without any noise. Its target concept is “(a1=a2) or (a5=1)”. Employment of Rough Sets. After reducing the attributes, the core is {a1, a2, a5} and the reduct is just the core; Then reduce the superfluous attribute values. The rules which belong to the positive are showed in Table 1: As you know, the value sets of a1 and a2 are both {1, 2, 3}, so we can merge rule 1, 2, 4 together. In other words, the learning result to Monk-1 by rough sets can be described as (a1=a2) or (a5=1), which is coincident entirely with the target concept. It is exciting that the accuracy accounts for 100% at the test sets (216 positive and 216 negative). Employment of AQ-Family. We are going to handle Monk-1 employing AQ17-HCI, AQ15-GA and AQR [2]. AQ17-HCI is a module employed in the AQ17 attribute based multi-strategy constructive learning system. This model implements a new iterative constructive induction capability in which how attributes are generated based on the analysis of the hypotheses produced in the previous iteration. Rule is (pos16=false) in which pos16 is attribute constructed from the original ones, or intermediate ones, as defined below: c01:: (a1=1) & (a2=2,3) & (a51) c05:: (a1=2) & (a2=1,3) & (a51) c08:: (a1=3) & (a2=1,2) & (a51) c10:: (a1=1) & (a2=1) c12:: (a5=1) c13:: (a1=2) & (a2=2) c15:: (a1=3) & (a2=3) pos:: (c10=false) & (c12=false) & (c13=false) & (c15=false) neg:: (c01=false) & (c05=false) & (c08=false)
142
D. Miao and L. Hou Table 2. Results by ID-family to Monk-1
ID3 ID3 no windowing ID5R IDL
nodes leaves accuracy 13 28 98.6% 32 62 83.2% 34 52 79.7% 36 26 97.2%
Size embodies the conciseness of rules; Nonempty leaf includes the classification information, thus the number of nonempty leaves equals to that of rules.
Actually, (pos16=false) means any one of (c10=false), (c12=false), (c13=false) and (c15=false) is valid. It is coincident with the target concept and the accuracy is also 100%. But it is easy to see that the rule of AQ17-HCI is quite obscure and can’t be used to make decisions directly. AQ15-GA is the fusion of all subsets of a given attribute set. Each of the selected attribute subsets is evaluated by invoking AQ15 and measuring the recognition rate of the rules produced. The approach traverses the whole space of subset. Huge cost of computing brings about excellent results and its accuracy is 100% too. The AQR algorithm is an implementation of the AQ-family. It produces a rule for each decision class. Monk-1 is a two-class problem, so learning rule is below: (a2=1) & (a1=1) or (a5=1) or (a3=1) & (a2=2) & (a1=2) or (a2=2) & (a1=2) or (a6=1) & (a2=3) & (a1=3) or (a6=2) & (a1=3) & (a2=3). class ‘1’. This rule includes 5 attributes, a1, a2, a3, a5 and a6. But the reduct only includes 3 attributes a1, a2 and a5 according to rough sets, that is, 3 attributes are necessary to keep the capacity of classifying the data. In the rule of the AQR algorithm, there are two irrelevant attributes, so the rule maybe either contain superfluous information or describe the concept too strictly. All these degrade the ability of generation. In this approach, the accuracy is 95.9% and lower than that of the two noted above.
Employment of ID-Family. ID-family is a series of algorithms derived from decision trees via introducing some strategies. Its rules are limited to the structure of the corresponding decision tree[2]. It seems like that ID algorithms describe rules intuitionally, but their accuracy is not satisfying. It is quite obvious that the rule numbers of ID-family algorithms are much greater than that rough sets. Its accuracy is relatively low, see Table 2. Remark. We learned rules on Monk-1 using three groups of algorithms. The knowledge reduction of rough sets is very satisfying. Its rule numbers is fewer and accuracy is high. We can say we gained all the knowledge comprised in Monk-1.
An Application of Rough Sets to Monk’s Problems Solving
143
Table 3. Results by AQ-group to Monk-2
Accuracy
AQ17-HCL AQ17-FCLS AQ15-GA AQR 93.1% 92.6% 86.8% 79.7%
Table 4. Results by ID-family to Monk-2 nodes ID3 66 ID3 no windowing 64 ID5R 64 IDL 170
3.2
leaves 110 110 99 107
accuracy 67.9% 69.1% 69.2% 66.2%
Monk-2 Problem
In Monk-2, there are 169 training examples, 64 positive and 105 negative. Again, there is no noise. Its target concept is: “exactly two of (a1=1), (a2=1), (a3=1), (a4=1), (a5=1) and (a6=1) are valid”, which is very complex. After the attribute reduction, the core is {a1, a2, a3, a4, a5, a6}, and the reduct is just the core. Then reduce the superfluous attribute values. The disposal process is similar to that of Monk-1 and 5 rules are produced. The knowledge we obtained is so complicated and different from the target concept in a certain extent. The reason maybe be that too many negative examples exist in Monk-2. The accuracy is only 75%. Thus it can be seen that the reduction algorithms on the basis of rough sets need not special but general examples. Although the new attributes in AQ17-HCI is intricate and AQ17-FCLS summarizes 18 complicated rules, anyway, the accuracies of AQ-family on Monk-2 is high, showed as Table 3: Results by ID-family are showed in Table 4: Obviously, the number of the rules is relatively great and the accuracy is quite low. The algorithm is inferior to others on this problem. 3.3
Monk-3 Problem
In monk-3, there are 122 training examples, 60 negative and 62 positive. The number of negative examples is less than that of positive of examples. There are 5% misclassifications, i.e. noise in the training set. Its target concept is “(a5=3 and a4=1) or (a54 and a23)” which can be decomposed to 7 rules. According to rough sets, the reduct is {a1, a2, a4, a5}, which has more attributes than target concepts by one. The noise led to this. We get 23 rules in all, in which the values of a1 are well distributed, so we can draw the rule as a5=3 & a4=1 or (a54 & a23) or (a5=4 & a4=1) by neglecting a1, which is similar to the target concept. Rough sets is adaptive and rectifiable to some degree. But there is a intersection between the first two rules. We can’t tell the class of the examples belonging to the intersection. For
144
D. Miao and L. Hou Table 5. Results by AQ-group to Monk-3
Accuracy
AQ17-HCI AQ17-FCLS AQ15-GA AQR 100% 97.2% 100% 87.0%
Table 6. Results by ID-family to Monk-3
ID3 ID3 no windowing ID5R
nodes leaves accuracy 13 29 94.4% 14 31 95.6% 14 28 95.2%
example, we don’t know the classes of the examples satisfying (a2=3 & a4=1 & a5=3). The reason is the data set of Monk-3 is inconsistent and with noise. Inconsistent data maybe lead to inconsistent rules. In the test set, there are 12 examples we can’t tell the classes and 4 examples we misclassify. The accuracy is 96.4%, which is acceptable. For comparison let us list the results of AQ-family and ID-family by Table 5 and Table 6. On Monk-3, the gaps among the accuracies of the three groups of algorithms are not very large, but the rules of rough sets are very concise.
4
Conclusions
In this paper, the inherent connections among the reduction theory of rough sets and several typical machine learning algorithms are narrated in detail; some experiments and comparisons are made on Monk’s problems. Monk’s problems are specially created for testing the quality of learning methods. So the conclusions on them are believable and valuable. In Rough Sets, reducing superfluous attributes values can revise the results of the former, for instance, the disposal to a1 in Monk-3. This is another important feature of rough sets. AQ-family and ID-family are typical algorithms in traditional machine learning. The adaptive AQ-family can be used to different data sets and have good quality, but their rules are complicated and not accessible. ID-family are simple and easily available, but their results are inferior to AQ-family and rough sets. Acknowledgements. This research has been supported by grant No. 60175016 from the National Natural Science Foundation of China and grant No. 2001001 from the Youth Natural Science Foundation of Shanxi.
An Application of Rough Sets to Monk’s Problems Solving
145
References [1] Z. Pawlak. Rough Sets, Theoretical Aspects of Reasoning about Data. Warsaw, Poland, 1990. [2] The MONK’s Problems, A Performance Comparison of Different Learning Algorithms. 1991. [3] Cestnik, B., Bratho. I. On Estimating Probabilities in Tree Pruning. Proc. of EWSL 91, Porto, Portugal, March 6–8, 1991. [4] Quinlan, J.R. Learning Logical Definitions from Relations. Machine Learning 5(3), 239–266. [5] J. Wang. Contributions on Rough Set Theory to Inductive Machine Learning. Computer Science 2001 285: 5–7. [6] D.Q. Miao, G.R. Hu. A heuristic algorithm of knowledge reduction. Computer Research and Development, 366: 681–684. [7] D.Q. Miao, J. Wang. An Information Representation of Concepts and Operations in Rough Sets. Journal of Software, 1999, 102: 113–116. [8] Y. Yao, T.Y. Lin. A Review of Rough Set Methods. Rough Set and Data Mining. Kluwer Academic Publishers, 1997, 47–71. [9] Polkowski L., Skowron A. Rough Sets: Perspectives, Rough Sets in Knowledge Discovering. In: Polkowski L, Skowron A, eds, Physica-Verlag, 1998, 1–27. [10] D.Q. Miao, Lishan Hou. A Heuristic Algorithm foy Reduction of Knowledge Based on Discernibility Matrix. International Conference on Intelligent Information Technology, 2002, Beijing, 276–279.
Pre-topologies and Dynamic Spaces Piero Pagliani Research Group on Knowledge and Communication Models Via Imperia, 6. 00161 Rome, Italy [email protected]
Abstract. Approximation Spaces were introduced in order to analyse data on the basis of Indiscernibility Spaces, that is, spaces of the form U, E, where U is the universe of data and E is an equivalence relation on U . Various authors suggested considering spaces of the form U, R, where R is any relation. This paper aims at introducing a further step consisting in spaces of the form U, {R}i∈I , where {R}i∈I is a family of relations on U , that we call “Dynamic Spaces”, because they make it possible to account for different forms of dynamics. While Indiscernibility Spaces induce 0-dimensional topological spaces (Approximation Spaces), Dynamic Spaces induce various types of pre-topological spaces.
1
Introduction
Usually “data” are interesting to the extent they are organised in conceptually meaningful patterns. Approximation Spaces were introduced in order to organize data on the basis of Indiscernibility Spaces, that is, spaces of the form U, E, where U is the universe of data and E is an equivalence relation (see [6]). Various authors developed Pawlak’s suggestion as to generalise this picture by considering families of Indiscernibility Spaces (see Skowron and Stepaniuk’s works, for instance [7], and Pagliani’s [4]). Lin and Yao (see [8] and [3]) consider spaces of the form U, R, where R is any relation, by exploiting the notion of an R-neighbourhood (if x ∈ U , then the R-neighbourhood of x, R(x), is {x ∈ U :< x, x >∈ R}). This paper aims at suggesting a further generalisation consisting in spaces of the form U, {R}i∈I , where {R}i∈I is a family of relations on U . We shall call these spaces Dynamic Spaces, because by means of this generalisation we can account for some form of dynamics, as we are going to explain. A dynamic analysis can be required by two basic situations and a mixed one: (a) (synchronic dynamics) at a given point in time the behaviour of the elements of U may vary along different contexts; (b) (diachronic dynamics) given a context, their behaviour may vary over time; (c) a third situation is given mixing the previous two. Classical Rough Set Theory does not account for this kind of dynamic phenomena. Indeed, as far as we are confined to a single Information System, we can deal just with a picture taken at a particular point in time and at a particular point in space. In all these cases we need families of relations on a set U , henceforth Dynamic Spaces. Whereas Indiscernibility Spaces induce 0-dimensional topological spaces, under particular conditions Dynamic Spaces G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 146–155, 2003. c Springer-Verlag Berlin Heidelberg 2003
Pre-topologies and Dynamic Spaces
147
induce pre-topological spaces (cf. [1]). Here we shall start a systematic investigation of the formal relationships between Dynamic Spaces, the R-neighbourhood approach and pre-topological spaces. Clearly, Indiscernibility Spaces are, under this respect, particular instances of Dynamic Spaces. Notice: all the proofs, omitted for space limitation, can be found in [5].
2
Pre-topological Spaces
Let U be a set. A generalised nearness relation associate any element p of U with one or more elements of ℘(U ), obtaining a family of subsets of U associating, intuitively, the elements of U that are close to p under a specific respect. This intuition is packed in the following definition: Definition 1. Let U be a set, X ⊆ U and p ∈ U . Then, 1. a neighbourhood map is a total function n : U −→ ℘(℘(U )). The image n(p) of p along n will be usually denoted by Np and called a neighbourhood family of p; 2. (i) the family N (U ) = {Nx : x ∈ U } is called a neighbourhood system; (ii) the pair U, N (U ) is called a neighbourhood space; 3. if G(X) = {x : X ∈ Nx }, then G is called the core map induced by N (U ); 4. if F (X) = −G(−X) = {x : −X ∈ / Nx }, then F is called the vicinity map induced by N (U ). A vicinity map (a core map) is a generalisation of the usual notion of a topological closure map (interior map). The main difference, intuitively, is that vicinity maps reflect the notion of “x is close to a set X” under one ore more than one possible point of view, while topological closure operators account for single cumulative points of view, by gluing all the elements of Nx through the imposition for Nx to be a filter. Consider the following conditions on N (U ). For any x ∈ U , 1. U ∈ Nx ; 0. ∅ ∈ / Nx ; Id. for any A ⊆ U , if x ∈ G(A) then G(A) ∈ Nx ; N1. x ∈ N for all N ∈ Nx ; N2. if N ∈ Nx and N ⊆ N , then N ∈ Nx ; N3. if N, N ∈ Nx , then N ∩ N ∈ Nx ; N4. there is an N = ∅ such that Nx =↑ N . Lemma 1. Let < U, N (U ) > be a neighbourhood space. Then, for any X, Y ⊆ U the following equivalences hold: Condition 1 0 Id N1 N2 N3
Equivalent properties of G Equivalent properties of F G(U ) = U (normality) F (∅) = ∅ (normality) G(∅) = ∅ (co-normality) F (U ) = U (co-normality) G(X) ⊆ G(G(X)) F (F (X)) ⊆ F (X) G(X) ⊆ X (deflation) X ⊆ F (X) (inflation) X ⊆ Y ⇒ G(X) ⊆ G(Y ) (isoton.) X ⊆ Y ⇒ F (X) ⊆ F (Y ) G(X ∩ Y ) ⊆ G(X) ∩ G(Y ) F (X ∪ Y ) ⊇ F (X) ∪ F (Y ) G(X ∩ Y ) ⊇ G(X) ∩ G(Y ) F (X ∪ Y ) ⊆ F (X) ∪ F (Y )
148
P. Pagliani
Conditions 1, 0, Id, N1, N2 and N3 characterise topological spaces, which are also characterised by the following condition (τ ): if N ∈ Nx then ∃N ∈ Nx such that for all y ∈ N , N ∈ Ny . In view of Lemma 1, (τ ) is connected with idempotency of G and F and with N1, N2 and Id. Proposition 1. Let N (U ) be a neighbourhood system. Then, (1) if N (U ) satisfies N1, then, G and F are idempotent if and only if Id holds; (2) (i) if N (U ) satisfies Id, it satisfies (τ ). (ii) If N (U ) satisfies N2 and (τ ), it satisfies Id. However in general G and F are not required to be idempotent. Intuitively, the lack of idempotency reflects a sort of flowing situation in which boundaries are not fixed once for all. Now we name some combinations of the above properties. From now on by the symbol B we shall intend the set of properties 0, 1, N1. If all the elements of a neighbourhood system N (U ) satisfy B, then N (U ) is said to be of type N1 . B+N2 give N2 systems. B+N2+N3 give N3 systems, while N3 systems augmented with N4 will be said to be of type N4 . If (τ ) is added, we obtain systems of type N1Id , N2Id , N3Id and, respectively, N4Id . It is immediate to verify that in order to fulfil N2, a neighbourhood family must be an order filter w. r. t. ⊆, while it must be a filter to fulfil also N3. In what follows we shall investigate the basic relationships between these properties and Dynamic Spaces. Definition 2. (i) A pre-topological space is a triple U, ε, κ such that: (a) U is a set, (b) ε : ℘(U ) −→ ℘(U ) is an increasing map such ε(∅) = ∅, (c) κ : ℘(U ) −→ ℘(U ) is a decreasing map dual of ε. (ii) For any x ∈ U the family κx = {Z ⊆ U : x ∈ κ(Z)} is called the family of κ-neighbourhoods of x. We shall call ε an “expansion operator” and κ a “contraction operator”. Clearly, if f is an expansion operator and for any X ⊆ U, g(X) = −f (−X), then g is a contraction operator, and dually. Moreover, if U, ε, κ is a pre-topological space, then κ(U ) = U . Indeed the core map (the vicinity map) induced by a neighbourhood system of type (at least) N1 is a contraction operator (is an expansion operator). Proposition 2. Let U, ε, κ be a pre-topological space induced by a neighbourhood system N (U ) of type N1 . Then: (1) the family N κ (U ) = {κx }x∈U is a neighbourhood system of type N1 ; (2) κ is the core map induced by N κ (U ). (3) N κ (U ) = N (U ). Given a pre-topological space P = U, ε, κ, the family Ωκ (U ) = {κ(A)}A⊆U will be called the pre-topology of U . By extension, by the symbol P we shall also intend the pre-topology Ωκ (U ). If P1 = U, ε , κ and P2 = U, ε , κ are two pre-topological spaces on the same universe U , then we say that (the pre-topology of) P1 is finer than (the pre-topology of) P2 , in symbols P2 P1 , if for any X ⊆ U , κ (X) ⊆ κ (X). One can prove that P2 P1 if and only if κx ⊆ κx , any x ∈ U .
Pre-topologies and Dynamic Spaces
149
The following facts are directly derivable from Lemma 1 and Proposition 2 (see also [1]). N κ (U ) U, ε, κ is iff U, ε, κ satisfies: is of type: said of type: N1Id κ(κ(X)) = κ(X) [ε(ε(X)) = ε(X)] VId N2 X ⊆ Y ⇒ κ(X) ⊆ κ(Y ) [ε(X) ⊆ ε(Y )] VI N3 ε(X ∪ Y ) = ε(X) ∪ ε(Y ) [κ(X ∩ Y ) = κ(X) ∩ κ(Y )] VD N2Id ε [κ] is a closure [interior] operator VCl N4 ε(X) = x∈X ε({x}) VS N3Id , N4Id ε [κ] is a topological closure [interior] operator topological
Facts: If U, ε, κ is of type VI , then for any X ⊆ U , ε(X) = {x ∈ U : ∃N (N ∈ Nx & N ⊆ X)} and κ(X) = {x ∈ U : ∀N (N ∈ Nx ⇒ N ∩ X = ∅)}. So, in case of neighbourhood systems of type N2 the vicinity map (the expansion operator) is defined in the usual topological way: a point x is close to a set X if and only if all the elements of Nx have non null intersection with X. If U is finite, then the notions of VS and VD pre-topological spaces coincide.
3
Pre-topological Spaces and Dynamic Spaces
In general the notion of an expansion (contraction) operator cannot be immediately linked to that of an R-neighbourhood. In fact, X ⊆ R(X) is valid only if R is reflexive. In turn, both ε and κ lack the isotonicity law which, on the contrary, is valid for R-neighbourhoods. Finally, differently from R-neighbourhoods, neither ε nor κ are required to distribute over disjunctions or conjunctions. Therefore R-neighbourhood systems and neighbourhood systems of type N1 are not twin notions. However, reflexivity may be assumed on a quite intuitive base (a “point” x should be near to itself). As for distribution, we can have different attitudes. Actually, two points taken together could carry more information than the sum of the two pieces of information carried by them singularly taken. However, Rneighbourhood formation is additive ad if we want to represent compositional phenomena we need pre-topological spaces of type VD . 3.1
Some Technical Backgrounds
If X is a family of subsets of a given set U , then by ⇑ X we shall denote the set {Y ⊆ U : ∃X(X ∈ X & X ⊆ Y )}. Definition 3. Let U be a set, F, F1 and F2 order filters or filters of elements of ℘(U ). Moreover let B be a family of subsets of U . Then: (1) If F =⇑ B, then B is called a basis system for F and we say that B induces F. (2) If F1 ⊆ F2 , then F2 is said to be finer than F1 . Clearly, if for B1 , B2 ⊆℘(U ), we have F1 =⇑ B1 , F2 =⇑ B2 and B2 ⊆ B1 , then F1 is finer than F2 . The converse usually does not hold.
150
P. Pagliani
Corollary 1. Let F, B ⊆ ℘(U ). Let F be a filter such that F =⇑ B. Then (1) F = B ∈ F. In view of the above Definition, if we are given a family of filters induced by a family of basis systems, in order to compute ε(X) and κ(X) it is sufficient to consider the basis systems: Proposition 3. (see [1]) Let U be a set. Let N (U ) be a neighbourhood system of type N2 or N3 and N (U ) a neighbourhood system of type N4 . Assume that, for any x, Nx = ⇑ Bx for some Bx ⊆ ℘(U ) and Nx = ⇑ {Qx } for some Qx ⊆ U . Then for any X ⊆ U the following equations hold: (i) {x ∈ U : ∃N (N ∈ Nx & N ⊆ X)} = {x ∈ U : ∃A(A ∈ Bx & A ⊆ X)}, (ii) {x ∈ U : ∀N (N ∈ Nx ⇒ N ∩ X = ∅) = {x ∈ U : ∀A(A ∈ Bx ⇒ A ∩ X = ∅)}, (iii) {x ∈ U : ∀N (N ∈ Nx ⇒ N ∩ X = ∅)} = {x ∈ U : Qx ∩ X = ∅}. Definition 4. Let N (U ) be a neighbourhood system such that N (U ) = {⇑ Bx }x∈U . If a pre-topological space P is induced by N (U ), then we say that P is induced by S = {Bx }x∈U , too, and that S is a basis for P (is a basis for N (U )). In this case to define κ and ε we shall also use the right side of the equations (i) and, respectively, (ii) of Proposition 3 above. 3.2
Pre-topological Spaces from Dynamic Spaces
Suppose we are given a Dynamic Spaces U, {Ri }1≤i≤n . Along the interpretation sketched in the Introduction, we can develop interesting information analyses. We can basically list n × 2 basic “use cases” of the information carried by this Dynamic Spaces: n use cases involving the expansion process, and n involving the contraction process: (Expansion): We say that x ∈ εm (A), for 1 ≤ m ≤ n, if A contains at least a y such that < x, y >∈ Ri in at least m cases. Otherwise stated: x ∈ εm (A) if R1≤i≤n (x) ∩ A = ∅ for at least m indices. So, for instance, assume n = 3, then x ∈ ε1 (A) if R1 (x) ∩ A = ∅, or R2 (x) ∩ A = ∅, or R3 (x) ∩ A = ∅ (i. e. (R1 (x) ∪ R2 (x) ∪ R3 (x)) ∩ A = ∅). (Contraction): We say that x ∈ κm (A), for 1 ≤ m ≤ n, if every y such that < x, y >∈ Ri is included in A, at least in m cases. Otherwise stated: x ∈ κm (A) if R1≤i≤n (x) ⊆ A for at least m indeces. So, for instance, assume n = 3, then x ∈ κ2 (A) if R1 (x) ⊆ A and R2 (x) ⊆ A, or if R1 (x) ⊆ A and R3 (x) ⊆ A, or if R2 (x) ⊆ A and R3 (x) ⊆ A (i. e. R1 (x) ∪ R2 (x) ⊆ A, or R1 (x) ∪ R3 (x) ⊆ A, or R2 (x) ∪ R3 (x) ⊆ A). According to these use cases, we can compute the families of expansion and contraction operators, ε1≤m≤n and κ1≤m≤n , by transforming the various Ri −neighbourhoods into appropriate bases and applying eventually Proposition 3: Definition 5. Let U be a set and R = {Ri }1≤i≤n a system of n binary relations on U . For 1 ≤ m ≤ n, let Γ be the family of combinations of n elements taken
Pre-topologies and Dynamic Spaces
151
m at a time, γ a combination from Γ . Then let us set: (1) εm : ℘(U ) −→ ℘(U ); εm (A) = {x ∈ U : ∀F (F ∈ Fxm ⇒ F ∩ A = ∅)}; m (2) κ : ℘(U ) −→ ℘(U ); κm (A) = {x ∈ U : ∃F (F ∈ Fxm & F ⊆ A)}, m where:(a) Fx is the (order) filter induced by the basis Bxm , (b) Bxm = {Xγ : Xγ = l∈γ Rl (x)}. Let us apply all the above definitions to the following Dynamic Space {a, b, c} , {R1 , R2 , R3 }: R1 a b c
a 1 0 0
b 1 1 0
c 0 0 1
R2 a b c
a 1 0 0
b 0 1 0
c 1 0 1
R3 a b c
a 1 1 0
b 0 1 0
c 1 0 1
In view of the above definitions we can compute the basis B m (U ) = {Bxm }x∈U , the induced neighbourhood system F m (U ) = {Fxm }x∈U and, finally, the operator ε, for 1 ≤ m ≤ 3: m m Bx Ba Bbm 1 Bx = {R1 (x), R2 (x), R3 (x)} {{a, b}, {a, c}} {{b}, {a, b}} 2 Bx = {R1 (x) ∪ R2 (x), R1 (x) ∪ R3 (x), R2 (x) ∪ R3 (x)} {{a, c}, {a, b, c}} {{b}, {a, b}} 3 Bx = {R1 (x) ∪ R2 (x) ∪ R3 (x)} {{a, b, c}} {{a, b}}
Bcm {{c}} {{c}} {{c}}
Fxm Fam Fbm Fcm Fx1 {{a, b}, {a, c}, {a, b, c}} {{b}, {a, b}, {b, c}, {a, b, c}} {{c}, {a, c}, {b, c}, {a, b, c}} Fx2 {{a, c}, {a, b, c}} {{b}, {a, b}, {b, c}{a, b, c}} {{c}, {a, c}, {b, c}, {a, b, c}} Fx3 {{a, b, c}} {{a, b}, {a, b, c}} {{c}, {a, c}, {b, c}, {a, b, c}} εm ε1 ε2 ε3
∅ ∅ ∅ ∅
{a} {a} {a} {a, b}
{b} {b} {b} {a, b}
{c} {c} {a, c} {a, c}
{a, b} {a, b} {a, b} {a, b}
{a, c} {a, c} {a, c} {a, b, c}
{b, c} {a, b, c} {a, b, c} {a, b, c}
{a, b, c} {a, b, c} {a, b, c} {a, b, c}
We can notice, for instance, that Fa1 is not a filter, because {a, b} ∩ {a, c} ∈ / Hence the pre-topological space < U, ε1 , κ1 > is not of type VD . Also, we can directly observe the relationship between ε and κ distributivity and filters. Indeed, since Fa1 is not a filter, there are two minimal distinct elements A = {a, b} and B = {a, c} of Fa1 such that a ∈ A and a ∈ B. Let us set Y = B ∩ −A = {b}, Z = A ∩ −B = {c}. Therefore, the subset Y ∪ Z = {b, c} has empty intersection neither with A nor B; hence Y ∪ Z has empty intersections with no members of Fa1 , because A and B are minimal. It follows that a belongs to ε(Y ∪ Z). But a ∈ / ε(Y ) and a ∈ / ε(Z). Henceforth ε(Y ) ∪ ε(Z) ⊂ ε(Y ∪ Z). Dually for κcodiscontinuity: indeed, notice that κ({a, b}) ∩ κ({a, c}) = {a, b} ∩ {a, c} = {a}. But κ({a, b}) ∩ κ({a, c}) = κ({a}) = ∅. Fa1 .
3.3
Associating Dynamic Spaces and Pre-topological Spaces
After recalling that pre-topologies can be associated with relations that are at least reflexive, let us now formalise the construction so far discussed.
152
P. Pagliani
Definition 6. Let U be a set and let R = {Ri }i∈I be a system of reflexive relations on U . Then the pre-topological space induced by the basis B m (U ) = {Bxm }x∈U is said to be m−associated with the system R and denoted by Pm (R) = U, εm , κm . In particular if R = {R}, then the pre-topological space induced by the basis {R(x)}x∈U is said to be associated with the relation R and denoted by P(R) = U, εR , κR . Proposition 4. Let P1 (R) = U, ε1 , κ1 be a pre-topological space 1-associated with a system of relations R = {Ri }i∈I . Then; (1) B1 (U ) = {Bx1 }x∈U = {⇑ 1 1 {R i∈I,x∈U ; (2) κ (X) = {x : ∃Ri (Ri ∈ R & Ri (x) ⊆ X)}; (3) ε (X) = i (x)}} 1 i∈I Ri (X); (4) P (R) is of type VI . Thus, spaces of type VI arise as particular syntheses of Dynamic Spaces of reflexive relations. Now we associate Dynamic Spaces with an important class of pretopological spaces. A pre-topological space P = U, ε, κ is said to be of type VCl , or a closure system, if the operator ε is a closure operator (inflationary, isotonic and idempotent) or, dually, if κ is an interior operator. These kind of systems are induced by neighbourhood systems of type N2Id . About the connections between closure pre-topological spaces and Dynamic Spaces we now prove the following Theorem 1. Let U, {Ri }i∈I be such that R = {Ri }i∈I is a family of preorder relations on U . Then: (1) the pre-topological space P1 (R) = U, ε1 , κ1 is of type VCl ; (2) the family {⇑ Bx1 }x∈U is a neighbourhood system of type N2Id . The converse of the above Theorem does not hold. So, we have found that the important mathematical notion of a closure (interior) operator is connected, in a particular direction, with Dynamic Spaces where all the relations feature a specific property. When we move to single relation based pre-topologies we have the following first set of results: Corollary 2. Let P(R) = U, εR , κR be a pre-topological space associated with a reflexive binary relation R. Then for any X ⊆ U , (1) P(R) is of type VS ; (2) κR (X) = {x : R(x) ⊆ X}; (3) κR (X) = {Y : R(Y ) ⊆ X}; (4) εR (X) = R (X); (5) κR x =⇑ {R(x)}, for all x ∈ U . Dually, given a pre-topological space, we can define a reflexive binary relation associated with it: Proposition 5. (T-association) Let P = U, ε, κ be a pre-topological space. Let us set: x, y ∈ R iff y ∈ {X : X ∈ κx }. Then R is a reflexive binary relation on U . If the above condition holds, we shall say that R is T-associated with the pretopology P and denote it by RT (P) or RT whenever the pretopological space P is understood. In general, given a pre-topological space P, it is possible to link it to a pretopological space that is T-associated with a reflexive binary relation, in a unique way.
Pre-topologies and Dynamic Spaces
153
Proposition 6. Let P = U, ε, κ be a pre-topological space. Then, (1) P P(RT (P)); (2) If P is of type VS , then P = P(RT (P)); (3) If P is associated with a relation R, then R is T-associated with the pre-topological space P, that is, R = RT (P(R)). It is possible to show that if P is of type VI , then P(RT (P)) is the coarsest pre-topology among those of type VS that are finer than P. We can easily associate a pre-topological space with a tolerance (i. e. reflexive and symmetric) relation: Proposition 7. (B-association) Let P = U, ε, κ bea pre-topological space. Let us set: x, y ∈ R iff y ∈ {X : X ∈ κx } ⇒ x ∈ {Y : Y ∈ κy }. Then R is a reflexive and symmetric relation. If the above condition holds, we shall say, that R is B-associated with the pre-topological space P and denote it by RB (P). From Proposition 6 we know that P P(RT (P)) and RT (P (R)) = R. So we wonder what P(RB (P)) and RB (P(R)) are. We shall give the answer as a corollary of the following more general statement about Dynamic Spaces. Proposition 8. Let U be a set and {Rj }1≤j≤n a family of n reflexive binary relations on U . Let 1 ≤ m ≤ n and let Pm = U, εm , κm , with the operators m ∗ εm and κ defined by Definition 5. Moreover let us set R = 1≤j≤n Rj and R∗ = 1≤j≤n Rj . Then: (1) RT (Pn ) = R∗ ; (2) RT (P1 ) = R∗ ; (3) RB (Pn ) is the largest tolerance relation included in R∗ ; (4) RB (P1 ) is the largest tolerance relation included in R∗ . Corollary 3. For any family of n reflexive binary relations, P(RT (Pn )) P(RB (Pn )). Therefore, trivially, if we are given just one reflexive binary relation R, then RB (P(R)) ⊆ R, since RB (P(R)) is the largest tolerance relation included in R, while if we are given a pre-topological space P, then P(RB (P)) is the pre-topological space associated with the largest tolerance relation included in RT (P). It follows that P P(RB (P)) (the equality is not uniformly valid even if P is of type VS ; in fact in this case we have, generally, P = P(RT (P)) P(RB (P))). Corollary 4. (see also [1]) Let P = U, ε, κ be a pre-topological space. Then P(RB (P)) is the coarsest pre-topology among the pre-topological spaces finer then P and associated with a tolerance relation. Corollary 5. Let U be a set and {Rj }1≤j≤n a system of n reflexive binary relations on U , such that R∗ (such that R∗ ) is transitive. Then, P(RB (Pn )) (P(RB (P1 ))) is the Approximation Space induced by the largest tolerance relation included in R∗ (included in R∗ ).
154
P. Pagliani
Particularly, if RT (P) is a binary transitive relation on U , then P(RB (P)) is the Approximation Space induced by the largest tolerance relation included in RT (P), so that we can say that P(RB (P)) is the coarsest Approximation Space finer than the pre-topological space P. 3.4
Topological Spaces and Binary Relations
We have seen how Dynamic Spaces are associated with closure systems. Now we have to understand what happens in case of a topological space. First of all, let us define topological spaces and understand the basic differences between them and their closest pretopological relatives. Definition 7. A pre-topological space U, ε, κ of type VD is a topological space if for any x, κx satisfies property (τ ). Theorem 2. Let P = U, εR , κR be a pre-topological space associated with a reflexive binary relation R ⊆ U × U . Then P is a topological space if and only if R is transitive. Conversely: Theorem 3. Let P = U, ε, κ be a pre-topological space of type VS . Then RT (P) is a preorder if and only if P is a topological space. The above results are the core of well-known facts, such as the duality between certain Kripke models and topological spaces. Proposition 9. Let P = U, εR , κR be a topological space associated with a relation R. Then, for any X ⊆ U , εR (X) = {R (Z) : X ⊆ R (Z)}. Notice that Proposition 9 is not that trivial. Indeed, if isotonicity or idempotency fails, then this result does not hold any longer. Moreover, it is now obvious that if R is an equivalence relation, then, in view of Proposition 9 and Corollary 2.(3), εR and κR are the upper and, respectively, lower approximations induced by the Indiscernibility Space U, R. We can restate the property (τ ) in terms of idempotency of εR and κR : Corollary 6. Let P = U, εR , κR be a pre-topological space associated with a reflexive binary relation R ⊆ U × U . Then εR and κR are idempotent if and only if R is transitive.
References 1. Belmandt, Z.: Manuel de pr´etopologie et ses applications. Hermes, 1993. 2. Cech E.: Topological Spaces. John Wiley and Sons, 1966. 3. Lin T. Y.: Granular Computing on Binary Relations. I and II. In Polkowski L. & Skowron A. (Eds.) Rough Sets in Knowledge Discovery. 1: Methodology and Applications, Physica-Verlag, 1988, pp. 107–121 and 122–140.
Pre-topologies and Dynamic Spaces
155
4. Pagliani P.: Modalizing Relations by means of Relations. In Proceeding of IPMU ’98, July, 6–10, 1998. “La Sorbonne”, Paris, France, pp. 1175–1182. 5. Pagliani P.: Concrete neighbourhood systems and formal pretopological spaces. (Draft, 2002) 6. Pawlak, Z.: Rough sets. Theoretical Aspects of Reasoning about Data. Kluwer Acad. Publ., 1991. 7. Skowron A. & Stepaniuk J.: Approximation of Relations. In W. Ziarko (ed.): Rough Sets, Fuzzy Sets and Knowledge Discovery, Springer-Verlag, 1994, pp. 161–166. 8. Yao Y. Y. & Lin T. Y.: Graded Rough Set Approximations based on Nested Neighborhood Systems. In Proceeding of EUFIT ’97. Vol 1. ELIT-Foundation, Aachen, 1997, pp. 196–200.
Rough Sets and Gradual Decision Rules 1
2
Salvatore Greco , Masahiro Inuiguchi , and Roman 6áRZL VNL
3
1
2
Faculty of Economics, University of Catania, Corso Italia, 55, 95129 Catania, Italy; [email protected] Graduate School of Engineering, Osaka University, 2-1, Yamadaoka, Suita, Osaka 565-0871, Japan; [email protected] 3 Institute of Computing Science, Poznan University of Technology, 60-965 Poznan, Poland; [email protected]
Abstract. We propose a new fuzzy rough set approach which, differently from all known fuzzy set extensions of rough set theory, does not use any fuzzy logical connectives (t-norm, t-conorm, fuzzy implication). As there is no rationale for a particular choice of these connectives, avoiding this choice permits to reduce the part of arbitrary in the fuzzy rough approximation. Another advantage of the new approach is that it is based on the ordinal property of fuzzy membership degrees only. The concepts of fuzzy lower and upper approximations are thus proposed, creating a base for induction of fuzzy decision rules having syntax and semantics of gradual rules. Keywords: Rough sets, Fuzzy sets, Decision rules, Gradual rules, Credibility
1 Introduction It has been acknowledged by different studies that fuzzy set theory and rough set theory are complementary because of handling different kinds of uncertainty. Fuzzy sets deal with possibilistic uncertainty, connected with imprecision of states, perceptions and preferences [5]. Rough sets deal, in turn, with uncertainty following from ambiguity of information [15]. The two types of uncertainty can be encountered together in real-life problems. For this reason, many approaches have been proposed to combine fuzzy sets with rough sets (see for example [1, 4, 7, 8, 9, 13, 14, 16, 17, 18, 19]). The main preoccupation in almost all these studies was related to a fuzzy extension of Pawlak’s definition of lower and upper approximations using fuzzy connectives (t-norm, t-conorm, fuzzy implication). In fact, there is no rule for the choice of the “right” connective, so this choice is always arbitrary to some extent. Another drawback of fuzzy extensions of rough sets involving fuzzy connectives is that they are based on cardinal properties of membership degrees. In consequence, the result of these extensions is sensitive to order preserving transformation of membership degrees. The doubt about the cardinal content of the fuzzy membership degree shows the need for methodologies which consider the imprecision in perception typical for fuzzy sets but avoid as much as possible meaningless transformation of information through fuzzy connectives.
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 156–164, 2003. © Springer-Verlag Berlin Heidelberg 2003
Rough Sets and Gradual Decision Rules
157
The approach we propose for fuzzy extension of rough sets takes into account the above request. It avoids arbitrary choice of fuzzy connectives and not meaningful operations on membership degrees. Our approach belongs to the minority of fuzzy extensions of the rough set concept that do not involve fuzzy connectives and cardinal interpretation of membership degrees. Within this minority, it is related to the approach [14] using a-cuts on fuzzy similarity relation between objects. We propose a methodology of fuzzy rough approximation that infers the most cautious conclusion from available imprecise information. In particular, we observe that any approximation of knowledge about Y using knowledge about X is based on positive or negative relationships between premises and conclusions, i.e: i) “the more x is X, the more it is Y” (positive relationship), ii) “the more x is X, the less it is Y” (negative relationship). The following simple relationships illustrate i) and ii): “the larger the market share of a company, the larger its profit” (positive relationship) and “the larger the debt of a company, the smaller its profit” (negative relationship). These relationships have been already considered within fuzzy set theory under the name of gradual rules [3]. In this article, we are extending the concept of gradual rules by handling ambiguity of information through fuzzy rough approximations. Remark that the syntax of gradual rules is based on monotonic relationship that can also be found in dominance-based decision rules induced from preference-ordered data. From this point of view, the fuzzy rough approximation proposed in this article is related to the dominance-based rough set approach [7, 10, 11]. The plan of the article is the following. In the next section, we are defining syntax and semantics of considered gradual rules; we also show how they represent positive and negative relationships between fuzzy sets corresponding to premise and conclusion of a decision rule. In section 3, we are introducing fuzzy rough approximations consistent with the considered gradual rules; the fuzzy rough approximations create a base for induction of decision rules. Section 4 includes conclusions and remarks on further research directions.
2 Decision Rules with Positively or Negatively Related Premise and Conclusion We aim to obtain gradual decision rules of the following types: lower-approximation rules with positive relationship (LP-rule): "if condition X has credibility C(X)a, then decision Y has credibility C(Y)b"; lower-approximation rules with negative relationship (LN-rule): "if condition X has credibility C(X)a, then decision Y has credibility C(Y)b"; upper-approximation rule with positive relationship (UP-rule): "if condition X has credibility C(X)a, then decision Y could have credibility C(Y)b"; upper-approximation rule with negative relationship (UN-rule): "if condition X has credibility C(X)a, then decision Y could have credibility C(Y)b". These decision rules will be represented by triples <X,Y,f+>, <X,Y,f >, <X,Y,g+> and <X,Y,g >, respectively, where X is a given condition, Y is a given decision and f+,
158 -
S. Greco, M. Inuiguchi, and R. 6áRZL VNL -
f , g+, g :[0,1][0,1] are functions relating the credibility of X with the credibility of Y in lower- and upper-approximation rules, respectively. More precisely, functions f+, f , g+ and g permit to rewrite the above decision rules as follows: LP-rule: "if condition X has credibility C(X)a, then decision Y has credibility C(Y)b=f+(a)"; LN-rule: "if condition X has credibility C(X)a, then decision Y has credibility C(Y)b=f (a)"; UP-rule: "if condition X has credibility C(X)a, then decision Y could have credibility C(Y)b=g+(a)". UN-rule: "if condition X has credibility C(X)a, then decision Y could have credibility C(Y)b=g (a)". An LP-rule can be regarded as a gradual rule [3]; it can be interpreted as: the more object x is X, the more it is Y. In this case, the relationship between credibility of premise and conclusion is positive and certain. LN-rule can be interpreted in turn as: the less object x is X, the more it is Y, so the relationship is negative and certain. On the other hand, the UP-rule can be interpreted as: the more object x is X, the more it could be Y, so the relationship is positive and possible. Finally, UN-rule can be interpreted as: the less object x is X, the more it could be Y, so the relationship is negative and possible. Example A. Let us consider a hypothetical car selection problem in which the maximum speed is used for evaluation of cars. We assume that the decision maker evaluates 10 sample cars, listed in Table 1, from the viewpoint of acceptability for his/her purchase. The evaluation score is given as a degree of acceptability between 0 (strong rejection) and 1 (strong acceptance), as shown in the bottom row of Table 1. In the decision maker’s opinion, a car whose maximum_speed is more than 230 km/h is excellent and a car whose maximum_speed is less than 190 km/h is poor. Therefore, we may define fuzzy set speedy_car for the decision maker by the following membership function: mspeedy_car(maximum_speed(x)) 0 if max imum_speed (x ) < 190 = (max imum_speed (x ) − 190 ) / 40 if 190 ≤ max imum_speed (x ) ≤ 230 1 if max imum_speed (x ) > 230 where x is a variable corresponding to a car and maximum_speed(¼) is a function which maps from the set of cars to the maximum_speed. The values of mspeedy_car are shown in Table 2. They correspond to degree of satisfaction of the decision maker with respect to maximum_speed. From Table 2 we can induce the following functions f+(¼) and g+(¼):
Rough Sets and Gradual Decision Rules
f +(
if 0 ≤ < 0.25
0.2 0.4 ) = 0.6 0.9
if 0.25 ≤ < 0.5 if 0.5 ≤ < 0.75
0.3 0.8 ) = 0.9 1
and g + (
if 0.75 ≤ ≤ 1
if
159
=0
if 0 < ≤ 0.5 if 0.5 < ≤ 0.75
.
if 0.75 < ≤ 1
Table 1. Decision maker’s evaluation of sample cars Car: maximum_speed (km/h) Acceptability
A
B
C
D
E
F
G
H
I
J
180
200
210
200
190
240
210
220
200
230
0.3
0.4
0.6
0.8
0.2
0.9
0.7
0.9
0.7
1
Table 2. Degrees of membership to X=speedy_car, Y=acceptable_car Car:
A
B
C
D
0
0.25
0.5
0.25
0
1
0.3
0.4
0.6
0.8
0.2
0.9
mspeedy_car macceptable_car
+
E
F
G
H
I
J
0.5
0.75
0.25
1
0.7
0.9
0.7
1
+
Using functions f (¼) and g (¼), we can represent knowledge contained in Table 2 in terms of LP- and UP-decision rules having the following syntax: LP-rule: "if x is speedy_car with credibility mspeedy_car(maximum_speed(x))a, then x is acceptable_car with credibility f+(a)b"; UP-rule: "if x is speedy_car with credibility mspeedy_car(maximum_speed(x))a, then x is acceptable_car with credibility g+(a)b". The particular LP-rules induced from Table 2 are: "if mspeedy_car(maximum_speed(x))0, then macceptable_car(x)0.2", "if mspeedy_car(maximum_speed(x))0.25, then macceptable_car(x)0.4", "if mspeedy_car(maximum_speed(x))0.5, then macceptable_car(x)0.6", "if mspeedy_car(maximum_speed(x))0.75, then macceptable_car(x)0.9". The particular UP-rules induced from Table 2 are: "if mspeedy_car(maximum_speed(x))0, then m acceptable_car(x)0.3", "if mspeedy_car(maximum_speed(x))0.5, then m acceptable_car(x)0.8", "if mspeedy_car(maximum_speed(x))0.75, then m acceptable_car(x)0.9", "if mspeedy_car(maximum_speed(x))1, then m acceptable_car(x)1".
The syntax of LP- and UP-rules is similar to the syntax of “at least” and “at most” decision rules induced from dominance-based rough approximations of preferenceordered decision classes [7, 10, 11]. + Let us also observe that functions f+(a), f (a), g (a) and g (a) can be seen as modifiers [20] which give restrictions on the credibility of the conclusion on the basis of the credibility of the premise of a decision rule. Let us also mention that BouchonMeunier and Yao [2] studied fuzzy rules in relation with general modifiers and that
160
S. Greco, M. Inuiguchi, and R. 6áRZL VNL
Inuiguchi et al. [12] proposed to use modifiers for specification of possibility and necessity measures useful for induction of fuzzy rules.
3 Fuzzy Rough Approximations as a Basis for Rule Induction -
-
The functions f+(¼), f (¼), g+(¼) and g (¼) are related to specific definitions of lower and upper approximations considered within rough set theory [15]. Let us consider a universe of the discourse U and two fuzzy sets, X and Y, defined on U by means of membership functions mX:U[0,1] and mY:U[0,1]. Suppose that we want to approximate knowledge contained in Y using knowledge about X. Let us also adopt the hypothesis that X is positively related to Y, which means “the more x is in X, the more x is in Y”. Then, the lower approximation of Y given the information on X is a fuzzy set App+(X,Y), whose membership function for each x³U, denoted by m[App+(X,Y),x], is defined as follows: m[App+(X,Y),x] = inf{mY(z) | z³U, mX(z)mX(x)}. (1) Interpretation of lower approximation (1) is based on a specific meaning of the concept of ambiguity. According to knowledge about X, the membership of object w³U to fuzzy set Y is ambiguous at the level of credibility r if there exists an object z³U such that mX(w)<mX(z), however, mY(w)r while mY(z)
App +(X,Y), whose membership function for each x³X, denoted by
m[ App +(X,Y),x], is defined as follows: m[ App +(X,Y),x] = sup{mY(z) | z³U, mX(z)mX(x)}. (2) Interpretation of upper approximation (2) is based again on the above meaning of ambiguity. On the basis of knowledge about X and taking into account positive relationship between X and Y, x belongs to fuzzy set Y with some possible ambiguity at credibility level rw, where w is the greatest membership degree of z³U to Y, such that w is exactly equal to m[ App +(X,Y),x]. In other words, even if mY(x) would be smaller than w, then due to the fact that there exists z³U with mX(z)mX(x) but mY(z)=w>mY(x) (i.e. z and x are ambiguous), another object w, such that mX(w)=mX(x), could belong to Y with credibility w. Example A (cont.). According to definitions (1) and (2), the lower and upper approximations of the set of cars described in Table 1 are presented in Table 3. The concept of ambiguity and its relation with rough approximations can be illustrated on cars from Table 3. Let us consider car G. Its acceptability is 0.7. However, there is another car, called C, having not smaller membership to speedy_car than G, but a smaller membership degree to acceptable_car than G. Therefore, acceptability 0.7 for car G is ambiguous. The highest membership degree to
Rough Sets and Gradual Decision Rules
161
acceptable_car for which there is no ambiguity is equal to 0.6. Therefore, this value is the degree of membership of car G to the lower approximation of the set acceptable_car. Considering again car G, let us remark that there exists another car, called D, having not greater membership degree to speedy_car than G, but a greater membership degree to acceptable_car than G. Therefore, G is acceptable_car with the membership 0.8 which is possible for the reason of ambiguity between G and D. This is also the highest degree of membership of car G to acceptable_car taking into account the ambiguity; in consequence, it is the membership degree of car G to the upper approximation of acceptable_car. These remarks on the concept of rough approximation can be summarized by saying that for each car in Table 3, the degrees of membership to the lower and to the upper approximation of acceptable_car, respectively, give for each car the maximum degree of membership to acceptable_car without ambiguity and the maximum degree of membership to acceptable_car admitting some possible ambiguity. Table 3. Rough approximations of the set of cars (X=speedy_car, Y=acceptability)
Car:
A
mspeedy_car
B
C
D
E
0 0.25 0.5 0.25
0
F 1
G
H
I
0.5 0.75 0.25
J 1
macceptable_car
0.3
0.4
0.6
0.8
0.2 0.9 0.7
0.9
0.7
m[App (X,Y),x]
0.2
0.4
0.6
0.4
0.2 0.9 0.6
0.9
0.4 0.9
m[ App (X,Y),x]
0.8
0.8
0.8
0.8
0.3
0.9
0.8
+
+
1
0.8
1
1
One can remark that the definitions of App+(X,Y) and App +(X,Y) are useful if X is positively related with Y, i.e., when we suppose that “the more mY(x), the more mX(x)” or “the less mY(x), the less mX(x)”. When X is negatively related with Y, then “the more mY(x), the less mX(x)” or “the less mY(x), the more mX(x)”. In this case, the definitions of App+(X,Y) and App +(X,Y) are not useful, thus we need other approximations. The following definitions of the lower and upper approximations are appropriate when X is negatively related with Y: m[ App (X,Y),x] = inf{mY(z) | z³U, mX(z)mX(x)}, (3) -
m[ App (X,Y),x] = sup{mY(z) | z³U, mX(z)mX(x)}. (4) Definitions (3) and (4) have similar interpretation in terms of ambiguity as definitions (1) and (2), and they are also concordant with the dominance principle. Of course, approximations (3) and (4) are not useful when X is positively related with Y. When X is independent on Y, it is not worthwhile approximating Y by X. Theorem. The following properties are satisfied: 1) for each fuzzy set X and Y defined on U, and for each x³U 1.1) m[App+(X,Y),x] mY(x) m[ App +(X,Y),x],
162
S. Greco, M. Inuiguchi, and R. 6áRZL VNL -
-
1.2) m[App (X,Y),x] mY(x) m[ App (X,Y),x]. 2)
for any negation N(¼), being a function N:[0,1][0,1] decreasing and such that N(1)=0, N(0)=1, for each fuzzy set X and Y defined on U, and for each x³U -
+
-
2.1) m[App+(X,Yc),x] = N(m[ App (X,Y),x]) = N(m[ App (Xc,Y),x]) = m[App (Xc,Yc),x], -
+
-
2.2) m[ App +(X,Yc),x] = N(m[App (X,Y),x]) = N(m[App (Xc,Y),x]) = m[ App (Xc,Yc),x], -
+
-
+
2.3) m[App (X,Yc),x] = N(m[ App (X,Y),x]) = N(m[ App (Xc,Y),x]) = m[App (Xc,Yc),x], -
+
-
+
2.4) m[ App (X,Yc),x] = N(m[App (X,Y),x]) = N(m[App (Xc,Y),x]) = m[ App (Xc,Yc),x], where for a given fuzzy set W, Wc is its complement defined by µW c (x ) = N (µW ( x )) .
Results 1) and 2) of the Theorem can be read as fuzzy counterparts of the following results well-known within the classical rough set theory: 1) says that fuzzy set Y includes its lower approximation and is included in its upper approximation; 2) represents a complementarity property – it says that the lower (upper) approximation of fuzzy set Yc being positively related to X is the complement of the upper (lower) approximation of fuzzy set Y being negatively related with X (see first line of (2.1) and (2.2)); moreover, the lower (upper) approximation of fuzzy set Yc being negatively related with X is the complement of the upper (lower) approximation of fuzzy set Y positively related with X (see first line of (2.3) and (2.4)). Result 2) of the Theorem states also other complementarity properties related to the complement of X (see second line of (2.1)-(2.4)); this complementarity property has not been considered within the classical rough set theory. The lower and upper approximations defined above can serve to induce certain and approximate decision rules in the following way. Inferring lower and upper credibility rules is equivalent to finding functions f+(¼), f (¼), g+(¼) and g (¼). These functions are defined as follows: for each a³[0,1], f+(a) = sup {µ[ App + ( X ,Y ), x]} , µ X ( x ) ≤α
+
f (a) = sup {µ[ App − ( X ,Y ), x]} , -
µ X ( x ) ≥α
−
g+(a) = inf {µ[ App ( X ,Y ), x]} , and g (a) = inf {µ[ App ( X ,Y ), x]} . µ X ( x ) ≥α
-
µ X ( x ) ≤α
4 Conclusions and Further Research Directions In this paper we presented a new fuzzy rough set approach. The main advantage of this new approach is that it infers the most cautious conclusions from available imprecise information, without using any fuzzy connectives or specific parameters, whose choice is always subjective to some extent. Another advantage of our approach is that it uses only ordinal properties of membership degrees. We noticed that our approach is related to: gradual rules, with respect to syntax and semantics of considered decision rules, fuzzy rough set approach based on a-cuts, with respect to the specific use of sup and inf in the definition of lower and upper approximations,
− the dominance-based rough set approach, with respect to the idea of a monotonic relationship between the credibility of premise and conclusion.
We think that this approach gives a new prospect for applications of fuzzy rough approximations in real-world decision problems. More precisely, we envisage the following four extensions of this methodology: 1) multi-premise generalization; 2) variable-precision rough approximation; 3) reasoning with multiple extracted gradual decision rules; 4) imprecise input data represented by fuzzy numbers.
Acknowledgements. The research of the first author has been supported by the Italian Ministry of Education, University and Scientific Research (MIUR). The third author wishes to acknowledge financial support from the State Committee for Scientific Research and from the Foundation for Polish Science.
References 1.
Cattaneo, G., Fuzzy extension of rough sets theory, [in] L. Polkowski, A. Skowron (eds.), Rough Sets and Current Trends in Computing, LNAI 1424, Springer, Berlin, 1998, pp. 275–282
2. Bouchon-Meunier, B., Yao, J., Linguistic modifiers and gradual membership to a category, International Journal of Intelligent Systems, 7 (1992) 26–36
3. Dubois, D., Prade, H., Gradual inference rules in approximate reasoning, Information Sciences, 61 (1992) 103–122
4. Dubois, D., Prade, H., Putting rough sets and fuzzy sets together, [in] R. Słowiński (ed.), Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer, Dordrecht, 1992, pp. 203–232
5. Dubois, D., Prade, H., Yager, R., Fuzzy Information Engineering, J. Wiley, New York, 1997
6. Greco, S., Inuiguchi, M., Słowiński, R., Dominance-based rough set approach using possibility and necessity measures, [in] J.J. Alpigini, J.F. Peters, A. Skowron, N. Zhong (eds.), Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence, vol. 2475, Springer-Verlag, Berlin, 2002, pp. 85–92
7. Greco, S., Matarazzo, B., Słowiński, R., The use of rough sets and fuzzy sets in MCDM, Chapter 14 [in] T. Gal, T. Stewart, T. Hanne (eds.), Advances in Multiple Criteria Decision Making, Kluwer Academic Publishers, Boston, 1999, pp. 14.1–14.59
8. Greco, S., Matarazzo, B., Słowiński, R., Rough set processing of vague information using fuzzy similarity relations, [in] C.S. Calude and G. Paun (eds.), Finite Versus Infinite – Contributions to an Eternal Dilemma, Springer-Verlag, London, 2000, pp. 149–173
9. Greco, S., Matarazzo, B., Słowiński, R., Fuzzy extension of the rough set approach to multicriteria and multiattribute sorting, [in] J. Fodor, B. De Baets and P. Perny (eds.), Preferences and Decisions under Incomplete Knowledge, Physica-Verlag, Heidelberg, 2000, pp. 131–151
10. Greco, S., Matarazzo, B., Słowiński, R., Rough sets theory for multicriteria decision analysis, European J. of Operational Research 129 (2001) no. 1, 1–47
11. Greco, S., Matarazzo, B., Słowiński, R., Multicriteria classification, [in] W. Kloesgen and J. Żytkow (eds.), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York, 2002, chapter 16.1.9, pp. 318–328
12. Inuiguchi, M., Greco, S., Słowiński, R., Tanino, T., Possibility and necessity measure specification using modifiers for decision making under fuzziness, Fuzzy Sets and Systems, 2003 (to appear)
13. Inuiguchi, M., Tanino, T., New fuzzy rough sets based on certainty qualification, [in] S.K. Pal, L. Polkowski and A. Skowron (eds.), Rough-Neuro-Computing: Techniques for Computing with Words, Springer-Verlag, Berlin, 2002
14. Nakamura, A., Gao, J.M., A logic for fuzzy data analysis, Fuzzy Sets and Systems, 39 (1991) 127–132
15. Pawlak, Z., Rough Sets, Kluwer, Dordrecht, 1991
16. Polkowski, L., Rough Sets: Mathematical Foundations, Physica-Verlag, Heidelberg, 2002
17. Słowiński, R., Rough set processing of fuzzy information, [in] T.Y. Lin, A. Wildberger (eds.), Soft Computing: Rough Sets, Fuzzy Logic, Neural Networks, Uncertainty Management, Knowledge Discovery, Simulation Councils, Inc., San Diego, 1995, 142–145
18. Słowiński, R., Stefanowski, J., Rough set reasoning about uncertain data, Fundamenta Informaticae 27 (1996) 229–243
19. Yao, Y.Y., Combination of rough and fuzzy sets based on α-level sets, [in] T.Y. Lin and N. Cercone (eds.), Rough Sets and Data Mining: Analysis for Imprecise Data, Kluwer, Boston, 1997, pp. 301–321
20. Zadeh, L.A., A fuzzy set-theoretic interpretation of linguistic hedges, J. Cybernetics, 2 (1972) 4–34
Explanation Oriented Association Mining Using Rough Set Theory Y.Y. Yao, Y. Zhao, and R.B. Maguire Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {yyao, yanzhao, rbm}@cs.uregina.ca
Abstract. This paper presents a new philosophical view and methodology for data mining. A framework of explanation oriented data mining is proposed and studied with respect to association mining. The notion of conditional associations is adopted, which explicitly expresses the conditions under which an association occurs. To illustrate the basic ideas, the theory of rough sets is used to construct explanations.
1
Introduction
In this paper, we view data mining as a specific field of study concerning theories, methodologies, and in particular, computer systems for exploring and analyzing a large amount of data [2]. A data mining system is designed with an objective to automatically discover, or to assist a human expert to discover, knowledge embedded in data [4]. This view allows us to examine data mining in a wide context with respect to revolutions, theories, and creativity in science [3]. The examination in turn may help us to see the limitations of current data mining research and the need for a reconsideration of fundamental issues of the field. In the development of many branches of science such as mathematics, physics, chemistry, and biology, the discovery of a natural phenomenon is only the first step. The important subsequent tasks for scientists are to build a theory accounting for the phenomenon and to provide justifications, interpretations, and explanations of the theory. The interpretations and explanations enhance our understanding of the phenomenon and guide us to make rational decisions. This general observation sheds new light on the study of data mining, if one draws an analogy between human centered scientific discovery tasks and data mining tasks. Current research in data mining focuses mainly on the task of discovering a natural phenomenon, represented as patterns, rules, or clusters in data. Little attention is paid to the tasks of constructing theories and models that account for a discovered phenomenon. In other words, researchers concentrate on the task of detecting the existence of a pattern, and have not moved a step further to the task of searching for the underlying reasons that explain the existence of the pattern. Unless we take this further step to the explanation task, the effectiveness of data mining systems will remain limited. The main objective of this paper is to propose and evaluate a framework of explanation oriented data mining. On top of the traditional tasks of pattern
discovery and evaluation, we add the tasks of explanation construction and evaluation. Furthermore, we adopt supervised learning methods to search for and evaluate plausible explanations. The contribution of the paper is not a new algorithm, but a new philosophical view and a new methodology, namely, that data mining must deal with both discovery and explanation. One may argue that many researchers have in fact implicitly considered the issues we address here, and hence question the significance of the approach. If one takes another look at the whole picture in the wide context presented earlier, it is easy to come to another conclusion. The recognition and identification, as well as a clear description, of the new philosophical view, the explicit separation of explanation construction and evaluation from other tasks, and a general framework unifying all these notions, may make a fundamental contribution to the development of the field.
2
A Framework of Explanation Oriented Data Mining
The notion of information tables is used as a formal model to represent and interpret various notions of data mining, such as data, patterns and rules [15]. A framework of explanation oriented data mining is presented and compared with related studies. 2.1
Basic Notions of Data Mining in Information Tables
An information table is defined as a system: S = (U, At, L, {Va | a ∈ At}, {fa | a ∈ At}), where U is a finite nonempty set of objects called the universe, At is a finite nonempty set of attributes, L is a language, Va is a nonempty set of values for each a ∈ At, and fa : U → Va is an information function mapping an object to a value in Va. In the language L, an atomic formula φ is given by a = v, where a ∈ At and v ∈ Va. All formulas of L are defined recursively by the logical connectives ¬, ∧, ∨, →, and ↔. For an atomic formula a = v, an object x in U satisfies a = v if fa(x) = v. The satisfiability of all formulas can be easily established. Given a formula φ, we associate with it a set m(φ) = {x ∈ U | x satisfies φ}, called the meaning of φ in S. In this way, we define a pair (φ, m(φ)), which represents the basic level of knowledge in S. The pair (φ, m(φ)) is commonly known as a formal concept with φ as its intension and m(φ) as its extension [15]. The next level of knowledge can be summarized by relationships between concepts. Given two concepts (φ, m(φ)) and (ψ, m(ψ)), a relationship between them is written in a general form through their intensions as φ ◦ ψ. The symbol ◦ can be interpreted in many ways [16]. For example, ◦ may be interpreted as a one-way association denoted by ⇒, a two-way association denoted by ⇔, or similarity denoted by ≈. Statistical information about two concepts is given by the contingency table in terms of their extensions:
         ψ        ¬ψ       Totals
φ        a        b        a + b
¬φ       c        d        c + d
Totals   a + c    b + d    a + b + c + d = n

a = |m(φ ∧ ψ)|, b = |m(φ ∧ ¬ψ)|, c = |m(¬φ ∧ ψ)|, d = |m(¬φ ∧ ¬ψ)|.

In the table, |·| denotes the cardinality of a set, and |U| = n. Relationships of concepts can be defined, identified, classified, and interpreted using the information in the contingency table. For example, we have [15,16]:

support(φ ∧ ψ) = |m(φ ∧ ψ)| / |U| = a/n, (1)
similarity(φ ≈ ψ) = |m(φ ∧ ψ)| / |m(φ ∨ ψ)| = a/(a + b + c), (2)
confidence(φ ⇒ ψ) = |m(φ ∧ ψ)| / |m(φ)| = a/(a + b), (3)
coverage(φ ⇒ ψ) = |m(φ ∧ ψ)| / |m(ψ)| = a/(a + c). (4)
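As a concrete illustration, the following minimal sketch (with hypothetical data and predicate names) computes the four counts and the measures (1)-(4) directly from meaning sets:

```python
# Sketch of the contingency counts and measures (1)-(4); formulas are
# represented as Python predicates over attribute-value objects.
U = [
    {"age": "young", "buys": "yes"},
    {"age": "young", "buys": "no"},
    {"age": "old",   "buys": "yes"},
    {"age": "old",   "buys": "no"},
    {"age": "young", "buys": "yes"},
]
phi = lambda x: x["age"] == "young"   # phi: age = young
psi = lambda x: x["buys"] == "yes"    # psi: buys = yes

a = sum(1 for x in U if phi(x) and psi(x))          # |m(phi AND psi)|
b = sum(1 for x in U if phi(x) and not psi(x))      # |m(phi AND NOT psi)|
c = sum(1 for x in U if not phi(x) and psi(x))      # |m(NOT phi AND psi)|
d = sum(1 for x in U if not phi(x) and not psi(x))  # |m(NOT phi AND NOT psi)|
n = a + b + c + d

print("support    =", a / n)            # (1)
print("similarity =", a / (a + b + c))  # (2)
print("confidence =", a / (a + b))      # (3)
print("coverage   =", a / (a + c))      # (4)
```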
A task of data mining is to discover useful relationships between concepts based on these quantities. For example, in association rule mining, one is interested in finding all one-way associations φ ⇒ ψ whose support support(φ ∧ ψ) and confidence confidence(φ ⇒ ψ) are above certain threshold values. 2.2
Explanation Oriented Data Mining
A data mining process typically consists of the following steps: data cleaning, data integration, data selection, data transformation, pattern discovery and evaluation, and presentation [5]. Conceptually, explanation oriented data mining inserts an additional step called explanation construction and evaluation into the data mining process. Its main function is to search for the underlying reasons that explain the existence of a discovered pattern. The basic idea of explanation oriented data mining is presented in this section with respect to association mining. Suppose we have discovered an association from a transaction database by using the well-known Apriori algorithm [1]. The identification of an association is only the first step. We need to understand the meaning as well as the implications of the association. The previously discussed measures provide some useful information, but offer no explanation regarding why the association exists. In many situations, such an explanation is needed if one wants to justify any decision made based on the discovered pattern. One can easily be convinced that plausible explanations cannot be obtained from the original transaction database. In other words, we need to find explanations from other data sources or related domain specific information. Explanations for an association may be obtained in several ways. If additional information about customers is available, we may search for an explanation for
the association based on customer features. If the time of every transaction is available, we may explain the association in terms of transaction time. If product information is available, we may explain the association through product characteristics. Each of these explanations may be useful to different types of users of a data mining system. A method of explanation construction and evaluation is given below. A transaction database can be easily expressed as a binary information table called a transaction table. Let φ denote a particular association in a set U of transactions.
1. We introduce a binary attribute named association. Given a transaction x ∈ U, its value on association is 1 if it satisfies φ in the original transaction table; otherwise its value is 0.
2. We select a set E of attributes related to possible explanations of association. For example, we can choose a set of attributes of customers who made the transactions.
3. We construct an information table by using the attributes obtained from (1) and (2). The new table is called an explanation table.
4. By treating association as the target class, we can apply any supervised machine learning method to derive classification rules of the form c ⇒ association = 1, which corresponds to the conditional association φ | c. The condition c is a formula in the explanation table, which clearly states the condition under which the association φ occurs.
5. We evaluate conditional associations based on the previously discussed statistical measures.
With a conditional association, one explicitly states the conditions under which the association occurs. If conditional associations are constructed properly, they provide an explanation for the original association. The basic idea of explanation oriented data mining can be applied to explain results from any unsupervised learning method. Suppose one applies an unsupervised clustering algorithm to a data set. For a particular cluster, we can construct an explanation table with elements from the cluster as positive instances. A supervised machine learning algorithm can then be used to construct explanations for that cluster. It should be noted that the above procedure for explanation construction and evaluation is only one of the possible solutions. The effectiveness of the procedure needs to be evaluated experimentally on real world data. An advantage of the procedure is that it only uses existing data mining and machine learning algorithms, and hence can be easily added to any existing data mining system. Moreover, it provides another example to demonstrate that one can apply existing data mining technologies to obtain totally new results. Future data mining research needs to pay more attention to how to apply existing algorithms effectively, in addition to the study of new algorithms.
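The following is an illustrative sketch of steps 1-5 above, assuming hypothetical customer attributes and a precomputed 0/1 association column; none of the data or names come from the paper:

```python
# Sketch of an explanation table and the scoring of candidate
# conditions c for the conditional association phi | c.
explanation_table = [
    {"age": "[30,49]", "gender": "m", "association": 1},
    {"age": "[20,29]", "gender": "f", "association": 0},
    {"age": "[30,49]", "gender": "f", "association": 1},
    {"age": "[20,29]", "gender": "m", "association": 0},
    {"age": "[30,49]", "gender": "m", "association": 1},
]

def cond_support(rows, a, v):
    # support(phi | a = v) = |m(association = 1 AND a = v)| / |m(a = v)|
    matching = [r for r in rows if r[a] == v]
    if not matching:
        return 0.0
    return sum(r["association"] for r in matching) / len(matching)

# step 5: evaluate every atomic condition as a candidate explanation
for a in ("age", "gender"):
    for v in sorted({r[a] for r in explanation_table}):
        s = cond_support(explanation_table, a, v)
        print(f"support(phi | {a} = {v}) = {s:.2f}")
```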
2.3
Comparisons with Existing Studies
Two specific research directions of data mining are closely related to explanation oriented data mining: pattern evaluation and multi-database mining. A comparative study of these topics will put explanation oriented data mining into a proper perspective. A discovered piece of knowledge is considered to be interesting if it is novel (new), potentially useful, understandable, actionable, or profitable. Many proposals have been made to precisely quantify such an intuitive notion of interestingness. From extensive existing studies, one can identify roughly two classes of approaches, the statistics centered approaches and the semantics centered approaches. They lead to different philosophies in designing data mining solutions to real world problems. Statistics centered approaches focus on the statistical characteristics of discovered patterns [16]. They are user, application and domain independent. A pattern is deemed interesting if it has certain statistical properties. Different classes of rules can be identified based on statistical characteristics [17], such as peculiarity rules (low support and high confidence), exception rules (low support and high confidence, but complementary to other high support and high confidence rules), and outlier patterns (far away from the statistical mean). Although statistical information provides an effective indicator of the nature of a pattern, its usefulness is limited. Semantics centered approaches are application and user dependent. In addition to statistical information, one incorporates other domain specific knowledge such as user interest, utility, value, profit, actionability, and so on. In constraint based mining, the user provides constraints that specify the type of knowledge, the ranges of a measure, or the subset of the database to be mined [6]. In utility/profit based mining, a profit is associated with each item and the usefulness of discovered patterns is determined by their profits [7,13]. In actionable data mining, a pattern is considered to be useful only if it leads to profitable actions [8,9]. Both the statistical and the semantical approaches of pattern evaluation focus on different features of patterns. Their main function is to filter out patterns that are not worth further investigation. They only examine and summarize characteristics of the discovered patterns, and do not provide an explanation for the existence of a pattern. On the other hand, measures and methods used in pattern evaluation can be applied to explanation evaluation. In the framework of explanation oriented data mining, we explicitly use two sets of data. It is tempting to conclude that the proposed framework is simply a multi-database mining model. This conclusion, based on a superficial similarity, is simply not valid. Existing studies on multi-database mining mainly focus on pattern discovery rather than explanation construction. Patterns discovered in multi-database mining also need to be explained. On the other hand, multiple data sources are used for a quite different purpose in explanation oriented data mining. In general, explanation construction may be carried out by using domain knowledge without explicitly using a second data source.
3
Construction of Explanation Using Rough Set Theory
The theory of rough sets has been used as a basis for the design and implementation of many supervised machine learning algorithms [10,11]. There are two types of learning algorithms based on rough set theory, namely, attribute oriented induction and granule (i.e., attribute-value pair) oriented induction [14]. One can immediately apply rough set based methods to the explanation table to construct explanations. A key notion of attribute oriented induction is the reduct. With respect to the attribute association, a reduct is a set of attributes that are individually necessary and jointly sufficient to define association. In other words, we need each attribute in a reduct to explain the association, and all attributes in a reduct together are sufficient to explain the association. Many algorithms have been proposed for computing a reduct. Let E be the set of attributes in the explanation table. The following is an outline of an algorithm for finding a reduct:
1. Let R = ∅ and S = E,
2. If R is a reduct or S = ∅, exit,
3. Select an attribute a from S based on a criterion,
4. Add a to R and remove a from S,
5. Go back to step 2.
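The outline leaves the stopping test and the selection criterion open. One minimal reading is sketched below; the criterion (reduce the number of objects that are identical on R but differ on association) is an assumption for illustration, not something prescribed by the paper:

```python
# Greedy sketch of the reduct outline; the step-3 criterion is assumed
# to be "minimize the number of conflicting objects".
def conflicts(rows, attrs):
    # count objects identical on attrs but differing on 'association'
    seen = {}
    bad = 0
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key in seen and seen[key] != r["association"]:
            bad += 1
        seen.setdefault(key, r["association"])
    return bad

def greedy_reduct(rows, E):
    R, S = [], list(E)                                      # step 1
    while S and conflicts(rows, R) > 0:                     # step 2
        a = min(S, key=lambda a: conflicts(rows, R + [a]))  # step 3
        R.append(a)                                         # step 4
        S.remove(a)
    return R  # R now defines 'association' without conflicts, if possible
```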
The criterion in step 3 can be defined in terms of lower and upper approximations. It can also be defined based on user preference of attributes. The condition in step 2 can also be modified so that the procedure would stop as soon as a set of attributes that provides reasonable explanations is found. From a reduct R, all combinations of attribute values in R will be used as conditions c in the conditional association φ | c. Those conditional associations that meet certain criteria will be presented as potential explanations. Alternatively, we can apply granule oriented induction. In this case, we focus on a single explanation each time. An outline of such an algorithm is given by:
1. Let c = 1,
2. Let a = v denote an atomic formula maximizing support(φ | (a = v)),
3. Add a = v to c by conjunction,
4. If support(φ | c) > support(φ), exit; otherwise, go back to step 2.
In the algorithm, the support of the conditional association is given by:

support(φ | c) = |m(association = 1 ∧ c)| / |m(c)|. (5)
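Read literally, the granule oriented loop ranks atomic descriptors by their individual conditional support and conjoins them until the condition in step 4 fires. A minimal sketch, using the same hypothetical explanation-table layout as before:

```python
# Sketch of granule oriented induction; 'rows' is an explanation table
# with a 0/1 'association' column.
def support_given(rows, c):
    # support(phi | c), with c a list of (attribute, value) descriptors
    matching = [r for r in rows if all(r[a] == v for a, v in c)]
    if not matching:
        return 0.0
    return sum(r["association"] for r in matching) / len(matching)

def grow_condition(rows, attrs):
    base = support_given(rows, [])                 # support(phi)
    ranked = sorted({(a, r[a]) for r in rows for a in attrs},
                    key=lambda av: -support_given(rows, [av]))  # step 2
    c = []                                         # step 1: c = 1
    for av in ranked:
        c.append(av)                               # step 3: conjunction
        if support_given(rows, c) > base:          # step 4: stop criterion
            return c
    return c
```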
The condition in step 4 suggests that under condition c the association is more pronounced, and thus may provide a plausible explanation. Other evaluation criteria can also be used in step 4. We have carried out a real experiment to evaluate the effectiveness of explanation oriented mining. We generated a Web site consisting of pages on several
topics, such as Business, Finance, and so on. We built a user profile database by collecting relevant information believed to be related to user browsing patterns. By applying the Apriori algorithm to Web log files, we found an association φ = Business ∧ Finance with a support of 5.71%. To construct an explanation for this association, we used the Rosetta system [12], a rough set based tool kit, to learn conditional associations from the explanation table. The following results were obtained:
– reduct length is one:
• support(φ | age = [30, 49]) = 9.1%;
• support(φ | gender = female) = 8.0%;
• support(φ | occupation = student) = 6.7%;
• support(φ | occupation = employee) = 5.7%.
– reduct length is two:
• support(φ | gender = male ∧ age = [30, 49]) = 13.0%;
• support(φ | occupation = employee ∧ age = [20, 29]) = 6.0%.
We obtained a set of conditional associations with support values higher than that of the unconditional association. The set of conditions provides some interesting explanations of the browsing pattern. The results clearly show that age, gender and occupation influence the users' browsing behavior. Such knowledge is useful to a Web site designer.
4
Conclusion
A new philosophical view and methodology called explanation oriented data mining is introduced. It is argued that the effectiveness of current data mining systems is unnecessarily limited by a lack of explanation of discovered knowledge. In order to resolve this problem, we suggest inserting another step, namely, explanation construction and evaluation, into the commonly accepted data mining process. A framework of explanation oriented mining is proposed. It is recognized that explanations may not be found using the original data table. Additional information is collected to form an explanation table. The results of current data mining systems can be used to define a classification problem in the explanation table. Consequently, any standard machine learning algorithm can be used to construct plausible explanations. Association mining is used to illustrate the main idea of explanation oriented data mining. Explanations are coded as conditions in conditional associations. Rough set based machine learning algorithms are used to search for such conditions.
References
1. Agrawal, R., Imielinski, T. and Swami, A. Mining association rules between sets of items in large databases, Proceedings of SIGMOD, 207–216, 1993.
2. Berry, M.J.A. and Linoff, G.S. Mastering Data Mining, John Wiley & Sons, Inc., New York, 2000.
3. Bohm, D. and Peat, F.D. Science, Order, and Creativity, 2nd Edition, Routledge, London, 2000.
4. Fayyad, U.M., Piatetsky-Shapiro, G. (eds.) Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.
5. Han, J. and Kamber, M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
6. Han, J., Lakshmanan, L. and Ng, T. Constraint-based, multidimensional data mining, Computer magazine, 1999.
7. Lin, T.Y., Yao, Y.Y. and Louie, E. Mining value added association rules, Proceedings of PAKDD'02.
8. Ling, C., Chen, T., Yang, Q. and Chen, J. Mining optimal actions for intelligent CRM, Proceedings of ICDM, 2002.
9. Liu, B., Hsu, W. and Ma, Y. Identifying non-actionable association rules, Proceedings of KDD, 329–334, 2001.
10. Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1991.
11. Polkowski, L. and Skowron, A. (eds.) Rough Sets in Knowledge Discovery I, II, Physica-Verlag, Heidelberg, 1998.
12. Rosetta, a rough set toolkit for analyzing data, http://www.idi.ntnu.no/˜aleks/rosetta
13. Wang, K., Zhou, S. and Han, J. Profit mining: from patterns to actions, Proceedings of EDBT, 70–87, 2002.
14. Yao, J.T. and Yao, Y.Y. Induction of classification rules by granular computing, Proceedings of RSCTC'02, 331–338, 2002.
15. Yao, Y.Y. Modeling data mining with granular computing, Proceedings of the 25th Annual International Computer Software and Applications Conference (COMPSAC 2001), 638–643, 2001.
16. Yao, Y.Y. and Zhong, N. An analysis of quantitative measures associated with rules, Proceedings of PAKDD, 479–488, 1999.
17. Zhong, N., Yao, Y.Y., Ohshima, M. and Ohsuga, S. Interestingness, peculiarity, and multi-database mining, Proceedings of ICDM, 566–573, 2001.
Probabilistic Rough Sets Characterized by Fuzzy Sets Li-Li Wei and Wen-Xiu Zhang Institute of Information and System Science, Faculty of Science, Xi’an Jiaotong University, Xi’an, 710049, People’s Republic of China [email protected]
Abstract. In this paper, fuzziness in probabilistic rough sets is studied by means of fuzzy sets. We show that the variable precision approximation of a probabilistic rough set can be generalized from the vantage point of the cuts of a fuzzy set which is determined by the rough membership function. As a result, the fuzzy set can be used conveniently to describe the features of a rough set. Moreover, we give a measure of fuzziness, fuzzy entropy, induced by roughness in a probabilistic rough set, and make some characterizations of this measure. For three well-known entropy functions, we show that the finer the information granulation is, the less the fuzziness in a rough set. The superiority of fuzzy entropy over Pawlak's accuracy measure is illustrated with examples. Finally, the fuzzy entropy of a rough classification is defined by the fuzzy entropy of the corresponding rough sets, and we show that one possible application of it is to measure the inconsistency in a decision table.
1
Introduction
Theories of fuzzy sets [16] and rough sets [7,8] are generalizations of classical set theory for modelling vagueness and uncertainty. There have been many studies on their connections and differences [2,5,8,14]. Dubois and Prade [5] made an investigation around these two notions and reported that they are not rival theories but two different mathematical tools aimed at two different purposes. Chakrabarty, Biswas and Nanda [2] have given a measure of fuzziness in a rough set and made some characterizations of this measure. The probabilistic (stochastic) rough set model was introduced by Wong and Ziarko [12], which generalized Pawlak's original idea. Pawlak, Wong and Ziarko [10] reviewed and compared the fundamental results in the probabilistic and deterministic models of rough sets. The focus of this paper is to characterize probabilistic rough sets by a fuzzy set which is determined by the rough membership function. We show how the concept of approximations of a rough set can be generalized from the vantage point of the cut of a fuzzy set which is determined by the rough membership function; simultaneously, the characters of the cut of a fuzzy set can be used
conveniently to describe the features of a rough set [17]. Moreover, we give uncertainty measures for probabilistic rough sets and rough classifications using fuzzy entropy [4]; this differs from the information-theoretic measures of uncertainty for rough sets [1,3,6,11]. Some characterizations of this measure are obtained, and we show that the finer the granularity of information is, the less fuzziness there is in a rough set. The basic ideas of fuzzy sets and the measure of fuzziness are reviewed in Section 2, in which we also present the probabilistic approximation space. In Section 3, we describe the approximation of a variable precision rough set by the rough membership function; some characterizations of the approximation are easily made by the cuts of the fuzzy set. In Section 4, we focus on the measure of fuzziness in a probabilistic rough set; some characterizations of this measure, including the relation between the fuzziness in a rough set and the granularity of knowledge, are studied with examples. In Section 5, the fuzzy entropy of a rough classification is defined, and we show that one possible application of it is to measure the inconsistency in a decision table.
2
Fuzzy Sets and Rough Sets
Let U be a universal set and F(U) denote the set of all fuzzy subsets of U. For A ∈ F(U), let A(x) be the membership function of the fuzzy set A. For α ∈ [0, 1], A_α = {x ∈ U | A(x) ≥ α} and A_{α+} = {x ∈ U | A(x) > α} are the α-cut and the strong α-cut of A, respectively. In particular, SuppA = A_{0+} and KerA = A_1 are respectively called the support and the kernel of the fuzzy set A. The degree of fuzziness is assumed to express on a global level the difficulty of deciding which elements belong and which do not belong to a given fuzzy set. Mathematically, a measure of fuzziness is a map E from F(U) to [0, 1] satisfying the following conditions:
(a1) E(A) = 0 if and only if A is a crisp subset of U;
(a2) E(A) assumes the maximum value if and only if A(x) ≡ 1/2, ∀x ∈ U;
(a3) if |A(x) − 0.5| ≥ |B(x) − 0.5| for all x ∈ U, then E(A) ≤ E(B);
(a4) E(A) = E(A^c), where A^c(u) = 1 − A(u), ∀u ∈ U.
As De Luca mentioned in [4], the measure of fuzziness E(A) can be regarded as the “entropy” of a fuzzy set A; a formulation of this entropy has been given as follows:

E(A) = ∫_{x∈U} h(A(x)) p(x) dx, (1)

where p(x) is the probability density of the available data in U and h : [0, 1] → [0, 1] is the entropy function. In this paper we deal with the following well-known entropy functions:

h1(x) = −x log₂ x − (1 − x) log₂(1 − x), x ∈ (0, 1), (2)
h2(x) = 2[x ∧ (1 − x)], x ∈ [0, 1], (3)
h3(x) = 4x(1 − x), x ∈ [0, 1]. (4)
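For a finite universe the integral in (1) becomes a probability-weighted sum. A minimal sketch of the three entropy functions and of this finite form (the code is illustrative and not from the paper):

```python
import math

def h1(x):  # Shannon function, extended by h1(0) = h1(1) = 0
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def h2(x):  # linear index of fuzziness
    return 2 * min(x, 1 - x)

def h3(x):  # Gini index
    return 4 * x * (1 - x)

def fuzzy_entropy(h, membership, prob):
    # finite form of (1): E(A) = sum over x of h(A(x)) * p(x)
    return sum(h(m) * p for m, p in zip(membership, prob))

A = [0.9, 0.5, 0.1, 1.0]
p = [0.25, 0.25, 0.25, 0.25]      # discrete uniform distribution
print(fuzzy_entropy(h2, A, p))    # 0.35, equal to 1 - (1/N) * D(A, A^c)
```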
The entropy determined by h_k(x) will be denoted E_k, k = 1, 2, 3, where h1(x) is the so-called Shannon function and E1 is the Shannon fuzzy entropy. If U = {x1, x2, ..., xN} is a finite set, then the integral in formula (1) is replaced by a sum. In particular, when p(x) is the discrete uniform distribution, E2(A) = 1 − (1/N) D(A, A^c), where D(A, B) = Σ_{k=1}^{N} |A(x_k) − B(x_k)|. So E2 is the generalized index of linear fuzziness, or Yager's fuzziness [13]. E3 is the Gini index of a fuzzy set. A probabilistic approximation space can be defined as a triple A = (U, R, P), where (U, R) is a Pawlak approximation space and P is a probability distribution on U. In this context, each subset of U corresponds to a random event. The primary objective here is to characterize an expert concept X ⊆ U in A by the known concepts Xi (i = 1, ..., n), the equivalence classes of R. Let P(X|Xi) denote the probability of event X conditioned on event Xi. In other words, P(X|Xi) is the probability that a randomly selected object with the description of concept Xi belongs to X. In terms of these conditional probabilities, one can define the lower approximation A(X) and the upper approximation Ā(X) as follows [12]:

A(X) = {x ∈ U | P(X | [x]_R) = 1},  Ā(X) = {x ∈ U | P(X | [x]_R) > 0}. (5)
3
Variable Precision Rough Set Model
Let A = (U, R, P) be a probabilistic approximation space and suppose X ⊆ U. For an element x ∈ U, the degree of rough belongness of x in X is given by P(X|[x]_R). This immediately induces a fuzzy set F_X^R of U, whose membership function is given by

F_X^R(x) = P(X|[x]_R) = P(X ∩ [x]_R) / P([x]_R). (6)

For convenience, F_X^R(x) = 1 if P([x]_R) = 0. It is not difficult to obtain the following properties:
Proposition 3.1. Let A = (U, R, P) be a probabilistic approximation space and let X and Y be subsets of U; then the following hold:
(b1) F_X^R(x) = F_X^R(y), if xRy, x, y ∈ U;
(b2) F_{X^c}^R = (F_X^R)^c;
(b3) F_{X∪Y}^R(x) = F_X^R(x) + F_Y^R(x) − F_{X∩Y}^R(x), ∀x ∈ U;
(b4) max(0, F_X^R(x) + F_Y^R(x) − 1) ≤ F_{X∩Y}^R(x) ≤ min(F_X^R(x), F_Y^R(x)), ∀x ∈ U;
(b5) max(F_X^R(x), F_Y^R(x)) ≤ F_{X∪Y}^R(x) ≤ min(1, F_X^R(x) + F_Y^R(x)), ∀x ∈ U;
(b6) if X ⊆ Y ⊆ U, then F_X^R ⊆ F_Y^R;
(b7) F_{X∪Y}^R ⊇ F_X^R ∪ F_Y^R, F_{X∩Y}^R ⊆ F_X^R ∩ F_Y^R.
Based on the definition of the fuzzy set F_X^R, the lower and upper approximations of X (defined by formula (5)) are the kernel and the support of the fuzzy set F_X^R, respectively. Inspired by the variable precision rough set model [17], and using the notion of cut of a fuzzy set, we obtain the following generalized notion of a variable precision probabilistic rough set model.
For 0 ≤ β ≤ α < 1, the (α, β)-approximation of X is defined by

A(X; α) = {x ∈ U | P(X | [x]_R) > α} = (F_X^R)_{α+},
Ā(X; β) = {x ∈ U | P(X | [x]_R) ≥ β} = (F_X^R)_β. (7)

That is, the lower approximation A(X; α) is the strong α-cut of the fuzzy set F_X^R, and the upper approximation Ā(X; β) is the β-cut of F_X^R. If α = β = 1/2, then definition (7) is the same as in [10]. Some properties are easily obtained by using the properties of cuts of fuzzy sets and Proposition 3.1.
Proposition 3.2. Let A = (U, R, P) be a probabilistic approximation space; then for 0 ≤ β ≤ α < 1 and X, Y ⊆ U the following hold:
(c1) A(∅; α) = Ā(∅; β) = ∅, A(U; α) = Ā(U; β) = U;
(c2) A(X; α) ⊆ Ā(X; β);
(c3) A(X; α) = [Ā(X^c; 1 − α)]^c, Ā(X; β) = [A(X^c; 1 − β)]^c;
(c4) Ā(X ∪ Y; β) ⊇ Ā(X; β) ∪ Ā(Y; β), A(X ∩ Y; α) ⊆ A(X; α) ∩ A(Y; α);
(c5) Ā(X ∩ Y; β) ⊆ Ā(X; β) ∩ Ā(Y; β), A(X ∪ Y; α) ⊇ A(X; α) ∪ A(Y; α);
(c6) X ⊆ Y ⟹ A(X; α) ⊆ A(Y; α), Ā(X; β) ⊆ Ā(Y; β);
(c7) r ≤ s ⟹ A(X; s) ⊆ A(X; r), Ā(X; s) ⊆ Ā(X; r).
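A minimal sketch of definition (7), assuming F_X^R is given as a dict from objects to P(X | [x]_R) (the representation is illustrative):

```python
# (alpha, beta)-approximations as cuts of the fuzzy set F_X^R.
def lower_approx(F, alpha):
    # strong alpha-cut: {x : F(x) > alpha}
    return {x for x, m in F.items() if m > alpha}

def upper_approx(F, beta):
    # beta-cut: {x : F(x) >= beta}
    return {x for x, m in F.items() if m >= beta}

F = {"x1": 1.0, "x2": 0.7, "x3": 0.3, "x4": 0.0}
print(lower_approx(F, 0.5))  # {'x1', 'x2'}
print(upper_approx(F, 0.3))  # {'x1', 'x2', 'x3'}
```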
4
Fuzziness in a Probabilistic Rough Set
There are two types of uncertainty in rough sets and rough classifications. One arises from partitioning all values into a finite set of equivalence classes; many methods for managing this uncertainty have been developed based on information theory [1,3,6,11]. The other uncertainty is modelled through the boundary area of a rough set, where elements of the lower approximation region have total participation in the rough set and those of the upper approximation region have uncertain participation in the rough set. The accuracy measure [7,8] represents the degree of completeness of knowledge about the rough set. Given a probabilistic approximation space A = (U, R, P), we know from Section 3 that each subset X of U induces a fuzzy set F_X^R ∈ F(U). In this section, we discuss the fuzzy entropy of a probabilistic rough set; it is expected to manage both types of uncertainty.
Definition 4.1. Let A = (U, R, P) be a probabilistic approximation space and suppose X ⊆ U; the fuzzy entropy in the rough set X is denoted by E_X^R and is defined as the amount of fuzzy entropy in the fuzzy set F_X^R.
Example 4.1. Let A = (U, R, P) be a probabilistic approximation space, where U = {0, 1, 2, 3, ..., 10} and the equivalence classes determined by the indiscernibility relation R are as follows: U/R = {{0, 5, 8, 10}, {1, 6}, {2, 7}, {3, 4, 9}}. Suppose the probability P has the binomial distribution B(10, 1/2). Let us consider the subset X = {1, 2, 4, 5, 6, 7, 10} of U. It is easy to see:

F_X^R = (253/299, 1, 1, 21/34, 21/34, 253/299, 1, 1, 253/299, 21/34, 253/299).

For the entropy functions h1(x), h2(x), h3(x), given by formulas (2)-(4), we have

(E_X^R)_1 = 0.4995; (E_X^R)_2 = 0.3438; (E_X^R)_3 = 0.4657.
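The numbers in Example 4.1 can be checked directly; the following sketch recomputes F_X^R from the binomial weights and reproduces the three entropy values (math.comb supplies the binomial coefficients):

```python
# Verification sketch for Example 4.1.
import math

classes = [{0, 5, 8, 10}, {1, 6}, {2, 7}, {3, 4, 9}]
X = {1, 2, 4, 5, 6, 7, 10}
p = {i: math.comb(10, i) / 2**10 for i in range(11)}   # B(10, 1/2)

F = {}
for cls in classes:
    mu = sum(p[i] for i in cls & X) / sum(p[i] for i in cls)
    for i in cls:
        F[i] = mu                  # F_X^R is constant on each class

def h1(x): return 0.0 if x in (0, 1) else -x*math.log2(x) - (1-x)*math.log2(1-x)
def h2(x): return 2 * min(x, 1 - x)
def h3(x): return 4 * x * (1 - x)

for h in (h1, h2, h3):
    print(round(sum(h(F[i]) * p[i] for i in range(11)), 4))
# prints 0.4995, 0.3438, 0.4657, as in the example
```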
Proposition 4.1. Let A = (U, R, P) be a probabilistic approximation space; then E_U^R = E_∅^R = 0.
Proposition 4.2. Let A = (U, R, P) be a probabilistic approximation space and let X ⊆ U be a composed set, i.e., a union of elementary sets in A, namely X = {x ∈ U | [x]_R ⊆ X}; then E_X^R = 0.
Proposition 4.3. Let A1 = (U, R, P) and A2 = (U, S, P) be two probabilistic approximation spaces with the same universe and probability distribution but different equivalence relations. The elementary sets in A1 and A2 are

U/R = {X1, X2, ..., Xn},  U/S = {Y1, Y2, ..., Ym},

respectively. If S ≺ R, namely for any Xi ∈ U/R there exists a subset {Yj} of U/S such that Xi = ∪{Yj}, then for every Z ⊆ U we have (E_Z^S)_k ≤ (E_Z^R)_k, k = 1, 2, 3.
Proof. Because U/R = {X1, X2, ..., Xn} and U/S = {Y1, Y2, ..., Ym} are both partitions of U, and S ≺ R, we can partition {1, 2, ..., m} into a family of disjoint subsets {Ei}, Ei ∩ Ej = ∅ for i ≠ j, i, j = 1, 2, ..., n, such that Xi = ∪_{j∈Ei} Yj, i = 1, 2, ..., n.
(1) For h1(x) = −x log₂ x − (1 − x) log₂(1 − x), we have

(E_Z^R)_1 = E_1(F_Z^R) = Σ_{x∈U} h1(F_Z^R(x)) P(x) = Σ_{i=1}^{n} [−P(Z|Xi) log₂ P(Z|Xi) − P(Z^c|Xi) log₂ P(Z^c|Xi)] P(Xi),

(E_Z^S)_1 = Σ_{i=1}^{n} Σ_{j∈Ei} [−P(Z|Yj) log₂ P(Z|Yj) − P(Z^c|Yj) log₂ P(Z^c|Yj)] P(Yj).

By the Jensen inequality,

Σ_{j∈Ei} [P(Yj)/P(Xi)] P(Z|Yj) log₂ P(Z|Yj) ≥ [Σ_{j∈Ei} (P(Yj)/P(Xi)) P(Z|Yj)] log₂[Σ_{j∈Ei} (P(Yj)/P(Xi)) P(Z|Yj)] = P(Z|Xi) log₂ P(Z|Xi).

Therefore (E_Z^S)_1 ≤ (E_Z^R)_1.
(2) For h2(x) = 2[x ∧ (1 − x)],

(E_Z^S)_2 = E_2(F_Z^S) = Σ_{x∈U} 2[F_Z^S(x) ∧ (1 − F_Z^S(x))] P(x) = 2 Σ_{j=1}^{m} [P(Z|Yj) ∧ P(Z^c|Yj)] P(Yj)
= 2 Σ_{j=1}^{m} [P(Z ∩ Yj) ∧ P(Z^c ∩ Yj)] ≤ 2 Σ_{i=1}^{n} [Σ_{j∈Ei} P(Z ∩ Yj) ∧ Σ_{j∈Ei} P(Z^c ∩ Yj)]
= 2 Σ_{i=1}^{n} [P(Z ∩ Xi) ∧ P(Z^c ∩ Xi)] = (E_Z^R)_2.
(3) The proof of (E_Z^S)_3 ≤ (E_Z^R)_3 is similar to the proof of (1) and hence omitted.
Example 4.2. Suppose (U, R, P) and X are the same as in Example 4.1, and S is another equivalence relation, finer than R, whose equivalence classes are as follows: U/S = {{0, 5}, {8, 10}, {1, 6}, {2, 7}, {3, 4, 9}}. Similarly we get

F_X^S = (252/253, 1, 1, 21/34, 21/34, 252/253, 1, 1, 1/46, 21/34, 1/46).

For the entropy functions h1(x), h2(x), h3(x) given in (2)-(4), we have

(E_X^S)_1 = 0.3346; (E_X^S)_2 = 0.2578; (E_X^S)_3 = 0.3214.

Compared with Example 4.1, we see that (E_X^S)_k ≤ (E_X^R)_k, k = 1, 2, 3. If we consider the accuracy measure proposed by Pawlak [7,8] in Example 4.1 and Example 4.2, it is obvious that the accuracy measure of X is the same for R and S. In other words, the accuracy measure cannot distinguish R and S fairly well.
5
Fuzziness in a Rough Classification
Let C = {C1, C2, ..., Cr} be a classification of U. It is independent of the knowledge R; for instance, it may be given by an expert for a classification problem. A(C) = {A(C1), A(C2), ..., A(Cr)} and Ā(C) = {Ā(C1), Ā(C2), ..., Ā(Cr)} are called the lower and the upper approximation of the family C. It is easy to see that A(Ci) ⊆ Ci ⊆ Ā(Ci), i = 1, 2, ..., r. If A(Ci) = Ci = Ā(Ci), i = 1, 2, ..., r, then C is called a precise classification; otherwise C is called a rough classification.
The fuzziness of a rough classification C = {C1, C2, ..., Cr} can be measured by the fuzzy entropies E_{Ci}^R, i = 1, 2, ..., r, of the fuzzy sets F_{Ci}^R, i = 1, 2, ..., r. We specify the following four measures:

¹E_C^R = max_{1≤j≤r} E_{Cj}^R;  ²E_C^R = min_{1≤j≤r} E_{Cj}^R;  ³E_C^R = Σ_{j=1}^{r} E_{Cj}^R;  ⁴E_C^R = Σ_{j=1}^{r} P(Cj) E_{Cj}^R. (8)

Proposition 5.1. Let A = (U, R, P) be a probabilistic approximation space and let C = {C1, C2, ..., Cr} be a classification of U. If C is precise, then there is no fuzziness in C.
Proposition 5.2. Let A1 = (U, R, P) and A2 = (U, S, P) be two probabilistic approximation spaces with the same universe and probability distribution but different equivalence relations. If S ≺ R and C = {C1, C2, ..., Cr} is a classification of U, then for each of the entropy functions h1(x), h2(x), h3(x) provided in Section 2 we have qE_C^S ≤ qE_C^R, q = 1, 2, 3, 4.
Proof. From Proposition 4.3, (E_{Cj}^S)_k ≤ (E_{Cj}^R)_k, k = 1, 2, 3, j = 1, 2, ..., r; thus qE_C^S ≤ qE_C^R, q = 1, 2, 3, 4.
The fuzziness of a rough classification can be used to describe the inconsistency of a decision table. Let (U, C, D) denote a decision table, where C is a finite set of condition attributes and D is the set of decision attributes. In general, there is only one decision attribute in a decision table, so that D = {d} and D can simply be written as d. R_C and R_D are the indiscernibility relations of C and D, respectively. If R_C ≺ R_D, then the decision table is consistent; if R_C ⊀ R_D, then the decision table is inconsistent. We denote the fuzziness of the classification U/R_D simply by E_d^C = E_{U/R_D}^{R_C}. Obviously, E_d^C is well suited to deal with this inconsistency of a decision table: for a consistent decision table, from Proposition 5.1, this measure is 0.
Proposition 5.3. Let (U, C, d) be a decision table and B ⊆ C; then E_d^C ≤ E_d^B.
This proposition means that the bigger E_d^C is, the more inconsistency there is in the decision table.
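A small sketch of the four measures in (8), assuming the per-class entropies and class probabilities have already been computed (the values below are illustrative only):

```python
# Four classification-fuzziness measures from (8).
E = [0.12, 0.40, 0.05]    # assumed E_{Cj}^R for classes C1, C2, C3
P = [0.5, 0.3, 0.2]       # assumed class probabilities P(Cj)

m1 = max(E)                               # 1E_C^R
m2 = min(E)                               # 2E_C^R
m3 = sum(E)                               # 3E_C^R
m4 = sum(p * e for p, e in zip(P, E))     # 4E_C^R
print(m1, m2, m3, m4)
```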
References
[1] T. Beaubouef, F. E. Petry and G. Arora, Information-theoretic measures of uncertainty for rough sets and rough relational databases, Information Sciences 109 (1998) 185–195.
[2] K. Chakrabarty, R. Biswas and S. Nanda, Fuzziness in rough sets, Fuzzy Sets and Systems 110 (2000) 247–251.
[3] X. Chen, S. Zhu and Y. Ji, Entropy based uncertainty measures for classification rules with inconsistency tolerance, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 4, Oct 8–Oct 11, 2000.
[4] A. De Luca and S. Termini, A definition of non-probabilistic entropy in the setting of fuzzy sets theory, Information and Control 20 (1972) 301–312.
[5] D. Dubois and H. Prade, Rough fuzzy sets and fuzzy rough sets, International Journal of General Systems 17 (1990) 191–208.
[6] I. Düntsch and G. Gediga, Uncertainty measures of rough set prediction, Artificial Intelligence 106 (1998) 109–137.
[7] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356.
[8] Z. Pawlak, Rough sets: Theoretical aspects of reasoning about data, Kluwer Academic Publishers, Dordrecht, 1991.
[9] Z. Pawlak and A. Skowron, Rough membership functions, in: L. A. Zadeh and J. Kacprzyk (eds.), Fuzzy Logic for the Management of Uncertainty, Wiley, New York, 1994, 251–271.
[10] Z. Pawlak, S. K. M. Wong and W. Ziarko, Rough sets: probabilistic versus deterministic approach, International Journal of Man-Machine Studies 29 (1988) 81–95.
[11] M. J. Wierman, Measuring uncertainty in rough set theory, International Journal of General Systems 28 (1999) 283–297.
[12] S. K. M. Wong and W. Ziarko, Comparison of the probabilistic approximate classification and the fuzzy set model, Fuzzy Sets and Systems 21 (1987) 357–362.
[13] R. R. Yager, On measures of fuzziness and negation, Part I: membership in the unit interval, International Journal of General Systems 5 (1979) 221–229.
[14] Y. Y. Yao, A comparative study of fuzzy sets and rough sets, Information Sciences 109 (1998) 227–242.
[15] Y. Y. Yao, Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences 111 (1998) 239–259.
[16] L. A. Zadeh, Fuzzy sets, Information and Control 8 (1965) 338–353.
[17] W. Ziarko, Variable precision rough set model, Journal of Computer and System Sciences 46 (1993) 39–59.
A View on Rough Set Concept Approximations Jan Bazan1 , Nguyen Hung Son2 , Andrzej Skowron2 , and Marcin Szczuka2 1
Institute of Mathematics, University of Rzeszów, Rejtana 16A, 35-959 Rzeszów, Poland
2 Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
{bazan,son,skowron,szczuka}@mimuw.edu.pl
Abstract. The concept of approximation is one of the most fundamental in rough set theory. In this work we examine this basic notion as well as its extensions and modifications. The goal is to construct a parameterised approximation mechanism making it possible to develop multistage multi-level concept hierarchies that are capable of maintaining acceptable level of imprecision from input to output.
1
Introduction
The notion of concept approximation is a focal point of many approaches to data analysis based on rough set theory. The original concepts of indiscernibility and approximation, as introduced by Pawlak in [5], are meant to provide a way of dealing with inconsistency and incompleteness in data. Elegantly and simply devised, the rough set approximations proved to be a useful tool in supporting data-related tasks such as classification, decision making, and description. In the majority of rough set applications the approximations are used only at some initial stage of inductive learning. However, in most cases the final system is based on an extended representation. The majority of existing solutions (see [2,3,15]) make use of decision rules derived from data, accompanied by a recipe for their usage. We show that approximations of concepts retrieved from such a system can be enriched by using patterns defined by rule based classifiers or other classifiers. Such an approach is much more flexible and better suited for the multi-stage reasoning schemes (see [11]) that are discussed further in the paper. The importance of proper approximation choice becomes much more crucial when it comes to the construction of compound decision support and classification systems. These systems create higher level concepts using as building blocks not only attribute values but also previously derived, more primitive concepts. The lower level concepts may be created with the use of approximate techniques as well. Therefore, higher level concepts incorporate not only the imprecision resulting from the way they are being constructed but also the imprecision inherited from the lower level concepts. The crucial point is to assure a parameterised space of possible approximations. Then, by proper tuning of parameters, we may control the proliferation of imprecision. In rough set terms this corresponds to setting the size and shape of the boundary region.
To illustrate the problems that require multi-layered approximation schemes and compound concept approximation, let us consider two examples, related to RoboCup [14] and the WITAS project [11,17]. RoboCup [14] is an initiative aimed at fostering research in the field of cooperative autonomous robotics by the construction of programs for the control of robots which play football. The aim of the WITAS project is to construct an unmanned aerial vehicle (an autonomous helicopter) capable of recognising the road situation underneath and taking appropriate action. The aircraft is equipped with several sensors, most importantly a video camera. In both examples, the decision making process should take into account not only simple observations, but also higher level concepts such as the situation of other objects on the field. The paper presents the topics sketched above in the following way. First, the concept of rough set approximation is introduced and discussed. After that, we present and discuss the approximations induced by rule sets and the concept assessment schemes accompanying them. Finally, we present how the proposed approach deals with the issue of construction of compound concept hierarchies while preserving approximation quality. We also make connections with the concept of parameterised approximation space [11] and ideas connected to the approach known as rough mereology [8,9].
2
Rough Set Preliminaries
2.1
Information Systems
An information system [5] is a pair S = (U, A), where U is a non-empty, finite set of objects and A is a non-empty, finite set of attributes. Each a ∈ A corresponds to a function a : U → Va, where Va is called the value set of a. With any non-empty set of attributes B ⊆ A we associate the B-information signature of an object x ∈ U, inf_B(x) = {(a, a(x)) : a ∈ B}. The set {inf_A(x) : x ∈ U} is called the A-information set and is denoted by INF(S). In supervised learning problems, objects from the training set are pre-classified into several categories or classes. To deal with this type of data we use decision systems of the form S = (U, A, dec), where dec ∉ A is a distinguished attribute called the decision, and elements of the attribute set A are called conditions. In practice, decision systems contain descriptions of a finite sample U of objects from a larger (possibly infinite) universe U. Conditions are attributes whose values are known for all objects from U, while the decision is a function defined on the objects from the sample U only. Usually the decision attribute is a characteristic function of an unknown concept, or of several concepts, on a sample of objects. Without loss of generality one can assume that the domain V_dec of the decision dec is equal to {1, ..., d}. The decision dec determines a partition {CLASS_1, ..., CLASS_d} of the universe U, where CLASS_k = {x ∈ U : dec(x) = k} is called the k-th decision class of S for 1 ≤ k ≤ d. By the class distribution of any set X ⊆ U we denote the vector ClassDist(X) = ⟨n_1, ..., n_d⟩, where n_k = card(X ∩ CLASS_k) is the number of objects from X belonging to the k-th decision class.
2.2
Concept Approximation
In many real life situations we are not able to give an exact definition of a concept. Such uncertain situations are caused either by the lack of information about the concept or by the richness of natural language. There are different approaches to dealing with uncertain and vague concepts, like multi-valued logics, fuzzy set theory, and rough set theory. Using these approaches, concepts are defined by a “multi-value membership function” instead of the classical “binary (crisp) membership relation” (set characteristic function). In particular, what we want to underline in this paper is that the rough set approach offers a way of establishing membership functions that is data-grounded and significantly different from others. In rough set methods, it is assumed that there exists a concept X defined over a huge universe U of objects (X ⊆ U). The problem is to find a description of the concept X which can be expressed in a predefined descriptive language, i.e., a set of formulas that are interpretable as subsets of U. In general, the problem is to find a description of a concept X in a language L2 (e.g., consisting of boolean formulas defined over a subset of attributes) assuming the concept is definable in another language L1 (e.g., natural language, or defined by a set of attributes). Usually, the concept X is specified partially, i.e., the value of the characteristic function of X is given only on a small subset U ⊆ U called the training sample. Such information makes it possible to search for patterns in a given language defining, on the training sample, sets included (or sufficiently included) in the given concept. Observe that the approximations of a concept cannot be defined uniquely from a given sample of objects. The approximations of the whole concept X are obtained by induction from the given information on a sample U of objects (containing positive examples X ∩ U and negative examples U \ X). Hence, the quality of such approximations should be verified on new testing objects. Thus we propose to search for concept approximations gradually. Parameterised patterns defined by rough membership functions related to classifiers help to discover relevant patterns on the object universe extended by adding new testing objects. In the paper we present illustrative examples of such parameterised patterns. By tuning the parameters of such patterns one can obtain patterns relevant for concept approximation of the training sample extended by testing objects from U*, where U ⊆ U* ⊆ U. Due to bounds on the expressiveness of the language L in the universe U, we are forced to find an approximate rather than an exact description of a given concept. The rough set methodology for approximation of a concept X ⊆ U, assuming X and U − X are known only on a sample U ⊆ U, can be based on finding pairs P = (L, U) of object sets satisfying the following conditions:
1. L, U, U \ L, U \ U are subsets of U expressible in the language L;
2. L ∩ U ⊆ X ∩ U ⊆ U ∩ U;
3. the set L (U) is maximal (minimal) in the family of sets definable in L satisfying (2).
The sets L and U are called the lower approximation and the upper approximation of the concept X ⊆ U (generated by its sample on U), respectively. The set
BN = U \ L is called the boundary region of the approximation of X. The set X is called rough with respect to its approximations (L, U) if L ≠ U; otherwise X is called crisp in U. In practical applications the last constraint in the above definition can be hard to satisfy. Hence, by using some heuristics, sub-optimal instead of maximal or minimal sets are constructed. The rough approximation of a concept can also be defined by means of a rough membership function. A function f : U → [0, 1] is called a rough membership function of the concept X ⊆ U approximated by (L, U) (assuming X and U − X are known only on a sample U ⊆ U) if and only if L = L_f = {x ∈ U : f(x) = 1} and U = U_f = {x ∈ U : f(x) > 0}. Rough set approximations [5,6] are fundamental and widely used in many reasoning methods under uncertainty (caused, e.g., by the lack of some attributes). For a given information system S = (U, A) and an attribute set B ⊆ A, one can define the B-indiscernibility relation IND(B) by IND(B) = {(x, y) ∈ U × U : inf_B(x) = inf_B(y)}. Its equivalence classes are defined by [x]_{IND(B)} = {u ∈ U : (x, u) ∈ IND(B)} for any object x ∈ U. The problem is to define a concept X ⊆ U, assuming that only some attributes from B ⊆ A are given. This problem is often specified by a decision system S1 = (U, B, dec_X), where dec_X(u) = 1 for u ∈ X, and dec_X(u) = 0 for u ∉ X. Attributes from B determine the rough membership function µ_X^B : U → [0, 1] for the concept X by

µ_X^B(x) = card(X ∩ [x]_{IND(B)}) / card([x]_{IND(B)}).

This function, according to the rough membership function definition, yields rough approximations of the concept X by using indiscernibility classes:

L_B(X) = {x ∈ U : µ_X^B(x) = 1} = {x ∈ U : [x]_{IND(B)} ⊆ X},
U_B(X) = {x ∈ U : µ_X^B(x) > 0} = {x ∈ U : [x]_{IND(B)} ∩ X ≠ ∅},

called the B-lower and the B-upper approximation of X in S, respectively. The set BN_B(X) = U_B(X) \ L_B(X) is called the B-boundary region of the concept X. Observe that in such a definition of approximation we assume the closed world assumption, i.e., the concept approximation problem is related to objects from the information system S only. In inductive learning it is necessary to extend the rough set based approximations to objects outside U. Unfortunately, the lack of generalisation in the process of attribute-based approximation implies that there may exist objects x ∈ U \ U satisfying [x]_{IND(B)} ∩ U = ∅. Hence, for such objects we are unable to make any decision. In the following sections we discuss some extensions of approximations to supersets of U which are less sensitive to the above mentioned problems.
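A minimal sketch of µ_X^B and the B-approximations, assuming objects are dicts with an "id" field and B is a list of attribute names (the representation is illustrative):

```python
# Rough membership and B-lower/upper approximations over a finite table.
def ind_class(U, B, x):
    # [x]_IND(B): objects indiscernible from x on the attributes in B
    sig = tuple(x[a] for a in B)
    return [y for y in U if tuple(y[a] for a in B) == sig]

def mu(U, B, X, x):
    # mu_X^B(x) = card(X intersect [x]) / card([x])
    cls = ind_class(U, B, x)
    return sum(1 for y in cls if y["id"] in X) / len(cls)

def approximations(U, B, X):
    lower = {x["id"] for x in U if mu(U, B, X, x) == 1.0}
    upper = {x["id"] for x in U if mu(U, B, X, x) > 0.0}
    return lower, upper, upper - lower   # boundary region BN_B(X)
```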
3
Case-Based Approximations
In case-based reasoning methods, like the k-NN (k-Nearest-Neighbour) method, it is necessary to define a distance function between objects, δ : U × U → R+. The problem of searching for a relevant distance function for a given data set is
not trivial, but at this point let us assume that such a function has already been defined. In the k-NN classification method, the decision for a new object x ∈ U \ U is made on the basis of the set NN(x; k) := {x1, ..., xk} ⊆ U of the k objects from U which are nearest to x with respect to the distance function δ. Usually k is a parameter which can be set by an expert or constructed from experimental data. The k-NN classifiers often use a voting algorithm for decision making, i.e., the decision value for a new object x can be predicted by dec(x) = Voting(⟨n1, ..., nd⟩), where ClassDist(NN(x; k)) = ⟨n1, ..., nd⟩ is the class distribution of the set NN(x; k) satisfying n1 + ... + nd = k. The voting function can return the most frequent decision value occurring in NN(x; k). In the case of imbalanced data, the vector ⟨n1, ..., nd⟩ can be scaled w.r.t. the global class distribution first, and after that the voting algorithm can be employed. Now we are going to present the rough approximation based on the sets NN(x; k). Let us define a family of functions

µ_{CLASSi}^{t1,t2}(x) = 1 if ni ≥ t2,
µ_{CLASSi}^{t1,t2}(x) = (ni − t1)/(t2 − t1) if ni ∈ (t1, t2),
µ_{CLASSi}^{t1,t2}(x) = 0 if ni ≤ t1,

where t1 < t2 < k and ni is the i-th coordinate in the class distribution of NN(x; k). Any such function defines patterns described by means of formulas of the form µ_{CLASSi}^{t1,t2}(x) ◦ c, where ◦ ∈ {=, ≥, ≤, <, >} and c ∈ {0, 1, (ni − t1)/(t2 − t1)}. One can tune the parameters of such formulas to obtain new relevant patterns for the concept approximation on the considered extension of the universe by testing objects.
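The parameterised membership above translates directly into code. A minimal sketch, assuming each training object carries an attribute vector and a decision label, and that a distance function delta is supplied (the data layout is an assumption):

```python
# k-NN based rough membership mu^{t1,t2}_{CLASSi}.
def knn_membership(x, train, delta, k, i, t1, t2):
    nn = sorted(train, key=lambda z: delta(x, z["attrs"]))[:k]  # NN(x; k)
    n_i = sum(1 for z in nn if z["dec"] == i)  # i-th coordinate of ClassDist
    if n_i >= t2:
        return 1.0
    if n_i <= t1:
        return 0.0
    return (n_i - t1) / (t2 - t1)              # graded zone between t1 and t2
```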
4 Rule-Based Approximations
Let S = (U, A, dec) be a decision system. Any implication of the form (a_{i1} = v₁) ∧ ... ∧ (a_{im} = v_m) ⇒ (dec = k), where a_{ij} ∈ A and v_j ∈ V_{a_{ij}}, is called a decision rule for the k-th decision class. Any decision rule r of the above form can be characterised by the following parameters:
– length(r) = the number of descriptors in the premise of r;
– [r] = the carrier of r, i.e., the set of objects satisfying the premise of r;
– support(r) = the number of objects satisfying the premise of r;
– confidence(r) = the measure of truth of the decision rule, $confidence(r) = \frac{card([r] \cap CLASS_k)}{card([r])}$.
In data mining we are interested in searching for short, strong decision rules with high confidence. The linguistic features "short", "strong" and "high confidence" of decision rules can be formulated by means of their length, support and confidence. Many decision rule generation methods have been developed using rough set theory (see, e.g., [1,2,10,15]). Rule-based classification methods work in three phases:
1. Learning phase: generates a set of decision rules RULES(S) (satisfying some predefined conditions) from a given decision system S.
2. Rule selection phase: selects from RULES(S) the set of rules that are supported by x, where x ∈ 𝒰 is a testing object. We denote this set by MatchRules(S, x).
3. Decision making phase: makes a decision for x using some voting algorithm over the decision rules from MatchRules(S, x).
A rule-based classifier works as follows. Suppose we would like to decide whether a given object x ∈ 𝒰 belongs to the i-th decision class. Let MatchRules(S, x) = R_yes ∪ R_no, where R_yes is the set of all decision rules for the i-th class matched by x, and R_no is the set of decision rules for other classes matched by x. We assign two real values w_yes, w_no defined by
$$w_{yes} = \sum_{r \in R_{yes}} strength(r) \qquad\qquad w_{no} = \sum_{r \in R_{no}} strength(r)$$
where w_yes, w_no are called the "for" and "against" weights of the object x, and strength(r) is a normalised function depending on length(r), support(r), confidence(r) and some global information about the decision system S, such as its size and global class distribution (see [1]). Using some relationship between w_yes and w_no, the classifier predicts the decision. Notice that in such an approach any classifier can be identified with a membership function. We can define rule-based classifiers by a parameterised function µ_{CLASS_k}(x) of the following form:
IF max(w_yes, w_no) < ω THEN µ_{CLASS_k}(x) is undefined
ELSE
$$\mu_{CLASS_k}(x) = \begin{cases} 1 & \text{if } w_{yes} - w_{no} \geq \theta \\ \frac{\theta + (w_{yes} - w_{no})}{2\theta} & \text{if } |w_{yes} - w_{no}| < \theta \\ 0 & \text{if } w_{yes} - w_{no} \leq -\theta \end{cases}$$
where ω, θ are parameters that allow one to search for new relevant patterns (pieces of concept description) for the concept approximation (on the extension of the initial training sample by testing objects).
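A minimal Python sketch (ours, not the RSES implementation) of this parameterised membership function follows; None models the undefined case, i.e. objects outside the function's domain.

```python
def mu_class_k(w_yes, w_no, omega, theta):
    """Membership for class k from aggregated 'for'/'against' rule weights.
    Returns None when the evidence is too weak (outside the domain)."""
    if max(w_yes, w_no) < omega:
        return None                      # undefined: treated as boundary
    d = w_yes - w_no
    if d >= theta:
        return 1.0
    if d <= -theta:
        return 0.0
    return (theta + d) / (2 * theta)     # linear ramp inside the boundary

# Toy usage: summed strengths of the matched rules for/against class k.
w_yes, w_no = sum([0.5, 0.3]), sum([0.2])
print(mu_class_k(w_yes, w_no, omega=0.1, theta=1.0))   # 0.8
```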
5 Approximations of Compound Objects
As mentioned before, here we are concerned not only with approximation of concepts described by simple attributes but also with higher level concepts established from already existing ones. The idea is to use approximations in a way that gives us the ability to control the level of approximation quality. To simplify notation, let us assume that we have two concepts C1 and C2 that are given by means of rule-based approximations derived from decision systems S_{C1} = (U, A_{C1}, dec_{C1}) and S_{C2} = (U, A_{C2}, dec_{C2}). Hence, we are given two sets of patterns for the approximation of concepts C1 and C2 (see Section 4). They can be obtained by tuning the parameters $\{w_{yes}^{C_1}, w_{no}^{C_1}, \omega^{C_1}, \theta^{C_1}\}$ and $\{w_{yes}^{C_2}, w_{no}^{C_2}, \omega^{C_2}, \theta^{C_2}\}$ discussed previously. We want to establish a relevant set of patterns and parameters $\{w_{yes}^{C}, w_{no}^{C}, \omega^{C}, \theta^{C}\}$ for the target concept C.
The issue is to define a decision system from which we can derive rules determining approximations of C. Let us recall that both the simpler concepts C1, C2 and the target concept C are defined over the same universe 𝒰 and are specified on a sample U ⊆ 𝒰. To complete the construction of S_C = (U, A_C, dec_C) we need to specify A_C ∪ {dec_C}. The decision attribute is known for an arbitrary object x ∈ U, and the conditional attributes in (the simplest case of) our proposal are either the rough memberships for the simpler concepts (A_C = {µ_{C1}(x), µ_{C2}(x)}) or the weights for the simpler concepts ($A_C = \{w_{yes}^{C_1}, w_{no}^{C_1}, w_{yes}^{C_2}, w_{no}^{C_2}\}$). In the former case we concentrate on the degree of inclusion, while in the latter case we take into account the relationships with positive and negative information generated during object classification. The rules used for the construction of C make use of attributes that are in fact classifiers themselves. Therefore, it is worthwhile to stratify and interpret the attribute domains for attributes in A_C. Instead of using just a value of a membership function, a weight, or their combination, we would prefer to use linguistic statements such as "the likeliness of the occurrence of C1 is low". Hence, we have to map the attribute value sets onto some limited family of subsets. It is quite natural to introduce (linearly ordered) ranges of values, e.g., {negative, low, medium, high, positive}, which yields a fuzzy-like layout of attributes. One may also consider the case when these subsets overlap; then there may be more linguistic values related to the attributes. Stratification of attribute values and introduction of linguistic variables attached to the strata provide a way of representing knowledge in a more human-readable format, since for a new object x* ∈ 𝒰 \ U to be classified we may use rules like: if the compliance of x* with C1 is high or medium and the compliance of x* with C2 is high, then x* ∈ C. Another advantage of imposing this division of attribute value sets lies in the extended control over the flexibility and validity of the system constructed in this way. We gain the ability to make the system more stable and inductively correct, and we control the general layout of the boundary regions that contribute to the construction of the target concept. The process of setting the intervals for attribute values may be performed by hand or with the use of automated methods for interval construction, e.g., clustering, template analysis, and discretisation. For a discussion of this approach, related to rough neurocomputing and computing with words, see [11].
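The stratification step can be sketched as follows; only the label set comes from the text, while the cut-points and function names are purely illustrative assumptions of ours.

```python
def linguistic_label(mu, cuts=(0.05, 0.35, 0.65, 0.95)):
    """Map a membership/weight value in [0, 1] onto ordered strata."""
    labels = ["negative", "low", "medium", "high", "positive"]
    for cut, label in zip(cuts, labels):
        if mu < cut:
            return label
    return labels[-1]

# A human-readable rule in the spirit of the example in the text:
def in_target_concept(mu_c1, mu_c2):
    return (linguistic_label(mu_c1) in ("high", "medium")
            and linguistic_label(mu_c2) == "high")

print(linguistic_label(0.72), in_target_concept(0.72, 0.9))   # high True
```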
6 Conclusions and Further Directions
In this paper we have presented a collection of basic ideas that redefine the view on the approximation of concepts in the rough set framework. This is very initial work and there is still much to be done. Some of the techniques mentioned in the paper are already implemented and may be tested (see, e.g., [15]). Still, there are many more issues that accompany the task of proper approximation construction. One of them is related to incremental construction (incremental learning) of approximations. It is a big challenge to devise a method capable of learning an approximation from the sample U
and then, given an enriched (finite) sample U* ⊃ U, extending the approximation in a few simple steps, without fundamental reconstruction. Another interesting topic is the investigation of partially defined rough membership functions. In this paper, the set of objects not belonging to the domain of the rough membership function was treated as a part of the boundary region (see Section 4). This solution was dictated by the fact that rough set methods are based on 3-valued logic; we are convinced that it can be improved by introducing a 4-valued logic [16]. We will dwell on this idea in forthcoming papers. Acknowledgements. This work was supported by KBN grant 8T11C02519 and by the Wallenberg Foundation grant within the framework of the WITAS project.
References
1. Bazan J., A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: [7], pp. 321–365.
2. Grzymala-Busse J., A new version of the rule induction system LERS. Fundamenta Informaticae, Vol. 31(1), 1997, pp. 27–39.
3. Øhrn A., Komorowski J., Skowron A., Synak P., The ROSETTA software system. In: [7], pp. 572–576.
4. Nguyen H.S., Skowron A., Szczuka M., Situation identification by unmanned aerial vehicle. Proceedings of CS&P 2000, Informatik Berichte, Humboldt-Universität zu Berlin, Berlin, 2000, pp. 177–188.
5. Pawlak Z., Rough sets: Theoretical aspects of reasoning about data. Kluwer, Dordrecht, 1991.
6. Pawlak Z., Skowron A., Rough membership functions. In: Yager R., Fedrizzi M., Kacprzyk J. (eds.), Advances in the Dempster–Shafer Theory of Evidence, Wiley, New York, 1994, pp. 251–271.
7. Polkowski L., Skowron A. (eds.), Rough Sets in Knowledge Discovery, vol. 1–2, Physica-Verlag, Heidelberg, 1998.
8. Polkowski L., Skowron A., Rough mereology: A new paradigm for approximate reasoning. Int. Journal of Approximate Reasoning, vol. 15(4), 1996, pp. 333–365.
9. Polkowski L., Skowron A., Towards an adaptive calculus of granules. In: Zadeh L.A., Kacprzyk J. (eds.), Computing with Words in Information/Intelligent Systems, vol. 1, Physica-Verlag, Heidelberg, 1999, pp. 201–228.
10. Skowron A., Rauszer C., The discernibility matrices and functions in information systems. In: Słowiński R. (ed.), Intelligent Decision Support – Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 311–362.
11. Skowron A., Szczuka M., Approximate reasoning schemes: Classifiers for computing with words. In: Proceedings of SMPS 2002, Advances in Soft Computing series, Physica-Verlag, Heidelberg, 2002, pp. 338–345.
12. Stefanowski J., On rough set based approaches to induction of decision rules. In: [7], pp. 500–529.
13. Ziarko W., Rough sets as a methodology in data mining. In: [7], pp. 554–576.
14. The RoboCup Homepage – www.robocup.org
15. The RSES Homepage – logic.mimuw.edu.pl/~rses
16. Vitória A., Maluszyński J., A logic programming framework for rough sets. LNAI Vol. 2475, Springer-Verlag, Heidelberg, 2002, pp. 205–212.
17. The WITAS Project Homepage – www.ida.liu.se/ext/witas/
Evaluation of Probabilistic Decision Tables
Wojciech Ziarko
Department of Computer Science, University of Regina, Regina, SK, S4S 0A2, Canada
(Supported by a research grant from the Natural Sciences and Engineering Research Council of Canada.)
Abstract. The article presents the basic notions of the variable precision rough set model (VPRSM). The main subject of the article is the evaluation of VPRSM set approximations and corresponding probabilistic decision tables using a number of proposed probabilistic measures.
1 Introduction
In practical applications of rough set theory, the common objective is to analyze and optimize decision tables derived from data (see, for example, [3], [2], [8]). The data-acquired decision tables represent the classification of the domain of interest and the relationships occurring in the domain. Since there is a practically unlimited number of ways decision tables can be extracted from data, an important issue is their evaluation. The problem is particularly difficult in the case of probabilistic decision tables [8], derived in the framework of the variable precision rough set model (VPRSM) [7], [6] or the Bayesian rough set model [4], where standard rough set quality measures [2] are not directly applicable. To deal with this problem, we present a comprehensive set of probabilistic measures aimed at the evaluation of probabilistic decision tables.
2 Variable Precision Rough Set Model
Let U denote a universe of objects referred to as elementary events and let s(U) be the σ-algebra of measurable subsets of U referred to as random events. We will assume that there is a random process generating new objects e belonging to the universe U. For each new object e, we will say that the event X ∈ s(U) occurred if the object e ∈ X. In addition, we will assume the existence of the prior probability function P(X) on subsets X belonging to s(U). We will also assume that all subsets under consideration in this article are members of the family of sets s(U) and that all of them are likely to occur, that is, P(X) > 0, and that their occurrence is not certain, that is, P(X) < 1. To define the structure of the rough approximation space, we will denote by R an equivalence relation on U with a finite number of equivalence classes (elementary sets) E₁, E₂, ..., E_n such that P(E_i) > 0 for all 1 ≤ i ≤ n. Each elementary set E
is assigned a measure of overlap with the set X by the conditional probability function defined as P(X|E) = P(X ∩ E)/P(E). The values of the conditional probability function are normally estimated from sample data by taking the ratio P(X|E) = card(X ∩ E)/card(E). The asymmetric VPRSM generalization of the original rough set model is based on the values of the probability function P and two lower and upper limit certainty threshold parameters l and u such that 0 ≤ l < P(X) < u ≤ 1. The VPRSM is symmetric if l = 1 − u. Traditionally [7], in this special case the symbol β, with 0 < P(X) < β ≤ 1, is used instead of the symbol u to denote the model's upper threshold parameter. The problem of the parameters' selection is a subject of ongoing research [1].
2.1 Rough Approximation Regions
In this section the VPRSM approximation regions are briefly defined and compared to the approximation regions introduced in the original Pawlak approach.
Positive Region: The parameter u defines the u-positive region, or u-lower approximation, of the set X in the VPRSM. The value of u reflects the least acceptable degree of the conditional probability P(X|E_i) to include the elementary set E_i in the positive region, or u-lower approximation, of the set X. It can be perceived as a quality threshold of the probabilistic information associated with elementary sets E_i: only those elementary sets with sufficiently high information quality are to be included in the positive region. The u-positive region of the set (event) X is defined as POS_u(X) = ∪{E_i : P(X|E_i) ≥ u}. Intuitively, u represents the desired level of improved prediction accuracy when predicting the occurrence of the event X based on the information that the event E actually occurred. The lower approximation represents an area of the universe where the likelihood of X's occurrence is high, where the subjective term "high" is reflected by the given value of the parameter u. Since the prior probability of X's occurrence in the absence of any additional information is P(X), the improvement of prediction accuracy is possible only if P(X) < u ≤ 1. We note that the lower approximation defined in the Pawlak model as POS(X) = ∪{E_i : E_i ⊆ X} corresponds to POS₁(X) of the VPRSM, and that POS(X) ⊆ POS_u(X) for all u such that P(X) < u ≤ 1.
Negative Region: The l-negative region NEG_l(X) of the set X in the VPRSM approach represents an area of the universe where the occurrence of the set X is significantly less likely than the random guess (prior) probability P(X). The threshold value for the probability of X's occurrence is represented by the lower limit parameter l. A decrease in the probability of X's occurrence relative to the prior probability is possible only if 0 ≤ l < P(X). Consequently, NEG_l(X) = ∪{E_i : P(X|E_i) ≤ l}. Alternatively, the l-negative region can be perceived as the (1 − l)-positive region of the complement of the set X, the set ¬X = U − X. That is, NEG_l(X) = ∪{E_i : P(¬X|E_i) ≥ 1 − l} = POS_{1−l}(¬X). In other words, NEG_l(X) includes all those elementary sets E_i that are associated with a sufficiently high probability to predict that X will not occur. The Pawlak notion of negative region, defined as NEG_ORG(X) = ∪{E_i : X ∩ E_i = ∅}, can be
reformulated in probabilistic terms as NEG_ORG(X) = ∪{E_i : P(X|E_i) = 0} = NEG₀(X). Clearly, NEG₀(X) ⊆ NEG_l(X) for all 0 ≤ l < P(X).
Boundary Region: The boundary region BND_{l,u}(X) in the VPRSM covers an area of the universe where the prediction of the occurrence of the event (set) X or its complement ¬X is not possible with sufficient likelihood to qualify either for the positive region of X or for the positive region of the complement ¬X (that is, the negative region of X). This can be expressed formally as BND_{l,u}(X) = ∪{E_i : l < P(X|E_i) < u}. The boundary region of the Pawlak model can be defined, using the same notation, as BND_ORG(X) = ∪{E_i : 0 < P(X|E_i) < 1} = BND_{0,1}(X). We note that the VPRSM boundary area is, in general, narrower than the original rough set boundary area, since for all permissible l, u values and sets X we have BND_{l,u}(X) ⊆ BND_ORG(X).
Upper Approximation: The upper approximation UPP_ORG(X) of the set X in the original rough set model isolates an area of the universe where the occurrence of the target set (event) X is possible, that is, where it could be predicted with any probability greater than zero. This is formulated in the definition UPP_ORG(X) = ∪{E_i : X ∩ E_i ≠ ∅}, which in probabilistic terms can be re-expressed as UPP_ORG(X) = ∪{E_i : P(X|E_i) > 0} = POS_ORG(X) ∪ BND_ORG(X). The natural generalization of this definition in the framework of the VPRSM is to define the upper approximation as a union of positive and boundary regions. This leads to the following generalized definition of the l-upper approximation: UPP_l(X) = ∪{E_i : P(X|E_i) > l} = POS_u(X) ∪ BND_{l,u}(X). We may notice that in the original rough set model we always have P(UPP(X)) > 0. This is also the case in the VPRSM, since it can be shown that for any 0 ≤ l < P(X) we have P(UPP_l(X)) > 0.
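The three VPRSM regions can be computed directly from the conditional probabilities P(X|E_i), as in this short sketch of ours (the probabilities below anticipate Table 1 in Section 4):

```python
def vprsm_regions(cond_prob, l, u):
    """Split elementary-set indices into POS_u, NEG_l and BND_{l,u}.
    Requires 0 <= l < u <= 1 (with l < P(X) < u in applications)."""
    pos = {i for i, p in cond_prob.items() if p >= u}
    neg = {i for i, p in cond_prob.items() if p <= l}
    bnd = {i for i, p in cond_prob.items() if l < p < u}
    return pos, neg, bnd

# Toy usage with l = 0.1 and u = 0.8.
P_X_given_E = {1: 0.0, 2: 0.75, 3: 0.0, 4: 1.0,
               5: 0.82, 6: 0.12, 7: 0.92, 8: 0.91}
pos, neg, bnd = vprsm_regions(P_X_given_E, l=0.1, u=0.8)
print(pos, neg, bnd)   # {4, 5, 7, 8} {1, 3} {2, 6}
```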
3 Model Quality Measures
In practical applications of the VPRSM to data mining, machine learning and data-based modelling problems, it is essential to evaluate the quality of the available information and of the derived decision table models using appropriate measures. The measures can be divided into two categories: local measures, associated with individual elementary sets, and global measures, associated with rough approximation regions.
3.1 Local Measures
Certainty Measure: The single most important measure is the certainty measure, defined as the conditional probability of the event X conditioned on the occurrence of the elementary set E: P(X|E) = P(X ∩ E)/P(E). This measure was originally introduced in the context of rough set theory in [5]. All other measures presented here are derivatives of the certainty measure.
Coverage Measure: The coverage measure is defined as the conditional probability of the elementary set E conditioned on the occurrence of the target set X: P(E|X) = P(X ∩ E)/P(X).
Certainty Gain Measure: To evaluate the degree of increase (or decrease) of the predictive probability relative to the random guess (prior) probability P(X) of the target event X, based on the conditional probability information associated with an elementary set E, the certainty gain measure G(X|E) is introduced as
$$G(X|E) = \frac{P(X|E) - P(X)}{P(X)} \;\text{ if } P(X|E) > P(X), \qquad G(X|E) = \frac{P(X) - P(X|E)}{1 - P(X)} \;\text{ if } P(X|E) < P(X).$$
One can notice that the certainty gain value in the original rough set model is given by $G(X|E) = \frac{1}{P(X)} - 1 > 0$ for all elementary sets E belonging to the positive region, and $G(X|E) = \frac{1}{1 - P(X)} - 1 > 0$ for all elementary sets E belonging to the negative region of the set X. In the VPRSM approach, the certainty gain value is positive in both the positive and negative regions. In addition, it satisfies the following relations: $G(X|E) \geq \frac{u}{P(X)} - 1 > 0$, where P(X) < u ≤ 1, for all elementary sets E belonging to the lower approximation of the set X, and $G(X|E) \geq \frac{1-l}{1-P(X)} - 1 > 0$, where 0 ≤ l < P(X), for all elementary sets E belonging to the negative region of the set X.
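The certainty gain as reconstructed above, in a small Python sketch of ours; the two printed values reproduce entries of Table 1 in Section 4 (where P(X) = 0.5248):

```python
def certainty_gain(p_x, p_x_given_e):
    """Gain of P(X|E) over the prior P(X); positive in both the positive
    and negative VPRSM regions, zero when E carries no information."""
    if p_x_given_e > p_x:
        return (p_x_given_e - p_x) / p_x
    if p_x_given_e < p_x:
        return (p_x - p_x_given_e) / (1 - p_x)
    return 0.0

print(round(certainty_gain(0.5248, 0.0), 4))    # 1.1044
print(round(certainty_gain(0.5248, 0.75), 4))   # 0.4291
```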
3.2 Global Measures
Probabilities of Approximation Regions: The probability P(UPP_l(X)) of the upper approximation UPP_l(X) of the set X represents the relative size of this region. It was earlier introduced in the framework of the original rough set theory as coverage. It can be computed from the probabilities of the elementary sets included in UPP_l(X) by $P(UPP_l(X)) = \sum_{E \subseteq UPP_l(X)} P(E)$. Similarly, the probabilities of the u-positive, (l,u)-boundary and l-negative regions can be computed respectively by $P(POS_u(X)) = \sum_{E \subseteq POS_u(X)} P(E)$ for the positive region, $P(BND_{l,u}(X)) = \sum_{E \subseteq BND_{l,u}(X)} P(E)$ for the boundary region, and $P(NEG_l(X)) = \sum_{E \subseteq NEG_l(X)} P(E)$ for the negative region. The sum γ_{l,u}(X) = P(POS_u(X)) + P(NEG_l(X)) is a dependency measure reflecting the relative size of the positive and negative regions in the universe. It can be used as a measure of accuracy of the approximate representation of the set X. This measure is equivalent to P(BND_{l,u}(X)), since P(BND_{l,u}(X)) = 1 − γ_{l,u}(X).
Accuracy and Roughness of Approximation: As in the Pawlak definition [2], the accuracy ACC_{l,u}(X) of the approximation of a set X can be expressed as a ratio of the "sizes" of the lower and upper approximations of the set. In the VPRSM approach, the cardinality of a set is replaced with the probability measure, leading to the definition of the accuracy function as $ACC_{l,u}(X) = \frac{P(POS_u(X))}{P(UPP_l(X))}$. Because POS_u(X) ⊆ UPP_l(X), the accuracy of approximation is the same as the conditional probability P(POS_u(X)|UPP_l(X)) of the positive region POS_u(X) given that the upper approximation event UPP_l(X) occurred. An alternative measure is roughness, defined after Pawlak as ρ_{l,u}(X) = 1 − ACC_{l,u}(X). The roughness represents the degree of uncertainty in the approximate representation of the set X. In probabilistic terms, the roughness can be interpreted as the conditional probability P(BND_{l,u}(X)|UPP_l(X)) of the boundary region BND_{l,u}(X) given that the upper approximation UPP_l(X) occurred.
Approximation Region Coverage Measures: In the original rough set model, the precision measure is defined as the ratio of the size of the lower approximation of the target set X to the size of the set X [2]. In the VPRSM, the lower approximation of a set X is not necessarily a subset of the set X. Consequently, the natural generalization of the original precision measure is the coverage of the lower approximation, defined as the conditional probability P(POS_u(X)|X) of the positive region of the set X given the occurrence of the set. The coverage of the lower approximation can be computed by
$$COV_{POS_u}(X) = P(POS_u(X)|X) = \frac{P(POS_u(X) \cap X)}{P(X)} = \frac{1}{P(X)} \sum_{E \subseteq POS_u(X)} P(E)P(X|E).$$
Similarly, the coverage of the negative region is given by
$$COV_{NEG_l}(X) = P(NEG_l(X)|X) = \frac{P(NEG_l(X) \cap X)}{P(X)} = \frac{1}{P(X)} \sum_{E \subseteq NEG_l(X)} P(E)P(X|E).$$
The above measures provide an estimate of the frequency of occurrence of new objects (observations) belonging to the positive or negative regions in situations when they do belong to the set X. The value of the coverage of the positive region should be maximized for obvious reasons; on the other hand, the coverage of the negative region should be minimized. In principle, the coverage measure can be computed with respect to any discernible region of the universe. For instance, the coverage of the boundary region,
$$COV_{BND_{l,u}}(X) = P(BND_{l,u}(X)|X) = \frac{P(BND_{l,u}(X) \cap X)}{P(X)} = \frac{1}{P(X)} \sum_{E \subseteq BND_{l,u}(X)} P(E)P(X|E),$$
tells us what percentage of cases belonging to X will fall into the boundary BND_{l,u}(X). Similarly, the coverage of the upper approximation can be defined as the conditional probability P(UPP_l(X)|X) of the l-upper approximation given the occurrence of the set X.
Approximation Region Certainty Measures: The expected certainties of prediction of the event X are average measures extending over the upper approximation and the positive and boundary regions. The certainty of the positive region provides aggregate information about the odds of an object belonging to the set X if it is known that it belongs to the lower approximation of X. It is P(X|POS_u(X)), defined by
$$CERT_{POS_u}(X) = \frac{P(POS_u(X) \cap X)}{P(POS_u(X))} = \frac{1}{P(POS_u(X))} \sum_{E \subseteq POS_u(X)} P(E)P(X|E).$$
This measure can be used as an estimator of the expected performance of a prediction system for predictions involving cases belonging to the lower approximation POS_u(X). The certainty of the negative region is defined as
$$CERT_{NEG_l}(X) = P(X|NEG_l(X)) = \frac{P(NEG_l(X) \cap X)}{P(NEG_l(X))} = \frac{1}{P(NEG_l(X))} \sum_{E \subseteq NEG_l(X)} P(E)P(X|E).$$
In practical applications with probabilistic decision tables [8], the certainty of the positive region should be maximized whereas the certainty of the negative region should be minimized. The certainty of the upper approximation represents general information about the expected success rate of predictions based on the upper approximation:
$$CERT_{UPP_l}(X) = P(X|UPP_l(X)) = \frac{P(UPP_l(X) \cap X)}{P(UPP_l(X))} = \frac{1}{P(UPP_l(X))} \sum_{E \subseteq UPP_l(X)} P(E)P(X|E).$$
The certainty of the boundary region is a measure of the average success rate when making predictions based on information represented by the elementary sets of the region:
$$CERT_{BND_{l,u}}(X) = P(X|BND_{l,u}(X)) = \frac{P(BND_{l,u}(X) \cap X)}{P(BND_{l,u}(X))} = \frac{1}{P(BND_{l,u}(X))} \sum_{E \subseteq BND_{l,u}(X)} P(E)P(X|E).$$
We may note that the coverage and certainty measures are related to each other:
$$CERT_{POS_u}(X) = \frac{P(X)}{P(POS_u(X))} COV_{POS_u}(X) \quad\text{and}\quad CERT_{NEG_l}(X) = \frac{P(X)}{P(NEG_l(X))} COV_{NEG_l}(X).$$
Similar relations hold for the other approximation regions. The relations indicate that these two measures are essentially equivalent, although their interpretation is different.
Approximation Region Certainty Gain Measures: Similarly to the other global measures, we can define the expected certainty gain measure G_region(X) for approximation regions. This measure captures the average degree of improvement in prediction probability when using the region's definition in a probabilistic decision table. For the positive region, it is defined as
$$G_{POS_u}(X) = \sum_{E \subseteq POS_u(X)} P(E|POS_u(X)) G(X|E) = \frac{1}{P(POS_u(X))} \sum_{E \subseteq POS_u(X)} P(E) G(X|E).$$
For the negative region,
$$G_{NEG_l}(X) = \sum_{E \subseteq NEG_l(X)} P(E|NEG_l(X)) G(X|E) = \frac{1}{P(NEG_l(X))} \sum_{E \subseteq NEG_l(X)} P(E) G(X|E).$$
Interestingly, the gain measure is essentially equivalent to the coverage and certainty measures for the positive and negative regions. It can be shown that
$$G_{POS_u}(X) = \frac{1}{P(X)} CERT_{POS_u}(X) - 1 = \frac{1}{P(POS_u(X))} COV_{POS_u}(X) - 1$$
and
$$G_{NEG_l}(X) = \frac{P(X)}{1-P(X)} \left(1 - \frac{1}{P(X)} CERT_{NEG_l}(X)\right) = \frac{P(X)}{1-P(X)} \left(1 - \frac{1}{P(NEG_l(X))} COV_{NEG_l}(X)\right),$$
which means that one measure can be substituted for another if the probabilities P(X), P(POS_u(X)) and P(NEG_l(X)) are known.
Global Certainty Gain: To evaluate the expected certainty gain associated with the whole decision table, the measure of global certainty gain $G(X) = \sum_{E \subseteq U} P(E) G(X|E)$ can be used. When deriving a decision table from data, one of the primary guiding objectives should be maximizing the global certainty gain of the table. The global certainty gain is maximized if the target X is a precise, or definable, set [2]. In this case, the maximum value G_max(X) of the global certainty gain is given by
$$G_{max}(X) = P(POS^*(X)) \frac{1-P(X)}{P(X)} + P(NEG^*(X)) \frac{P(X)}{1-P(X)},$$
with POS*(X) = ∪{E_i : P(X|E_i) > P(X)} and NEG*(X) = ∪{E_i : P(X|E_i) < P(X)}.
Relative Global Certainty Gain: The ratio of the actual certainty gain to the maximum achievable certainty gain defines the relative global certainty gain G_rel(X) = G(X)/G_max(X). The relative certainty gain provides a normalized value of the global certainty gain which measures the overall degree of predictive certainty improvement obtained with a decision table.
4 Evaluation of Probabilistic Decision Tables
The probabilistic decision table [8] is a tabular representation of the available information about the universe U and about the relation existing between the classes E₁, E₂, ..., E_n of the approximation space and the target set X. In a probabilistic decision table it is assumed that objects are represented by attributes taking a finite number of values. The objects are grouped into classes E₁, E₂, ..., E_n based on the identity of attribute values. The decision table summarizes the following information for each elementary set E_i:
• the unique identification of the class E_i;
• the combination of attribute values common to all objects in the class;
• the estimate of the conditional probability P(X|E_i) relative to the set X;
• the estimate of the probability P(E_i) of the class E_i;
• the value of the certainty gain measure G(X|E_i);
• the specification of the approximation region (R-Region) for each class E_i.

Table 1. Probabilistic decision table with P(X) = 0.5248, l = 0.1 and u = 0.8

Ei   A1  A2  A3  A4  P(X|Ei)  P(Ei)  G(X|Ei)  R-REGION
E1   3   0   1   3   0.0      0.15   1.1044   NEG
E2   2   1   0   3   0.75     0.05   0.4291   BND
E3   3   1   0   3   0.0      0.10   1.1044   NEG
E4   1   1   1   1   1.0      0.01   0.9054   POS
E5   2   1   1   2   0.82     0.24   0.5625   POS
E6   2   0   0   3   0.12     0.05   0.8518   BND
E7   2   0   1   1   0.92     0.15   0.7530   POS
E8   3   0   0   1   0.91     0.15   0.7340   POS
In the following example, we will use the measures introduced in the previous sections to evaluate the probabilistic decision table (Table 1) that was obtained assuming l = 0.1 and u = 0.8, with P(X) = 0.5248. Below, we summarize the approximation regions of the target set X and their respective probabilities:
• upper approximation UPP₀.₁(X) = E₂ ∪ E₄ ∪ E₅ ∪ E₆ ∪ E₇ ∪ E₈ with P(UPP₀.₁(X)) = 0.65;
• 0.8-positive region POS₀.₈(X) = E₄ ∪ E₅ ∪ E₇ ∪ E₈ with P(POS₀.₈(X)) = 0.55;
• (0.1, 0.8)-boundary BND₀.₁,₀.₈(X) = E₂ ∪ E₆ with P(BND₀.₁,₀.₈(X)) = 0.1;
• 0.1-negative region NEG₀.₁(X) = E₁ ∪ E₃ with P(NEG₀.₁(X)) = 0.25.
From the above figures and the information given in Table 1, we can calculate the global quality measures as follows:
• the accuracy: $ACC_{0.1,0.8}(X) = \frac{P(POS_{0.8}(X))}{P(UPP_{0.1}(X))} = 0.846$;
• the coverage measures of the approximation regions:
$COV_{POS_{0.8}}(X) = \frac{1}{P(X)} \sum_{E \subseteq POS_{0.8}(X)} P(E)P(X|E) = 0.911$;
$COV_{BND_{0.1,0.8}}(X) = \frac{1}{P(X)} \sum_{E \subseteq BND_{0.1,0.8}(X)} P(E)P(X|E) = 0.082$;
$COV_{NEG_{0.1}}(X) = \frac{1}{P(X)} \sum_{E \subseteq NEG_{0.1}(X)} P(E)P(X|E) = 0$;
• the certainty measures of the approximation regions:
$CERT_{POS_{0.8}}(X) = \frac{1}{P(POS_{0.8}(X))} \sum_{E \subseteq POS_{0.8}(X)} P(E)P(X|E) = 0.875$;
$CERT_{BND_{0.1,0.8}}(X) = \frac{1}{P(BND_{0.1,0.8}(X))} \sum_{E \subseteq BND_{0.1,0.8}(X)} P(E)P(X|E) = 0.435$;
$CERT_{NEG_{0.1}}(X) = \frac{P(X)}{P(NEG_{0.1}(X))} COV_{NEG_{0.1}}(X) = 0$;
• the certainty gains of the approximation regions:
$G_{POS_{0.8}}(X) = \frac{1}{P(POS_{0.8}(X))} COV_{POS_{0.8}}(X) - 1 = 0.656$;
$G_{NEG_{0.1}}(X) = \frac{P(X)}{1-P(X)} \left(1 - \frac{1}{P(NEG_{0.1}(X))} COV_{NEG_{0.1}}(X)\right) = 1.1044$;
• the global certainty gain: $G(X) = \sum_{E \subseteq U} P(E)G(X|E) = 0.7071$;
• the maximum achievable certainty gain: $G_{max}(X) = P(POS^*(X))\frac{1-P(X)}{P(X)} + P(NEG^*(X))\frac{P(X)}{1-P(X)} = 0.8843$;
• the normalized global certainty gain: $G_{rel}(X) = \frac{G(X)}{G_{max}(X)} = 0.7996$.
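These figures can be re-derived mechanically from Table 1. The sketch below (ours, not the author's code) recomputes the region probabilities, the accuracy and the global certainty gain; small deviations from the printed values are possible where the text rounds intermediate results.

```python
p_x = 0.5248                                             # prior P(X)
P_E  = [0.15, 0.05, 0.10, 0.01, 0.24, 0.05, 0.15, 0.15]  # P(E_i)
P_XE = [0.0, 0.75, 0.0, 1.0, 0.82, 0.12, 0.92, 0.91]     # P(X|E_i)
region = ["NEG", "BND", "NEG", "POS", "POS", "BND", "POS", "POS"]

def mass(r):
    """Probability of an approximation region: sum of P(E) over its classes."""
    return sum(p for p, s in zip(P_E, region) if s == r)

def gain(q):
    """Certainty gain G(X|E) for a class with conditional probability q."""
    return (q - p_x) / p_x if q > p_x else (p_x - q) / (1 - p_x)

p_pos, p_bnd, p_neg = mass("POS"), mass("BND"), mass("NEG")
acc = p_pos / (p_pos + p_bnd)                            # ACC = P(POS)/P(UPP)
g_global = sum(p * gain(q) for p, q in zip(P_E, P_XE))
print(round(p_pos, 2), round(p_bnd, 2), round(p_neg, 2))  # 0.55 0.1 0.25
print(round(acc, 3))                                      # 0.846
print(round(g_global, 4))                                 # approx. 0.7073
```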
The results of the analysis of the example decision table indicate that it has rather favorable characteristics such as good accuracy, high coverage of the positive region, low coverage of the boundary and negative regions, relatively high average certainty and certainty gain in the positive and negative regions, high global certainty gain and low probability of the boundary region.
5 Final Remarks
The article presents a number of measures applicable to the analysis of probabilistic decision tables. All measures are defined using probability theory notation, as a result of using the VPRSM framework. The measures summarized in this paper make it easier to distinguish between good and not very useful decision tables obtained from data.
References
1. Beynon, M.: Investigating the choice of l and u values in the extended variable precision rough set model. In: Alpigini, J., Peters, J., Skowron, A., Zhong, N. (eds.): Rough Sets and Current Trends in Computing, Lecture Notes in AI 2475, Springer-Verlag (2002) pp. 61–68.
2. Pawlak, Z.: Rough sets – Theoretical aspects of reasoning about data. Kluwer Academic Publishers (1991).
3. Pawlak, Z.: Rough sets and decision algorithms. In: Ziarko, W., Yao, Y. (eds.): Rough Sets and Current Trends in Computing, Lecture Notes in AI 2005, Springer-Verlag (2001) pp. 30–45.
4. Slezak, D., Ziarko, W.: Bayesian rough set model. Proc. of the IEEE ICDM'02 Workshop on Foundations of Data Mining, Maebashi, Japan (2002) pp. 131–135.
5. Wong, S.K.M., Ziarko, W.: Comparison of the probabilistic approximate classification and the fuzzy set model. International Journal for Fuzzy Sets and Systems, vol. 21 (1986) pp. 357–362.
6. Yao, Y.Y., Wong, S.K.M.: A decision theoretic framework for approximating concepts. International Journal of Man-Machine Studies, 37 (1992) pp. 793–809.
7. Ziarko, W.: Variable precision rough sets model. Journal of Computer and Systems Sciences, vol. 46, no. 1 (1993) pp. 39–59.
8. Ziarko, W.: Probabilistic decision tables in the variable precision rough set model. Computational Intelligence: an International Journal, vol. 17, no. 3 (2001) pp. 593–603.
Query Answering in Rough Knowledge Bases
Aida Vitória¹, Carlos Viegas Damásio², and Jan Maluszyński³
¹ Dept. of Science and Technology, Linköping University, S 601 74 Norrköping, Sweden, [email protected]
² Centro de Inteligência Artificial (CENTRIA), Dept. Informática, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal, [email protected]
³ Dept. of Computer and Information Science, Linköping University, S 581 83 Linköping, Sweden, [email protected]
Abstract. We propose a logic programming language which makes it possible to define and to reason about rough sets. In particular we show how to test for rough inclusion and rough equality. This extension to our previous work [7] is motivated by the need for these concepts in practical applications.
Keywords: Rough sets, logic programming, stable models, uncertain reasoning.
1 Introduction
A rough set [4] is an approximation of a subset of the universe of discourse. It is usually defined by a finite decision table, whose rows correspond to some objects of the universe. We present a natural extension of this formalism that gives direct support for the integration of background (e.g. expert) knowledge and reasoning, and the possibility to combine different rough sets for defining new ones. We introduce a language for defining rough sets and its compilation to an expressive logic programming language. The idea is similar to our previous work [7] but extends it substantially. Our new language is more expressive, and more suitable for defining concepts in terms of the notions of rough set theory, as illustrated by the examples in the paper. The precise meaning of the definitions is provided by translation to extended logic programs, under the paraconsistent stable model semantics [5,6]. In this way we link rough sets with paraconsistent logic. A program may have several (paraconsistent) stable models (or no model at all). Each of the models describes a family of rough sets and can be seen as a possible scenario of rational beliefs supported by the knowledge represented by the program. The price we pay for the extra expressiveness is an increase in time complexity: to know whether a literal belongs to a stable model is an NP-complete problem. However, efficient implementations of the stable model semantics exist (see e.g. [2,3]) and these systems can be readily used for querying rough sets defined in our language.
2 Preliminaries
2.1 Rough Sets
Preliminaries Rough Sets
We deal with a universe U of objects with attributes. An attribute a can be seen as a partial function a : U → V_a, where V_a is called the value domain of a. Thus, every object is associated with a tuple of attribute values. We assume that this tuple is the only way of referring to the object: different objects with the same attribute values are indiscernible. A subset S of U can only be characterized by two sets of tuples: S⁺, including the tuples of the objects in S, and S⁻, including the tuples of the objects in the complement of S. Notice that S⁺ and S⁻ neither need to be disjoint nor do they have to cover the universe. A rough set (or rough relation) S is a pair (S⁺, S⁻) such that $S^+, S^- \subseteq \prod_{1 \leq i \leq n} V_{a_i}$, for some non-empty set of attributes {a₁, ..., a_n}. The indiscernibility classes in S⁺ (called the upper approximation, also represented as $\overline{S}$) include all elements that possibly belong to S. Similarly, the classes in S⁻ include all elements that possibly do not belong to S. The lower approximation of S, defined as S⁺ − S⁻ (and represented as $\underline{S}$), includes only the non-conflicting elements that belong to S. The rough complement of a rough set S = (S⁺, S⁻) is the rough set ¬S = (S⁻, S⁺). We want to stress that in our work a rough set is not defined in terms of objects of the universe, but instead in terms of the tuples that describe each equivalence class (i.e. class of indiscernible objects) to which the objects belong. Moreover, the terms "rough set" and "rough relation" are used interchangeably.
2.2 Extended Logic Programs
This section surveys well-known concepts of logic programming needed in the sequel; for more details the reader is referred to [1,5,6]. We start by presenting the syntax of extended logic programs and briefly defining the paraconsistent stable model semantics of extended logic programs [5,6], the target logic programming language of the transformations discussed in Section 3.2. We resort only to the disjunction-free fragment of the languages described in [5,6]. Without loss of generality, we consider only ground logic programs (i.e. no variables occur in an atom A). Assume that the alphabet K is a set of propositional variables, called atoms. An objective literal L is either an atom A ∈ K or its explicit negation ¬A. The set of all objective literals is K ∪ ¬K, where ¬K = {¬A : A ∈ K}. The default negation of a literal L is represented by not L (also called a default negated literal). A literal is either an objective literal L or its default negation not L. Intuitively, an objective literal represents (positive or negative) evidence, while the default negated literal represents a lack of (respectively, positive or negative) evidence. This makes it possible to represent the information that a flight departed without delay, obtained from flight control, differently from the mere lack of a delay announcement. We allow the coexistence of positive and negative evidence; to formalize this we will use a paraconsistent logic.
A program clause is an expression L₀ :- L₁, ..., L_m, not L_{m+1}, ..., not L_n, where each L_i is an objective literal and n ≥ 0. The left side of the clause (w.r.t. :-) is called the head, and the right side is designated the body of the clause. Informally, a program clause represents an implication: if every literal in the body is true then the head must also be true (if variables are allowed, they should be understood as universally quantified). An integrity constraint has the form :- L₁, ..., L_m, not L_{m+1}, ..., not L_n, with n ≥ 1, and can be seen as a clause with the head being false. An extended logic program (ELP) is a set of program clauses and integrity constraints. A definite extended logic program is simply an ELP without integrity constraints and occurrences of default negated literals. An interpretation I of an extended logic program P is any subset of K ∪ ¬K. As usual, an interpretation settles the set of true literals: if L ∈ I then the objective literal L has the truth value true, and if L ∉ I then the objective literal L is false (i.e. not L is true). An interpretation I satisfies a program clause if the corresponding implication holds in I, and satisfies an integrity constraint if at least one literal in its body is false. A model of an extended logic program P is any interpretation that satisfies every program clause and integrity constraint of P. Intuitively, an integrity constraint discards all model candidates that make every literal in its body true. A program may have many models; they are ordered by set inclusion. Any definite ELP has a least model; other ELPs may have several minimal models. We want to consider only the models where each objective literal can be justified by some evidence in the program. This intuition is captured by the following definition:
Definition 1. Let P be an extended logic program and I an interpretation. The reduct of P with respect to I is the definite extended logic program Pᴵ such that L₀ :- L₁, ..., L_m is in Pᴵ iff there is a program clause of the form L₀ :- L₁, ..., L_m, not L_{m+1}, ..., not L_n in P such that {L_{m+1}, ..., L_n} ∩ I = {}. The interpretation I is a paraconsistent stable model of P iff I is the least model of Pᴵ and I satisfies all integrity constraints of P.
Whenever confusion does not arise, we sometimes use the term model instead of paraconsistent stable model. Finally, we note that Smodels [3] and dlv [2] are currently available systems for computing stable models of programs (often with tens of thousands of clauses). Both systems can also handle integrity constraints, and can easily be used to determine paraconsistent stable models of extended logic programs.
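Definition 1 can be prototyped in a few lines. The following Python sketch is ours (a toy, not Smodels or dlv): ground clauses are (head, body, default-negated-body) triples, integrity constraints are (body, default-negated-body) pairs, and explicit negation would be encoded inside the literal strings.

```python
def least_model(definite):
    """Least model of a definite program by naive forward chaining."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, body, _ in definite:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def is_stable(clauses, constraints, I):
    """Definition 1: I is a paraconsistent stable model iff I is the least
    model of the reduct w.r.t. I and I satisfies every integrity constraint."""
    reduct = [(h, b, []) for h, b, negs in clauses
              if not any(n in I for n in negs)]     # drop blocked clauses
    if least_model(reduct) != I:
        return False
    return all(any(b not in I for b in body) or any(n in I for n in negs)
               for body, negs in constraints)

# Toy ELP:  p :- not q.   q :- not p.   (two stable models: {p} and {q})
clauses = [("p", [], ["q"]), ("q", [], ["p"])]
print(is_stable(clauses, [], {"p"}))        # True
print(is_stable(clauses, [], {"p", "q"}))   # False
```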
3 A Language for Defining Rough Relations
In this section, we present a language for defining and querying rough relations, based on logic programming. It substantially extends our previous work [7] by
allowing lower approximations and boundaries of rough relations both in the head and in the body of clauses. We illustrate the need for and potential usefulness of such an extension by motivating examples. We also give the semantics of the new language. Note that the least model semantics of [7] is no longer applicable.
3.1 The Syntax and Motivating Examples
In this section, variables can occur in atoms (i.e. we allow non-ground atoms). Thus, an atom A is an expression of the form p(t₁, ..., t_m), where p is an m-ary predicate symbol and t₁, ..., t_m are variables or constants. As in the previous section, an objective literal L is an expression of the form A or ¬A, where A is an atom. Given an objective literal L, expressions of the form $\underline{L}$, $\overline{L}$, or $\overline{\underline{L}}$ are called rough literals. A rough clause is a formula of the form H :- B₁, ..., B_n., where H and every B_i (0 ≤ i ≤ n) are rough literals. A rough program 𝒫 is a set of rough clauses. Moreover, clauses with an empty body (i.e. n = 0) are usually called facts and are written as H., where H is a rough literal. Intuitively, each predicate p denotes a rough relation P and we use rough literals to represent evidence about tuples. For instance, the facts $\overline{p}(t_1, \ldots, t_n)$. and $\overline{\neg p}(t_1, \ldots, t_n)$. express the information that the tuple ⟨t₁, ..., t_n⟩ belongs both to the rough relation P and to its complement ¬P (thus, to the boundary of P, denoted by $\overline{\underline{p}}(\ldots)$). Obviously, $\overline{\underline{p}}(\ldots)$ and $\overline{\underline{\neg p}}(\ldots)$ denote the same set of tuples. The lower (upper) approximation of P is represented by $\underline{p}(\ldots)$ ($\overline{p}(\ldots)$). A decision table D can be represented in our language: a row of D corresponding to a positive (negative) example, where each c_i is the value of a conditional attribute, is represented as the fact $\overline{d}(c_1, \ldots, c_n)$. ($\overline{\neg d}(c_1, \ldots, c_n)$.). Since rough clauses allow lower and upper approximations of a relation, as well as boundaries, to occur both in the body and in the head of a clause, it is possible to define separately each of the regions (i.e. lower and upper approximations and boundary) of a rough relation in terms of regions of other rough relations. For instance, we can represent that the boundary of a rough relation Q is contained in the lower approximation of another rough relation P: if the predicates q and p denote the rough relations Q and P, respectively, then the rough clause $\underline{p}(X_1, \ldots, X_n)$ :- $\overline{\underline{q}}(X_1, \ldots, X_n)$. captures this information. The following examples motivate the potential usefulness of our language.
Example 1. A relation Train has two arguments (attributes) representing time and location, respectively. Two sensors automatically detect the presence/absence of an approaching train at a crossing, producing facts like $\overline{train}$(12:50,Montijo). that are automatically added to the knowledge base. A malfunction of a sensor may result in the contradictory fact $\overline{\neg train}$(12:50,Montijo). being added, too. Crossing is allowed if for sure no train approaches. This can be described by the following clause, involving a lower approximation in the body:
cross(X,Y) :- $\underline{\neg train}$(X,Y).
Example 2. Statistical data on purchases of a certain product during a calendar year is organized as a decision table with the following 4 attributes defining groups of customers:
Area – zip code of the area where the customer lives
Income – customer's income interval
Age – customer's age interval
Unimulti – does the customer live in a house or in an apartment?
Two experts classify every group as active or not active depending on the number of transactions during the year. The opinions of the experts may differ; thus the decision table defines a rough relation. The marketing department uses the activity tables act1 and act2 from two consecutive years to identify the groups of growing activity (ga). The tables are represented as facts in our language. The activity of a group may be defined: (1) as definitely growing, if the group was possibly inactive in year 1 and definitely active in year 2; (2) as definitely non-growing, if its activity changed from possibly active to definitely inactive; and (3) as a boundary, if the activity was boundary in both years. This can be described by the following clauses:

$\underline{ga}$(X, Y, W, Z) :- $\overline{\neg act1}$(X, Y, W, Z), $\underline{act2}$(X, Y, W, Z).    (1)
$\underline{\neg ga}$(X, Y, W, Z) :- $\overline{act1}$(X, Y, W, Z), $\underline{\neg act2}$(X, Y, W, Z).    (2)
$\overline{\underline{ga}}$(X, Y, W, Z) :- $\overline{\underline{act1}}$(X, Y, W, Z), $\overline{\underline{act2}}$(X, Y, W, Z).    (3)

3.2 The Semantics
In this section, we present a transformation of rough programs into the extended logic programs introduced in Section 2.2. In this way, we obtain both a declarative and an operational semantics for our language. The intuition is as follows. Assume that P and Q are the rough relations denoted by predicates p and q, respectively. Then the literal p(t₁, ..., t_n) is a statement that the tuple ⟨t₁, ..., t_n⟩ is in P, and the literal ¬p(t₁, ..., t_n) indicates that the tuple ⟨t₁, ..., t_n⟩ is not in P (i.e. belongs to ¬P). The default negated literal not p(t₁, ..., t_n) (not ¬p(t₁, ..., t_n)) states that there is no evidence that the tuple ⟨t₁, ..., t_n⟩ is a positive (negative) example of P. Now rough literals can be equivalently expressed by conjunctions of literals of ELPs, as formalized by the following transformation τ₂:
$\tau_2(\underline{p}(t_1, \ldots, t_n))$ = p(t₁, ..., t_n), not ¬p(t₁, ..., t_n),
$\tau_2(\underline{\neg p}(t_1, \ldots, t_n))$ = ¬p(t₁, ..., t_n), not p(t₁, ..., t_n),
$\tau_2(\overline{p}(t_1, \ldots, t_n))$ = p(t₁, ..., t_n),
$\tau_2(\overline{\neg p}(t_1, \ldots, t_n))$ = ¬p(t₁, ..., t_n),
$\tau_2(\overline{\underline{p}}(t_1, \ldots, t_n))$ = p(t₁, ..., t_n), ¬p(t₁, ..., t_n),
$\tau_2(\overline{\underline{\neg p}}(t_1, \ldots, t_n))$ = $\tau_2(\overline{\underline{p}}(t_1, \ldots, t_n))$,
τ₂((B₁, ..., B_n)) = τ₂(B₁), ..., τ₂(B_n).
This transformation can be used to compile rough literals in the bodies of source (i.e. rough) program clauses. However, the translation is not directly
applicable to the heads, since the heads in the target programs can contain neither conjunctions of literals nor default literals. Therefore, it may be necessary to compile a clause of the source program into a clause and an integrity constraint of the target program, as described below. For example, consider a rough clause like $\underline{p}(\ldots)$ :- $\overline{\underline{q}}(\ldots)$., stating that the boundary of Q is contained in the lower approximation of P. Any element in the boundary of Q should also be considered a positive example of P, but there should be no evidence that those tuples are examples of ¬P. Moreover, a tuple t belongs to the boundary of Q if and only if it represents both positive and negative evidence for it. Thus, p(...) :- q(...), ¬q(...). and :- ¬p(...), q(...), ¬q(...). capture the same information as the rough clause above: the program clause states that tuples belonging to both Q and ¬Q also belong to P, while the integrity constraint does not allow those tuples to belong to ¬P. The discussion above motivates the formalization of the translation of rough clauses into clauses of an extended logic program. It is defined as the following function τ₁, which refers to the function τ₂ defined above:
$\tau_1(\underline{p}(t_1, \ldots, t_n)$ :- B.) = {p(t₁, ..., t_n) :- τ₂(B)., :- ¬p(t₁, ..., t_n), τ₂(B).},
$\tau_1(\overline{p}(t_1, \ldots, t_n)$ :- B.) = {p(t₁, ..., t_n) :- τ₂(B).},
$\tau_1(\underline{\neg p}(t_1, \ldots, t_n)$ :- B.) = {¬p(t₁, ..., t_n) :- τ₂(B)., :- p(t₁, ..., t_n), τ₂(B).},
$\tau_1(\overline{\neg p}(t_1, \ldots, t_n)$ :- B.) = {¬p(t₁, ..., t_n) :- τ₂(B).},
$\tau_1(\overline{\underline{p}}(t_1, \ldots, t_n)$ :- B.) = {¬p(t₁, ..., t_n) :- τ₂(B)., p(t₁, ..., t_n) :- τ₂(B).},
$\tau_1(\overline{\underline{\neg p}}(t_1, \ldots, t_n)$ :- B.) = $\tau_1(\overline{\underline{p}}(t_1, \ldots, t_n)$ :- B.).
A rough program 𝒫 is transformed into an extended logic program τ₁(𝒫) by compiling each rough clause. Moreover, if M_P is a paraconsistent stable model of P = τ₁(𝒫), then each predicate symbol q with arity n occurring in 𝒫 denotes the rough relation
$Q_{M_P} = (\{(c_1, \ldots, c_n) \mid q(c_1, \ldots, c_n) \in M_P\},\; \{(c_1, \ldots, c_n) \mid \neg q(c_1, \ldots, c_n) \in M_P\})$
in the model M_P. Recall that τ₁(𝒫) is an extended logic program and, therefore, may have several paraconsistent stable models (or none). In each model, the predicate q may denote a different rough relation. Consequently, the denotation of a predicate is always relative to a model.
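The compilation τ₁/τ₂ can be rendered executable. Below is a schematic Python sketch of ours, with an assumed representation: a rough literal is a triple (kind, sign, atom) with kind in {"lower", "upper", "bnd"}, '~' writes explicit negation, and the target clauses are plain strings.

```python
def t2(kind, sign, atom):
    """Translate a rough body literal into a list of ELP literals."""
    pos, neg = atom, "~" + atom
    if sign == "-":
        pos, neg = neg, pos
    if kind == "lower":
        return [pos, "not " + neg]     # evidence for, none against
    if kind == "upper":
        return [pos]                   # some evidence for
    return [atom, "~" + atom]          # boundary: both kinds of evidence

def t1(head, body):
    """Translate one rough clause; some heads add an integrity constraint."""
    kind, sign, atom = head
    b = ", ".join(lit for rl in body for lit in t2(*rl))
    pos, neg = (atom, "~" + atom) if sign == "+" else ("~" + atom, atom)
    if kind == "lower":
        return [f"{pos} :- {b}.", f":- {neg}, {b}."]
    if kind == "upper":
        return [f"{pos} :- {b}."]
    return [f"{atom} :- {b}.", f"~{atom} :- {b}."]   # boundary head

# The example from the text: lower p(...) :- boundary q(...).
for clause in t1(("lower", "+", "p(X)"), [("bnd", "+", "q(X)")]):
    print(clause)
# p(X) :- q(X), ~q(X).
# :- ~p(X), q(X), ~q(X).
```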
3.3 Queries
This section proposes a language to query rough programs. This can be achieved by adapting existing systems based on the stable model semantics [2,3], which is a topic of future work. Here we only present queries and their expected answers. Since there might exist more than one model, answers are computed w.r.t. one paraconsistent stable model of the program. If a program has a unique paraconsistent stable model, which may often be the case (for instance, any definite extended logic program, or any rough program whose clauses do not contain lower approximations in their bodies, has a unique model), the answers will refer to this model.
Definition 2. A rough query is a pair (Q, 𝒫), where 𝒫 is a rough program and Q is defined by the following abstract syntax rule:
Q → $\underline{L}$ | $\overline{L}$ | $\overline{\underline{L}}$ | $\underline{L_1} \subseteq \underline{L_2}$ | $\overline{L_1} \subseteq \overline{L_2}$ | $L_1 \sqsubseteq L_2$ | $\underline{L_1} = \underline{L_2}$ | $\overline{L_1} = \overline{L_2}$ | $L_1 \approx L_2$,
where L, L₁, and L₂ are objective literals. The first three cases of a rough query ($\underline{L}$, $\overline{L}$, and $\overline{\underline{L}}$) have already been introduced in our previous work [7]. For instance, with the query ($\overline{\underline{q}}(c_1, c_2)$, 𝒫) we ask whether the tuple ⟨c₁, c₂⟩ belongs to the boundary region of the rough relation denoted by q in some model of 𝒫. Due to lack of space, we omit here any further details about the translation of this kind of queries. In some applications (see e.g. [8]) it is necessary to check rough inclusion or rough equality of given rough relations. Our query language has the respective queries, and we now discuss how they can be answered. The idea is to translate them into a set of integrity constraints that are added to the compiled program τ₁(𝒫); a new extended logic program P′ is obtained in this way. Then the query is answered positively (i.e. the test succeeds) if P′ has at least one paraconsistent stable model; otherwise, the query is answered negatively (i.e. the test fails). Thus, we reduce the answering problem for this kind of queries to the problem of checking the existence of paraconsistent stable models of an ELP in which certain properties, expressed by the integrity constraints, hold. Given an objective literal L, we consider that ¬¬L and L have the same meaning. Moreover, assume that the objective literals L₁ and L₂ denote rough relations Q₁ and Q₂, respectively. We start by considering the queries ($\overline{L_1} \subseteq \overline{L_2}$, 𝒫) and ($\underline{L_1} \subseteq \underline{L_2}$, 𝒫) and define a function τ₃ that transforms these queries into a set of integrity constraints. We remind the reader that candidate models making every literal in the body of a constraint true are rejected:
$\tau_3(\overline{L_1} \subseteq \overline{L_2})$ = {:- L₁, not L₂.},
$\tau_3(\underline{L_1} \subseteq \underline{L_2})$ = {:- L₁, not ¬L₁, not L₂., :- L₁, not ¬L₁, ¬L₂.}.
We recall the notions of rough inclusion and rough equality [4]. A rough relation Q₁ is roughly included in a rough relation Q₂, denoted Q₁ ⊑ Q₂, if and only if $\underline{Q_1} \subseteq \underline{Q_2}$ and $\overline{Q_1} \subseteq \overline{Q_2}$. The rough sets Q₁ and Q₂ are roughly equal, denoted Q₁ ≈ Q₂, if and only if $\underline{Q_1} = \underline{Q_2}$ and $\overline{Q_1} = \overline{Q_2}$. Given a rough program 𝒫, we have that the answer to the query
(i) ($\overline{L_1} \subseteq \overline{L_2}$, 𝒫) is yes iff the ELP P₁ = τ₁(𝒫) ∪ τ₃($\overline{L_1} \subseteq \overline{L_2}$) has a model;
(ii) ($\underline{L_1} \subseteq \underline{L_2}$, 𝒫) is yes iff the ELP P₂ = τ₁(𝒫) ∪ τ₃($\underline{L_1} \subseteq \underline{L_2}$) has a model;
(iii) (L₁ ⊑ L₂, 𝒫) is yes iff the ELP P = P₁ ∪ P₂ has a model;
(iv) ($\overline{L_1} = \overline{L_2}$, 𝒫) is yes iff the ELP P₃ = τ₁(𝒫) ∪ τ₃($\overline{L_1} \subseteq \overline{L_2}$) ∪ τ₃($\overline{L_2} \subseteq \overline{L_1}$) has a model;
(v) ($\underline{L_1} = \underline{L_2}$, 𝒫) is yes iff the ELP P₄ = τ₁(𝒫) ∪ τ₃($\underline{L_1} \subseteq \underline{L_2}$) ∪ τ₃($\underline{L_2} \subseteq \underline{L_1}$) has a model;
(vi) (L₁ ≈ L₂, 𝒫) is yes iff the ELP P = P₃ ∪ P₄ has a model.
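As a companion to cases (i)-(iii), the τ₃ constraint generation is easy to prototype. This is a sketch of ours with assumed string syntax, '~' again standing for explicit negation:

```python
def t3_upper_subset(l1, l2):
    """Constraint for upper inclusion: reject models with evidence for L1
    but no evidence for L2."""
    return [f":- {l1}, not {l2}."]

def t3_lower_subset(l1, l2):
    """Constraints for lower inclusion: a tuple certainly in L1 (evidence
    for, none against) must be certainly in L2 as well."""
    return [f":- {l1}, not ~{l1}, not {l2}.",
            f":- {l1}, not ~{l1}, ~{l2}."]

def rough_inclusion(l1, l2):
    """L1 roughly included in L2: lower-in-lower and upper-in-upper."""
    return t3_lower_subset(l1, l2) + t3_upper_subset(l1, l2)

for ic in rough_inclusion("p(X)", "q(X)"):
    print(ic)
```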
4 Conclusions and Future Work
Conclusions and Future Work
We introduced a language for representing vague knowledge in the framework of rough set theory, together with a query language. We defined a natural translation of our language into extended logic programming under the paraconsistent stable model semantics. This opens the way for reusing existing logic programming systems based on the stable model semantics for answering rough set queries. Our language uses the common notions of rough set theory: lower and upper approximations, and boundaries, for implicit definitions of rough sets. The usual technique of defining rough sets by decision tables is embedded as a special case. The language is flexible enough to allow separate definitions of each region of a rough set, and it substantially extends our previous proposal in [7]. Since a program 𝒫 in our language may have different models, the rough sets defined by 𝒫 may not be unique; this fact can be useful in some applications. The proposed query language is very expressive: it makes it possible not only to search for elements in particular regions of rough relations but also to test for rough inclusion and rough equality, which is essential in some applications. Continuation of this work will include the development of a system based on the presented ideas, integrating them with techniques commonly used in rough sets, such as reducts and quantitative techniques, and testing it on real-life examples.
References
1. K. Apt and R. Bol. Logic programming and negation: A survey. Journal of Logic Programming, volume 19/20, pages 9–72. Elsevier, May/July 1994.
2. T. Eiter, N. Leone, C. Mateis, G. Pfeifer, and F. Scarcello. The KR system dlv: Progress report, comparisons and benchmarks. In A. G. Cohn, L. Schubert, and S. C. Shapiro, editors, KR'98: Principles of Knowledge Representation and Reasoning, pages 406–417, San Francisco, California, 1998. Morgan Kaufmann.
3. I. Niemelä and P. Simons. Efficient implementation of the well-founded and stable model semantics. In M. Maher, editor, Proc. of the Joint International Conference and Symposium on Logic Programming, pages 289–303, Bonn, Germany, 1996. MIT Press.
4. Z. Pawlak. Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991.
5. D. Pearce. Answer sets and constructive logic, II: Extended logic programs and related non-monotonic formalisms. In L. Pereira and A. Nerode, editors, Logic Programming and Nonmonotonic Reasoning – Proceedings of the Second International Workshop, pages 457–475. MIT Press, 1993.
6. C. Sakama and K. Inoue. Paraconsistent stable semantics for extended disjunctive programs. Journal of Logic and Computation, 5(3):265–285, 1995.
7. A. Vitória and J. Maluszyński. A logic programming framework for rough sets. In J. Alpigini, J. Peters, A. Skowron, and N. Zhong, editors, Proc. of the 3rd International Conference on Rough Sets and Current Trends in Computing, RSCTC'02, number 2475 in LNCS/LNAI, pages 205–212. Springer-Verlag, 2002.
8. W. Ziarko and X. Fei. VPRSM approach to WEB searching. In J. Alpigini, J. Peters, A. Skowron, and N. Zhong, editors, Proc. of the 3rd International Conference on Rough Sets and Current Trends in Computing, RSCTC'02, number 2475 in LNCS/LNAI, pages 514–521. Springer-Verlag, 2002.
Upper and Lower Recursion Schemes in Abstract Approximation Spaces
Peter Apostoli and Akira Kanda
Department of Philosophy, The University of Toronto, {apostoli,akanda}@cs.toronto.edu
Abstract. An approximation space (U, R) placed in a type-lowering retraction with 2^(U×U) provides a model for a first order calculus of relations for computing over lists and reasoning about the resulting programs. Upper and lower approximations to the scheme of primitive recursion of the Theory of Pairs are derived from the approximation operators of an abstract approximation space (U, ♦ : u ↦ ∪[u]R, □ : u ↦ ∩[u]R). Keywords. Natural deduction set theory, negation in logic programming systems, the second recursion theorem, rough sets, abstract approximation spaces and modal logic.
1
Set-Based Programming Systems
The idea of presenting a computational formalism as a subtheory of a logical system was first explored by Gödel in order to arithmetize first order arithmetic in the context of his famous incompleteness results. The emergence of the logic programming paradigm in computer science has re-focussed attention on the Gödelian themes program = formula and computation = theorem proving. Such environments, in which programming and reasoning about programs take place on a common logical basis, are called “programming systems”. Since set theory is thought to underlie mathematics and computation, it provides a convenient language for programming systems. However, the marriage of programming and set theory risks the kind of inconsistency that plagued the founders of modern logic, originating from the Zermelo-Russell paradox. Most proposals for programming systems have followed one of the traditional foundational schools of thought, including type theory and higher order logic, axiomatic set theory, Peano arithmetic (or equivalents), and combinatory logic. However, typing outlaws the characteristic ability of computation theory to treat the same item as both data and program, while axiomatic set theory and arithmetic encode these operations but their Hilbert-style presentation is ill-suited to computational implementation in terms of deterministic re-write systems. Type-free systems in the tradition of combinatory logic admit the dual operations as λ abstraction and application respectively and, following [8], may be presented as Gentzen LK¹ sequent calculi. Called “natural deduction based set theories” in [8],² these analytic calculi yield deterministic re-write systems by standard
¹ See [15] for background on LK. ² See [15], [7] and [8] for background on natural deduction based set theory.
resolution techniques for quantification (e.g. unification). However, natural deduction set theories usually jettison some principle of classical logic to disarm the antinomies.³ E.g., systems based upon partial logic⁴ [8], [2] reject bivalence, as do those based upon intuitionistic logic [16]. The method of upper and lower approximations of rough set theory [14] provides an alternative to the departure from classicality [3]. Given a domain of discourse U, λ abstraction and application may be implemented as a pair (f, g) of functions f : 2^U → U, g : U → 2^U such that f(g(u)) = u. Writing “u ∈ u′” for u ∈ g(u′), (U, f, g) is a model of abstract set theory in which Cantor's “consistent multiplicities” are identified with precisely those subsets of U which are elements of the section g[U] of the retraction. Hence the fundamental question [4] arises: which subsets are these? One answer [3] is to augment the retraction pair (f, g) with an equivalence relation R on U and characterize the section as comprised of precisely the R-closed subsets of U. Comprehension is governed by the method of upper and lower approximations in the resulting universe (U, R, f, g) of abstract rough sets, called [3] a “proximal” Frege structure. Generalized to a theory of binary relations, a model for a classical LK calculus⁵ of rough sets and relations is obtained, closely related to the type-free LK calculus of [2]. In this new programming system, sets of lists are computed by a method of upper and lower approximating fixed points applied to recursive functionals and are reasoned about classically.
2 Abstract Rough Set Theory

Let U ≠ ∅ and R ⊆ U × U be an equivalence relation. The pair (U, R) is called an approximation space [14]. For each point u ∈ U, let [u]R denote the equivalence class of u under R. R-classes [u]R are called (R-)granules or elementary subsets. Let A ⊆ U;

Int(A) =df {x ∈ U | (∀y ∈ U)[xRy → y ∈ A]} = ∪{[u]R | [u]R ⊆ A},
Cl(A) =df {x ∈ U | (∃y ∈ U)[xRy ∧ y ∈ A]} = ∪{[u]R | [u]R ∩ A ≠ ∅}

are called the lower and upper approximations of the subset A, respectively. Cl(A) is the relational closure under R of the subset A. R is interpreted as a relation of indiscernibility in terms of some prior family of concepts (subsets of U). R-closed subsets of U are called (R-)complete. It is natural to regard complete subsets of U as the parts of U and elementary subsets as the atomic parts of U. C(R) denotes the family of R-complete subsets of U.
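For illustration (an addition of ours, not in the original), the two operators are directly computable when U is finite and R is given by its granules:

def lower_approx(A, granules):
    # Int(A): union of the granules entirely contained in A.
    return set().union(*([g for g in granules if g <= A] or [set()]))

def upper_approx(A, granules):
    # Cl(A): union of the granules that meet A.
    return set().union(*([g for g in granules if g & A] or [set()]))

granules = [{1, 2}, {3}, {4, 5}]       # a partition of U = {1,...,5}
A = {1, 2, 4}                          # A is not R-complete
print(lower_approx(A, granules))       # {1, 2}
print(upper_approx(A, granules))       # {1, 2, 4, 5} = Cl(A), which is R-closed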
³ Though see the “weakly-typed” systems of [1], [9], [10]. These classical systems enjoy most advantages of type-free theories but disarm the antinomies by treating sets as predicate expressions and rigorously distinguishing the “use” of an expression in predication from its “mention” as an element of a set. ⁴ See [7] for background on partial logic and set theory. ⁵ The present essay provides a model F for a first order LK calculus of sets and relations similar to that of [2]. Since brevity precludes presenting this calculus here, we use first order schemes to characterize the key axioms and rules valid on F, leaving the corresponding LK axiomatization to the interested reader.
Let (U, R) and (U′, R′) be approximation spaces. Define an equivalence relation R × R′ over U × U′ as follows: (x, x′)[R × R′](y, y′) ↔ xRy ∧ x′R′y′. This yields a new approximation space (U × U′, R × R′). If W ⊆ U × U′ is R × R′-closed, then the first projection π1(W) = {x ∈ U | (∃x′ ∈ U′)(x, x′) ∈ W} and the second projection π2(W) = {x′ ∈ U′ | (∃x ∈ U)(x, x′) ∈ W} are R-closed and R′-closed, respectively. Further, if A ⊆ U and A′ ⊆ U′ are R-, R′-closed respectively, then A × A′ is R × R′-closed. Thus R × R′-closed subsets of U × U′ are called complete relations, i.e., complete subsets of the approximation space (U × U′, R × R′). Let U1 = U, R1 = R and U2 = U × U, R2 = R × R.

Let (U, R) be an approximation space. Let f1 : 2^U → U, g1 : U → 2^U be functions, called down (for type-lowering) and up (for type-raising), respectively. Let f2 : 2^(U×U) → U, g2 : U → 2^(U×U) be functions down and up. Assume further that:

1. fi is a retraction and gi is the adjoining section: fi(gi(u)) = u (u ∈ U), i.e. fi ∘ gi = 1U (i = 1, 2).
2. The operator gi ∘ fi is closure under Ri (i = 1, 2): for every X ⊆ U^i, gi(fi(X)) is Ri-closed and X ⊆ gi(fi(X)), so gi(fi(X)) = Cl(X) (i = 1, 2).
3. The complete subsets of U^i are precisely the X ⊆ U^i for which gi(fi(X)) = X (i = 1, 2). They are the fixed points of the operator gi ∘ fi (i = 1, 2).

Then F = (U, R, f1, g1, f2, g2) is called a binary proximal Frege structure (BPFS), written simply as (U, R, f, g). In [3], abbreviated symbols are used for f1 and g1; this convention is often followed in the sequel. Elements of U are called F-sets, or simply Frege sets. The indiscernibility relation R (= R1) of a BPFS is usually denoted by “≡”. The fundamental idea underlying a BPFS is to induce an approximation space over U by retracting 2^(U^i) onto U via the many-one type-lowering retraction pair (fi, gi), yielding a reflexive universe of sets U ◁ 2^(U^i) with an associated indiscernibility relation (i = 1, 2). It follows that for every u ∈ U, gi(u) = gi(fi(gi(u))) and hence gi(u) is complete (i = 1, 2). Suppose X ⊆ U^i is complete; then gi(fi(X)) = X. Thus, X ∈ gi(U) = Image(gi) (where gi(U) =df {gi(u) | u ∈ U}). So, the family of complete subsets and relations of U is precisely the image of the gi. In algebraic terms, they are the kernel of the retraction mappings. Further, the isomorphism C(Ri) ≅ U is given by: i : C(Ri) → U : X ↦ fi(X), j : U → C(Ri) : u ↦ gi(u) (i = 1, 2). The foregoing is summarized by the equation: C(Ri) ≅ U ◁ 2^(U^i) (i = 1, 2). Let F = (U, ≡, f, g) be a BPFS and u ∈ U. Writing “u(x)” for “x ∈ gi(u)” (x ∈ U^i, i = 1, 2), U may be interpreted as a universe of abstract (“Frege”) sets and relations. It is often convenient to write “x ∈ u” for “u(x)” (x ∈ U^i, i = 1, 2). As gi(u) is Ri-complete, x ∈ gi(u) ↔ x′ ∈ gi(u) whenever xRix′ (x, x′ ∈ U^i, i = 1, 2). Note that the conditional u(x) → x ∈ gi(u) governs the relation of set membership holding between Frege sets (object u may be treated as a complete subset of U^i), while its converse x ∈ gi(u) → u(x) yields the unrestricted comprehension of complete subsets (complete subsets may be treated as objects) (x ∈ U^i, i = 1, 2). F thus validates the Principle of Naive Comprehension (PNC) fi(X)(x) ↔ X(x) (x ∈ U^i) for Ri-complete subsets X of U^i (i = 1, 2). When X fails to be complete, the equivalence also fails and is replaced by a pair of approximating conditionals: (1) fi(Int(X))(x) → X(x); (2) X(x) → fi(Cl(X))(x) (x ∈ U^i, i = 1, 2).
Here, “X(x)” (“x falls under X”) indicates that x ∈ U^i is an element of X ⊆ U^i (i = 1, 2). Write “{x : X(x)}”,
“{x, y : X(x, y)}” to denote f1(X), f2(X) respectively. These denote elements of U, while “{u ∈ U | X(u)}”, “{x, y ∈ U | X(x, y)}” denote subsets of U, U^2 respectively. Note the distinction between the indiscernibility class [x]Ri of x ∈ U^i and the object fi([x]Ri) ∈ U that represents [x]Ri; the latter is denoted {x} when x ∈ U and ⟨u1, u2⟩ when x = (u1, u2) ∈ U^2. The former (latter) is called the proximal singleton of x (proximal ordered pair of u1, u2). Note ⟨u1, u2⟩ = {x, y : u1 ≡ x ∧ u2 ≡ y}. The Axiom of Extensionality (a) holds in F and (b) Ri is the relation of set theoretic indiscernibility: (a) (∀z ∈ U^i)(x(z) ↔ y(z)) ↔ x = y (x, y ∈ U, i = 1, 2); (b) (∀u ∈ U)(u(z) ↔ u(z′)) ↔ zRiz′ (z, z′ ∈ U^i, i = 1, 2). Let X^c =df U^i − X (X ∈ C(Ri), i = 1, 2). Since elements of U represent complete subsets of U^i, the complete Boolean algebra (C(Ri), ∪, ∩, ^c, ∅, U^i) is isomorphic to (U, fi(∪), fi(∩), ^fi(c), fi(∅), fi(U^i)) under the restriction fi↾C(Ri) of the type-lowering retraction fi to complete subsets of U^i (i = 1, 2). Here, fi(∪), fi(∩), ^fi(c) denote the definitions of union, intersection and complementation natural to Frege sets, i.e., u1 fi(∘) u2 =df fi(gi(u1) ∘ gi(u2)) (∘ ∈ {∪, ∩}); u^fi(c) =df fi(U^i − gi(u)) (i = 1, 2). The subscript f1 on these operations is usually suppressed. We define “u1 ⊆i u2” to be “gi(u1) ⊆ gi(u2)” (i = 1, 2), i.e., the inclusion of Frege sets and relations is the partial ordering naturally associated with the Boolean algebras of Frege sets and relations, respectively. “⊆1” is abbreviated as simply “⊆”. As a corollary to (b), the unary PFS (U, R, f1, g1) is a reduct of the binary PFS (U, R, f1, g1, f2, g2) in the sense that the discriminative capacities of the latter are preserved by the former: (∀u)(x ∈ u ↔ y ∈ u) ⇔ (∀u, v)(u(x, v) ↔ u(y, v)) ∧ (∀u, v)(u(v, x) ↔ u(v, y)). Let a ∈ U; define the outer penumbra of a, symbolically ♦a, to be the F-set ∪[a]≡; dually, define the inner penumbra, □a, to be the F-set ∩[a]≡. These operations, called the penumbral modalities, are interpreted using David Lewis' counterpart semantics for modal logic [12]. An F-set b is called a counterpart of an F-set a iff a ≡ b. Then □a (♦a) represents the set of F-sets that belong to all (some) of a's counterparts. Thus “a ∈ □b” (“a ∈ ♦b”) means “a is an element of all (some) counterparts of b”. In this sense an F-set x necessarily (possibly) belongs to a just in case x belongs to □a (♦a). Penumbral modalities of relations are now defined: ♦2a =df f2(∪)[a]≡, □2a =df f2(∩)[a]≡. Let Plenitude be the conjunction of the principles that, for all a, b ∈ U,
1. □a ≡ a ≡ ♦a; further, a ⊆ b and a ≡ b entails that for all c ∈ U, a ⊆ c ⊆ b ⇒ c ≡ b, and
2. □2a ≡ a ≡ ♦2a; further, g2(a) ⊆ g2(b) and a ≡ b entails that for all c ∈ U, g2(a) ⊆ g2(c) ⊆ g2(b) ⇒ c ≡ b.
A BPFS satisfying Plenitude is called a plenum. Henceforth it is assumed that F is a plenum. Let a ∈ U and X ⊆ U be such that x ≡ a for all x ∈ X. Then ∪X, ∩X ≡ a. ([a]≡, ∩, ∪, □a, ♦a) is a complete lattice with least (greatest) element □a (♦a). Call (U, ∪, ∩, ^c, U, ∅, □, ♦) the penumbral algebra of F. It is a normal, extensive, idempotent modal algebra satisfying all the S5 axiom schemes except for “linearity” □(a ∩ b) = □a ∩ □b (= additivity and monotonicity); in addition ♦□u ⊆ □♦u
(u ∈ U). The penumbral algebra of a plenum is an example of an abstract approximation space [5], [18]. a is penumbrally open (closed) iff □a = a (♦a = a). The principle that disjoint nonempty sets are discernible is a trivial corollary of the axiom of extensionality in axiomatic set theory, but not in F. F validates the principle of the Discernibility of the Disjoint (DoD) iff:

x fi(∩) y = ∅ ⇒ ¬x ≡ y   (x, y ∈ U, x, y ≠ ∅, i = 1, 2)
In the sequel we assume that the plenum F satisfies DoD.
3 The Theory of Pairs

The Theory of Pairs [17] is an equivalent of Peano Arithmetic convenient for reasoning about lists and LISP programs. The first two axioms of TP give the theory of 0 (the null list) and the ordered pairing operation ⟨·, ·⟩, which may be thought of as a generalized successor function:

TP1∗: ∀xy ¬(0 ≡ ⟨x, y⟩)
TP2∗: ∀x1x2y1y2(⟨x1, x2⟩ ≡ ⟨y1, y2⟩ → x1 ≡ y1 ∧ x2 ≡ y2).

These axioms are interpreted over the domain P =df {x : (∀z)(IND(z) → x ∈ z)} of pairs, where IND(x) (read “x is inductive”) =df 0 ∈ x ∧ (∀z1z2)(z1 ∈ x ∧ z2 ∈ x → ⟨z1, z2⟩ ∈ x) and 0 =df ∅. 0 is discernible from any pair of the form ⟨p, q⟩. Two pairs ⟨p, q⟩ and ⟨p′, q′⟩ are indiscernible iff p is indiscernible from p′ and q is indiscernible from q′. We write ⟨p, q, r⟩ as an abbreviation for ⟨⟨p, q⟩, r⟩ and in general ⟨p1, . . . , pk⟩ for ⟨⟨p1, . . . , pk−1⟩, pk⟩. Every pair is either 0 or uniquely of the form ⟨0, p1, p2, . . . , pk⟩ for k ≥ 1; thus, the terms “pair” and “list” may be used interchangeably. The pairs pi for 1 ≤ i ≤ k are said to be elements (members) of the above list. In order to represent k-ary relations for k ≥ 3, we need to define k-ary membership and abstraction. Let k ≥ 3, A ⊆ U^k and xk, x = x1, . . . , xk−1 be distinct variables. Inductively, {x, xk : A} =df {v, xk : v ∈ {x : A}}; F-sets of this form are called k-ary F-sets. Finally, (t1, . . . , tk) ∈ {x1 . . . xk : A} =df (⟨t1, . . . , tk−1⟩, tk) ∈ {x, xk : A}.

Theorem 1. (k-ary PNC) For each k ≥ 3, t = t1, . . . , tk ∈ P and x = x1, . . . , xk, F ⊨ (t) ∈ {x : A} ↔ A(t).

Let k ≥ 1 and x = x1, . . . , xk. A (k+1)-ary F-set {x, xk+1 : A} is a (k-ary) F-function iff s ≡ t and both (s, sk+1), (t, tk+1) ∈ {x, xk+1 : A} entails sk+1 ≡ tk+1 (t = t1, . . . , tk, s = s1, . . . , sk, tk+1, sk+1 ∈ P). The family of F-functions contains the constant 0 function {x : x ≡ 0}, the pairing function {x, y, z : ⟨x, y⟩ ≡ z} and, for each k ≥ 1 and i, 1 ≤ i ≤ k, the k-ary “projection” F-function {x, y : xi ≡ y}. It is closed under the composition of F-functions. As a generalization of the axioms governing the recursive definition of addition and multiplication in first order arithmetic, TP has a three-part scheme defining a (k+1)-ary relation symbol Rf for each k-ary primitive recursive function f : P^k → P of pairs. Let f : P^k → P be given by f(0, p) = g(p) and f(⟨q1, q2⟩, p) = h(q1, q2, f(q1, p), f(q2, p), p), where g : P^(k−1) → P, h : P^(k+3) → P
are primitive recursive and p = p2, . . . , pk. In the sequel one defines k-, (k−1)-, (k+3)-ary F-functions tf, tg, th representing f, g, h respectively (these are (k+1)-, k-, (k+4)-ary F-sets). Then, TP has schemes TP3–TP5, translated by the following first order statements about F:

TP3∗ (∀x, y ∈ P)[tf(0, x, y) ↔ tg(x, y)]
TP4∗ (∀v1v2, x, y ∈ P)[tf(⟨v1, v2⟩, x, y) ↔ (∃u1u2 ∈ P)[tf(v1, x, u1) ∧ tf(v2, x, u2) ∧ th(v1, v2, u1, u2, x, y)]]
TP5∗ (∀x1, x ∈ P)(∃!y ∈ P)[tf(x1, x, y)]

In addition, TP1 (first order TP) has a first order axiom scheme of induction TP6, translated by the first order scheme TP6∗: X(0) ∧ ∀x1x2[X(x1) ∧ X(x2) → X(⟨x1, x2⟩)] → ∀y[X(y)] for all complete subsets X of P. TP2 has a second order axiom of induction TP7, translated by a first order axiom TP7∗: ∀x[IND(x) → (∀y ∈ P)[y ∈ x]].

Theorem 2. F validates TP1∗, TP2∗, TP6∗, TP7∗ as well as IND(P).

What about TP3∗, TP4∗ and TP5∗? As Jamie Andrews has shown⁶, binary abstraction induces a relational analogue of the Y-combinator of the type-free λ calculus, as in [3]. Under the Y-combinator definition of tf, F validates upper and lower approximations of these schemes. Define upper and lower fixed points of A = A(x, y) in y by:

d̄yA =df {xy : A(x, {x : (x, y) ∈ ♦y})};  fix̄yA ≡ {x : (x, d̄yA) ∈ ♦d̄yA};
d̲yA =df {xy : A(x, {x : (x, y) ∈ □y})};  fix̲yA ≡ {x : (x, d̲yA) ∈ □d̲yA}.

Theorem 3. (Upper and Lower Second Recursion) Let A ⊆ P^k be complete and t = t1, . . . , tk be F-sets. Then F ⊨ A(t, fix̄yA) → (t) ∈ fix̄yA, and (t) ∈ fix̲yA → A(t, fix̲yA).

Let k ≥ 1, R ⊆ P^k and t ∈ U be a k-ary F-set. t enumerates R iff R = {(p1, . . . , pk) | F ⊨ (p1, . . . , pk) ∈ t}. R is enumerable iff there is an F-set that enumerates it. R is representable iff R and its complement P^k − R are enumerable. To show that the graph of each primitive recursive function over P is representable in F, whence by Kleene's result the graph of every partial recursive function of pairs is enumerable in F, one associates with each primitive recursive (p.r.) function f : P^k → P a pair (t̄f, t̲f) of (k+1)-ary F-sets by induction on the formation of f. For initial functions, suppose f is the zero function λp.0; then t̄f = t̲f = {x, y : y ≡ 0}. Suppose f is the pairing function λpq.⟨p, q⟩; then t̄f = t̲f is {x, y, z : ⟨x, y⟩ ≡ z}. Suppose f is the projection function λp1 . . . pk.pi for some 1 ≤ i ≤ k; then t̄f = t̲f is {x1 . . . xky : xi ≡ y}. Assume f : P^k → P is the composition of h : P^n → P and gi : P^k → P and suppose h and the gi are associated with th and tgi (1 ≤ i ≤ n). Then, t̄f = t̲f is the corresponding composition of F-functions tg1, . . . , tgn. Suppose f : P^k → P is such that f(0, x) = g(x) and f(⟨y1, y2⟩, x) = h(y1, y2, f(y1, x), f(y2, x), x), where g : P^(k−1) → P and h : P^(k+3) → P are p.r. functions associated with tg and th and x = x2, . . . , xk.
⁶ In private communication in 1991.
Then t̄f, the upper approximation of f, is fix̄z{x, y, z : A} where A = A(x1, x, y, z) ⊆ P^(k+2) is

{x1, x, y, z | x1 ≡ 0 ∧ (x, y) ∈ tg ∨ (∃y1y2 ∈ P)[(v1, x, y1), (v2, x, y2) ∈ z ∧ x1 ∈ {v1v2 : (v1, v2, y1, y2, x, y) ∈ th}]}.

t̲f, the lower approximation of f, is fix̲z{x1, x, y, z : A}. An F-function is said to be primitive recursive if it is of the form t̄f for some k-ary p.r. function f of pairs.

Theorem 4. Let f : P^k → P be p.r. Then for all p1, . . . , pk+1 ∈ P, F ⊨ t̄f(p1, . . . , pk, f(p1, . . . , pk)), and t̲f(p1, . . . , pk, pk+1) → pk+1 ≡ f(p1, . . . , pk).

Proof. By induction on the formation history of f, with an inner induction on pairs p1 in the case where f is obtained by the scheme of primitive recursion.

Thus, while t̄f enumerates the graph of f, t̲f enumerates its complement P^(k+1) − f. In this way the pair (t̄f, t̲f) represents all p.r. functions f.

Corollary 1. All primitive recursive functions of pairs have graphs which are representable.

Corollary 2. (Kleene) All partial recursive functions of pairs have graphs which are enumerable.

Thus, F models a computationally complete programming system, i.e., one which computes all partial recursive functions. Its bounded-quantifier fragment comprises a subtheory capable of computing all primitive recursive functions. But how powerful is the first order theory of F in reasoning about the functions its computational subtheory represents? The question is, how much more of TP may we translate into the first order theory of F?

Theorem 5. The following first order sentences, serving as upper and lower approximations to TP3, TP4 and TP5, are valid on F:

T̄P3 (∀x, y ∈ P)[tg(x, y) → t̄f(0, x, y)]
T̲P3 (∀x, y ∈ P)[t̲f(0, x, y) → tg(x, y)]
T̄P4 (∀v1v2, x, y ∈ P)[(∃u1u2 ∈ P)[t̄f(v1, x, u1) ∧ t̄f(v2, x, u2) ∧ th(v1, v2, u1, u2, x, y)] → t̄f(⟨v1, v2⟩, x, y)]
T̲P4 (∀v1v2, x, y ∈ P)[t̲f(⟨v1, v2⟩, x, y) → (∃u1u2 ∈ P)[t̲f(v1, x, u1) ∧ t̲f(v2, x, u2) ∧ th(v1, v2, u1, u2, x, y)]]
T̄P5 (∀x1, x ∈ P)(∃y ∈ P)[t̄f(x1, x, y)]
T̲P5 (∀x1x2, x, y1, y2 ∈ P)[t̲f(x1, x, y1) ∧ t̲f(x2, x, y2) → y1 ≡ y2].

Corollary 3. Suppose all primitive recursive F-functions are penumbrally clopen. Then F ⊨ (TP2)∗.
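As a side illustration (ours, not from the paper), the scheme of primitive recursion over pairs that t̄f and t̲f approximate can be read operationally, with 0 modelled as None and a pair ⟨q1, q2⟩ as a Python tuple; g and h are the parameters of the scheme, exactly as in the definition of f above.

def prim_rec(g, h):
    # f(0, x) = g(x);  f(<q1,q2>, x) = h(q1, q2, f(q1,x), f(q2,x), x)
    def f(p, *x):
        if p is None:                     # the null list 0
            return g(*x)
        q1, q2 = p                        # p = <q1, q2>
        return h(q1, q2, f(q1, *x), f(q2, *x), *x)
    return f

# Example: count the occurrences of 0 in a pair viewed as a binary tree.
zeros = prim_rec(lambda: 1, lambda q1, q2, r1, r2: r1 + r2)
print(zeros(None))                        # 1
print(zeros((None, (None, None))))        # 3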
4
Conclusion
Techniques from rough set theory yield a consistent, classical type-free programming system in which sets and functions of lists may be computed and reasoned about classically by the method of upper and lower recursion. Its bounded-quantifier fragment
comprises a computational subtheory representing all p.r. functions via pairs (t̄f, t̲f) of upper and lower approximating F-sets, the former enumerating f, the latter, the complement of f, thus implementing classical negation for p.r. queries. Rough set theory thus points the way out of the thorny problem of “negation by failure” that troubles the field of logic programming to this day.
References
1. Andrews, James H. [2002]: “A Weakly-Typed Higher Order Logic with General Lambda Terms and Y Combinator,” Proceedings, Works In Progress Track, 15th International Conference on Theorem Proving in Higher Order Logics (TPHOLs '02), Hampton Roads, Virginia, August 2002, 1–11, NASA Conference Publication CP-2002-211736.
2. Apostoli, P. [2000]: “The Analytic Conception of Truth and The Foundations of Arithmetic,” J. of Symbolic Logic, 65 (1), 33–102.
3. Apostoli, P. and A. Kanda [2000]: “Approximation spaces of type-free sets.” RSCTC '00, eds. W. Ziarko, Y. Y. Yao. LNAI volume 2005, eds. J. G. Carbonell, J. Siekmann, Springer-Verlag, 98–105.
4. Bell, J. L. [2000]: “Sets and Classes as Many,” J. of Philosophical Logic, 29 (6), 595–681.
5. Cattaneo, G. [1998]: “Abstract Approximation Spaces for Rough Theories,” Rough Sets in Knowledge Discovery: Methodology and Applications, eds. L. Polkowski, A. Skowron. Studies in Fuzziness and Soft Computing, ed. J. Kacprzyk, vol. 18, Springer.
6. Church, A. [1941]: The Calculi of λ-conversion, Annals of Mathematics Studies 6, Princeton University Press, Princeton.
7. Feferman, S. [1984]: “Towards useful type-free theories I,” J. of Symbolic Logic, 49, 75–111.
8. Gilmore, P.C. [1986]: “Natural Deduction Based Set Theories: A New Resolution of the Old Paradoxes,” J. of Symbolic Logic, 51, 394–411.
9. Gilmore, P.C. [1997]: “NaDSyL and some applications,” Kurt Gödel Colloquium, LNCS volume 1289, 153–166, Vienna.
10. Gilmore, P.C. [2001]: “An intensional type-theory: Motivation and cut-elimination,” J. of Symbolic Logic, 66 (1), 383–400, March 2001.
11. Hermes, H. [1965]: Enumerability, Decidability, Computability. Berlin: Springer-Verlag.
12. Lewis, D. [1968]: “Counterpart Theory and Quantified Modal Logic,” J. of Philosophy 65, 113–126.
13. McCarthy, J. [1960]: “Recursive Functions of Symbolic Expressions,” CACM 3.
14. Pawlak, Z. [1982]: “Rough Sets,” International Journal of Computer and Information Sciences, 11, 341–350.
15. Prawitz, D. [1965]: Natural Deduction, Stockholm: Almqvist & Wiksell.
16. Sato, M. [1983]: “Theory of Symbolic Expressions I,” Theoretical Computer Science, 22, 19–55.
17. Voda, P. [1984]: “Theory of Pairs, Part I, Provably Recursive Functions,” Technical Report 84–25 of the Dept. of Computer Science, Univ. of British Columbia.
18. Yao, Y. Y. [1998b]: “On Generalizing Pawlak Approximation Operators,” eds. L. Polkowski and A. Skowron, RSCTC'98, LNAI 1414, eds. J. G. Carbonell, J. Siekmann, Springer-Verlag, 289–307.
Adaptive Granular Control of an HVDC System: A Rough Set Approach J.F. Peters, H. Feng, and S. Ramanna Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6 Canada {jfpeters, fengh, ramanna}@ee.umanitoba.ca
Abstract. This article reports the results of a three-year study of adaptive granular control of High-Voltage Direct Current (HVDC) systems using a combination of rough sets and granular computing techniques. A proportional integral (PI) control strategy is commonly used for constant current and extinction angle control in an HVDC system. A PI control strategy is based on a static design where the gains of a PI controller are fixed. Since the response of an HVDC plant dynamically changes with variations in the operating point, a PI controller's performance is far from optimal. By contrast, an adaptive controller makes changes in the gains relative to the observed changes in HVDC system behavior. However, adaptive controllers require for their design a frequency domain model of the controlled plant. Due to the non-linear operation of the HVDC system, such a model is difficult to establish. Because rough set theory makes it possible to set up a decision-making utility that approximates a control engineer's knowledge about how to tune the controller of a system to improve its behavior, rough sets can be used to design an adaptive controller for the HVDC system. The contribution of this paper is the presentation of the design of a rough set based, granular control scheme. Experimental results that compare the performance of the adaptive control and PI control schemes are also given. Keywords: Adaptive control, granular computing, HVDC, rough sets.
1
Introduction
The decision-systems approach in the design of control systems has been reported relative to a rapidly growing range of control problems [2,3,4,7,8,9,10,12,13,14, 15,17,23,24,25,26]. This is especially true in various niches of control engineering where classical approaches have been fraught with difficulties or have proved inherently weak. Considerable research in the application of rough set theory [11] as well as various forms of fuzzy set theory [16,25] and granular
Corresponding author: J.F. Peters, Department of Electrical and Computer Engineering, University of Manitoba, 15 Gillson Street, ENGR 504,Winnipeg, Manitoba R3T 5V6 Canada, [email protected]
computing [19,20,21,22,26] have been used in promising designs of various control systems [3,4,7,8,9,14,15,17]. In this paper, a combination of rough set theory and granular computing has been used in the design of a tuner in an adaptive control strategy for current control in a High-Voltage Direct Current (HVDC) system. This research extends earlier work on a fuzzy controller for HVDC systems [4]. In the earlier work, controllers for both constant current and extinction angle were reported. The adaptive granular control strategy presented here is limited to constant current control in an HVDC system. This control strategy can be easily extended to extinction angle control in an HVDC system, but this is considered outside the scope of this paper. The contribution of this paper is the presentation of the design of a rough set based granular control scheme. This paper is organized as follows. Section 2 presents an overview of the CIGRE benchmark model used to characterize the HVDC system. Preprocessing issues of typical HVDC system responses are discussed in section 3. A tuning and adaptation algorithm is given in section 3. Experimental results, which include the comparison of the adaptive granular system and the PI control system, are given in section 4.
2
HVDC Transmission
An HVDC transmission system has advantages over alternating current (ac) transmission, especially in terms of system efficiency and stable operation. In this paper, the CIGRE benchmark model has been used to characterize the HVDC system [1,5]. This model has a two-terminal direct current (dc) scheme shown in Fig. 1. Both converter substations (rectifier and inverter) are provided with
Fig. 1. CIGRE Benchmark Model
a current control loop including a current measuring device, a current controller and firing control equipment. Usually, one of the converters is current controlled, and the other operates at constant extinction angle. The direct voltage at any point on the line and the current (or power) can be controlled by controlling the internal voltages. This is accomplished by grid/gate control of the valve ignition
Fig. 2. Simple PI dc Current Controller
angle or control of the ac voltage through tap changing of a converter transformer. The focus in this paper is on gate control, which is rapid (1 to 10 ms). The gate control system is basically a feed-back loop that adjusts the output according to the input, where the input is the error between the measured and ordered values, and the output is the ignition angle needed to maintain a constant current or constant extinction angle. In a typical HVDC control system, a proportional plus integral regulator (PI controller) is widely used (see Fig. 2). A PI control scheme utilizes two parameters (integral gain Ki and proportional gain Kp) to process inputs and to obtain the firing angle α. Since Ki and Kp are predetermined parameters with fixed values, a PI controller is in fact a time-invariant system. However, an HVDC system operates in a time-varying environment, either in the form of a plant with changing parameters, input signals and disturbances, and/or changing performance objectives. Since many encountered changes are not predictable, an optimum pre-programmed time-varying controller is not possible. To satisfy the need to adjust the parameters in a PI controller, an adaptive granular control scheme is introduced (see Fig. 3). Briefly, the system clock is used to monitor,
Fig. 3. Adaptive Control System Model
collect and granulate overshoot, settling time and rise time of system responses relative to a preset objective. The result of data granulation is a collection of experimental values that are used to select optimum control rules, one for Kp and another for Ki. The condition attributes in a selected rule will have the attribute values that are closest to the experimental values in each constructed
information granule. The selection of a control rule provides the percent change needed in Kp and Ki control parameters to achieve optimal HVDC behavior.
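The following sketch (ours; the thresholds, the rule table and the response-measurement interface are illustrative placeholders, not the rules actually mined in this study) shows one way to read the loop of Fig. 3: a discrete PI step plus a rule lookup over granulated response features.

class AdaptivePI:
    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt, self.integral = kp, ki, dt, 0.0

    def step(self, error):
        # Discrete PI law producing the firing-angle order.
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral

    def adapt(self, rise, settle, overshoot):
        # Granulate the measured step response, then look up percent gain changes.
        granule = ("slow" if rise > 0.05 else "fast",
                   "long" if settle > 0.05 else "short",
                   "high" if overshoot > 20.0 else "low")
        rules = {  # condition granule -> (% change in Kp, % change in Ki)
            ("fast", "long", "high"): (-10.0, +5.0),
            ("slow", "short", "low"): (+5.0, 0.0),
        }
        dkp, dki = rules.get(granule, (0.0, 0.0))
        self.kp *= 1.0 + dkp / 100.0
        self.ki *= 1.0 + dki / 100.0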
3
Experimental Results
In a system with a classical PI controller, the step responses are almost identical to each other every time the same system step change occurs. In other words, the plot of system responses will not change its shape regardless of how many times the same system step change has occurred. However, in a system with the rough control scheme, every occurrence of a system step change will lead to a change in the control parameters Kp and Ki, and these changes will be ready to be used at the instant when the same system step change recurs.
3.1
Evolution of Step Responses in Adaptive Granular Control Scheme
The evolution of step responses to the same system change Io = 1.0 is shown in Fig. 4. It can be observed that an adaptive granular control scheme results in improved step response to system change.
Fig. 4. Evolution of System Responses
Table 1. Difference in Step Responses During Adaptation of Kp, Ki (Io = 1.0)

Adaptation step | Rise time (tr) | Settle time (ts) | Overshoot (ov)
1 | 0.0411 | 0.0847 | 43.4092
2 | 0.0411 | 0.0530 | 22.4579
3 | 0.0412 | 0.0294 | 8.5659
4 | 0.0412 | 0.0097 | 6.6472
3.2
Comparison of Adaptive Granular and PI Control Schemes
A comparison of the traditional PI controller and the new adaptive granular controller for an HVDC transmission system is briefly presented in this section. A simulation of the responses of the PI and rough control systems, where the system change Io may take any one of the values Io ∈ {0.8, 0.9, 1.0, 1.1, 1.2}, is shown in Fig. 5. With a traditional PI controller, step responses may vary from each other on the arrival of different system step changes. In a rough control system with an adaptive PI controller, there is a learning-training process required to reach a steady state (i.e., learning entails repeated adjustment of the percent changes in Kp and Ki before an optimal configuration of the PI controller is found).
Fig. 5. Controller Performance when Ki, Kp = 0.1
In Fig. 5, only the plots for the Io = 0.8 and Io = 0.9 cases are shown for Ki and Kp = 0.1 (plots for the remaining Io cases are similar). Table 2 summarizes all of the results for each of the control schemes. In each case, the settling time and overshoot values for the adaptive granular control scheme are significantly better than those of the PI control scheme. A comparison of the two control schemes for Kp, Ki = 1.0 is shown in Fig. 6. Table 2. Comparison of Two Control Schemes
Classical PI / Adaptive Granular Control (Ki, Kp = 0.1, 0.1)

Io | Rise time (tr) PI / Adaptive | Settle time (ts) PI / Adaptive | Overshoot (ov) PI / Adaptive
0.8 | 0.0331 / 0.0327 | 0.2620 / 0.0110 | 147.3435 / 6.2884
0.9 | 0.0367 / 0.0366 | 0.4619 / 0.0110 | 128.5544 / 7.1543
1.0 | 0.0414 / 0.0414 | 0.3699 / 0.0097 | 112.6934 / 6.6472
1.1 | 0.0455 / 0.0450 | 0.4431 / 0.0044 | 100.8750 / 4.1055
1.2 | 0.0487 / 0.0489 | 0.5925 / 0.0120 | 89.7730 / 4.6735
Again in Fig. 6, only the plots for the Io = 0.8 and Io = 0.9 cases are shown for Ki and Kp = 1.0 (plots for the remaining Io cases are similar). Table 3 summarizes all of the results for each of the control schemes. The rise
Fig. 6. Controller Performance when Ki, Kp = 1.0
times for each of the control schemes are essentially the same. In each case, the settling time and overshoot values for the adaptive granular control scheme are again significantly better than those of the PI control scheme. All experiments shown in this section have been carried out using a Matlab 6 Simulink model of the HVDC system. Table 3. Comparison of Two Control Schemes
Classical PI / Adaptive Granular Control (Ki, Kp = 1.0, 1.0)

Io | Rise time (tr) PI / Adaptive | Settle time (ts) PI / Adaptive | Overshoot (ov) PI / Adaptive
0.8 | 0.0328 / 0.0327 | 0.0825 / 0.0110 | 57.7295 / 6.2884
0.9 | 0.0367 / 0.0366 | 0.0852 / 0.0110 | 49.0133 / 7.1543
1.0 | 0.0411 / 0.0414 | 0.0847 / 0.0097 | 43.4092 / 6.6472
1.1 | 0.0449 / 0.0450 | 0.0872 / 0.0044 | 38.1051 / 4.1055
1.2 | 0.0489 / 0.0489 | 0.0897 / 0.0120 | 34.4975 / 4.6735
Classical PI / Adaptive Granular Control ( Ki, Kp=1.0, 1.0 ) Rise time(tr) Settle time (ts) Overshoot (ov) PI Adaptive PI Adaptive PI Adaptive 0.0328 0.0327 0.0825 0.0110 57.7295 6.2884 0.0367 0.0366 0.0852 0.0110 49.0133 7.1543 0.0411 0.0414 0.0847 0.0097 43.4092 6.6472 0.0449 0.0450 0.0872 0.0044 38.1051 4.1055 0.0489 0.0489 0.0897 0.0120 34.4975 4.6735
Conclusion
This article has presented an adaptive granular control scheme for HVDC power transmission systems. This control scheme is in reality an application of two prominent technologies in computational intelligence, namely, rough sets and granular computing. The resulting control strategy illustrates the advantages provided by rough set theory and information granulation in designing a decision system useful in control systems that utilize classical PI, PD or PID control
paradigms. A classical PI controller typically used in HVDC systems relies on judicious choices of values for the proportional and integral coefficients. Over time, a human operator can adjust these coefficients as needed. The insights of a control engineer, codified in the rules that underlie the design of the adaptive control strategy, are presented in this article. There is also a subtle element in the proposed adaptive granular control scheme that deserves mentioning, namely, the implicit learning algorithm that is incorporated into the new control scheme. That is, the proposed control scheme learns (and remembers) to make appropriate adjustments to the coefficients of a classical controller. Future work in this project will include the design of an adaptive granular controller used in the selection of extinction angles for an HVDC system. In addition, it has already been found that discretization leads to a more general set of control rules. Further work is needed in the design of adaptive control strategies that utilize discretization. Acknowledgements. The research of James Peters and He Feng has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) research grant 185986 and grants from Manitoba Hydro. The research of Sheela Ramanna has been supported by research grant 194376 from NSERC. The authors gratefully acknowledge the work by Maciej Borkowski in preparing the LaTeX version of this article.
References
1. J. Arrillaga, High Voltage Direct Current Transmission, Peter Peregrinus, London, 1983.
2. E. Czogala, A. Mrozek, Z. Pawlak, The idea of a rough fuzzy controller and its application in the stabilization of a pendulum-car system, Fuzzy Sets and Systems 72, 1995, 61–63.
3. H. Feng, Adaptive Granular Control for a HVDC System, M.Sc. Thesis, supervisor: J.F. Peters, Department of Electrical and Computer Engineering, Dec. 2002.
4. A.M. Gole, A. Daneshpooy, D.G. Chapman, J.B. Davies, Fuzzy logic control for HVDC transmission, IEEE Winter Meeting, New York, 1997.
5. E.W. Kimbark, Direct Current Transmission, Wiley, London, 1971.
6. T.Y. Lin, N. Cercone (Eds.), Rough Sets and Data Mining: Analysis of Imprecise Data. Kluwer Academic Publishers, Dordrecht, 1997.
7. A. Mrozek, L. Plonka, R. Winiarczyk, J. Majtan, Rough sets for controller synthesis. In: T.Y. Lin (Ed.), Proc. of the Third Int. Workshop on Rough Sets and Soft Computing (RSSC'94), San Jose, California, 10–12 November 1994, 498–505.
8. A. Mrozek, L. Plonka, J. Kedziera, The methodology of rough controller synthesis. In: Proc. 5th IEEE Int. Congress on Fuzzy Systems (FUZZ-IEEE'96), New Orleans, 8–11 September 1996, 1135–1139.
9. T. Munakata, Rough control: A perspective. In: [6], 77–90.
10. Z. Pawlak, Rough real functions and rough controllers. In: [6], 139–148.
11. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Boston, MA, Kluwer Academic Publishers, 1991.
12. J.F. Peters, V. Degtyaryov, M. Borkowski, S. Ramanna, Line-crawling robot navigation: Rough neurocomputing approach. In: C. Zhou, D. Maravall, D. Ruan, Fusion of Soft Computing and Hard Computing for Autonomous Robotic Systems. Berlin: Physica-Verlag, 2002.
13. J.F. Peters, S. Ramanna, A. Skowron, M. Borkowski: Wireless agent guidance of remote mobile robots: Rough integral approach to sensor signal analysis. In: N. Zhong, Y.Y. Yao, J. Liu, S. Ohsuga (Eds.), Web Intelligence, Lecture Notes in Artificial Intelligence 2198. Berlin: Springer-Verlag, 2001, 413–422.
14. J.F. Peters, A. Skowron, Z. Suraj, An application of rough set methods to automatic concurrent control design. Fundamenta Informaticae, vol. 43, nos. 1–4, 2000, 269–290.
15. J.F. Peters, K. Ziaei, S. Ramanna, Approximate time rough control: Concepts and application to satellite attitude control. In: L. Polkowski and A. Skowron (Eds.), Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence, vol. 1424. Berlin: Springer-Verlag, 1998, 491–498.
16. J.F. Peters, W. Pedrycz, Computational Intelligence. In: J.G. Webster (Ed.), Encyclopedia of Electrical and Electronic Engineering, 22 vols. NY: John Wiley & Sons, Inc., 1999.
17. J.F. Peters, S. Ramanna, Framework for approximate time rough control systems: An integrated fuzzy sets-rough sets approach. In: Proc. 7th Int. Symposium on Artificial Intelligence in Real-Time Control (AIRTC98), Grand Canyon, Arizona, 5–8 October 1998, 1–8.
18. A. Skowron, Toward intelligent systems: Calculi of information granules. In: S. Hirano, M. Inuiguchi, S. Tsumoto (Eds.), Bulletin of the International Rough Set Society, vol. 5, no. 1/2, 2001, 9–30.
19. A. Skowron, J. Stepaniuk, S. Tsumoto, Information granules for spatial reasoning, Bulletin of the International Rough Set Society, vol. 3, no. 4, 1999, 147–154.
20. A. Skowron, J. Stepaniuk, J.F. Peters, Extracting patterns using information granules. In: S. Hirano, M. Inuiguchi, S. Tsumoto (Eds.), Bulletin of the International Rough Set Society, vol. 5, no. 1/2, 2001, 135–142.
21. A. Skowron, J. Stepaniuk, Information Granules: Towards foundations of granular computing, International Journal of Intelligent Systems, vol. 16, no. 1, Jan. 2001, 57–104.
22. A. Skowron, J. Stepaniuk, J.F. Peters, Hierarchy of information granules. In: H.D. Burkhard, L. Czaja, H.S. Nguyen, P. Starke (Eds.), Proc. of the Workshop on Concurrency, Specification and Programming, Oct. 2001, Warsaw, Poland, 254–268.
23. T. Furuhashi, H. Yamamoto, J.F. Peters, W. Pedrycz, A stability analysis of fuzzy control systems using a generalized fuzzy Petri net model, International Journal of Advanced Computational Intelligence 3(2), 1999, 99–106.
24. J.J. Alpigini, J.F. Peters, Dynamic visualization with rough performance maps. In: W. Ziarko, Y. Yao (Eds.), Rough Sets and Current Trends in Computing (RSCTC'00), Lecture Notes in Artificial Intelligence 2005. Berlin: Springer-Verlag, 2001, 90–97.
25. L.A. Zadeh, Fuzzy sets, Information Control 8, 1965, 338–353.
26. L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90(2), 1997, 111–128.
Rough Set Approach to Domain Knowledge Approximation Tuan Trung Nguyen and Andrzej Skowron Warsaw University ul. Banacha 2, Warsaw, Poland {skowron,nttrung}@mimuw.edu.pl
Abstract. Classification systems working on large feature spaces, despite extensive learning, often perform poorly on a group of atypical samples. The problem can be dealt with by incorporating domain knowledge about the samples being recognized into the learning process. We present a method that allows us to perform this task within a rough approximation framework. We show how a human expert's domain knowledge expressed in natural language can be approximately translated by a machine learning recognition system. We present in detail how the method performs on a system recognizing handwritten digits from a large digit database. Our approach is an extension of ideas developed in the rough mereology theory. Keywords: Rough mereology, concept approximation, domain knowledge approximation, machine learning, handwritten digit recognition.
1
Introduction
Several decades of research on handwritten digit recognition have yielded significant success, with many efficient algorithms and recognition systems that provide impressive results. Extensive experiments however have shown that no matter how efficient and sophisticated the techniques employed are, a considerable set of samples remains difficult to deal with. This problem is the subject of a recently growing trend in the community which stresses the special treatment of difficult areas in the feature space. We present a scheme for incorporating domain knowledge about handwritten digit samples into the learning process. The knowledge is provided by a hypothetical expert that will interact with the classification system during a later phase of the learning process, providing certain “guidance” in the difficult task of adaptive searching for correct classifiers. In distinction to most popular domain knowledge based approaches widely used in recognition systems, ours concentrates on specific difficult, error-prone samples encountered during the learning phase. The expert will pass the correct classification of such cases to the system along with his explanation of how he made the decision on the class identity of the sample. The system then incorporates this knowledge, using its own descriptive language and primitives, to rebuild its classifiers.
In this paper, we describe the expert's knowledge representation scheme, based on the rough mereology approach to concept approximation, as well as the mechanism of interaction between the expert and the classifier construction system presented in [6]. It is easy to observe that the approach is not limited to handwritten digits, but can be readily used in classifying other structured objects.
Fig. 1. General Outline (the expert's perception is transferred by learning to the classifier construction system, where the knowledge is incorporated)
2
Adaptive Recognition System
The development of OCR in general and handwritten digit recognition in particular over the years has yielded many highly effective description models for the analysis of digit images. For the research in this paper we have chosen the Enhanced Loci coding scheme, which assigns to every pixel of an image a code reflecting the topology of its neighborhood. The Enhanced Loci algorithm, though simple, has proved to be very successful in digit recognition. For a detailed description of the Loci coding scheme, see [2]. Once the Loci coding is done, the digit image is segmented into regions consisting of pixels with the same code value. These regions then serve as primitives to build the graph representation of the image.
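As a rough illustration only (an assumption of ours; the Enhanced Loci algorithm of [2] is considerably richer), a white pixel can be coded by the directions in which it is bounded by black pixels, so that, e.g., the code NES marks pixels open only to the West:

def loci_code(img, r, c):
    # img: 2D list of 0/1 pixels (1 = black); returns e.g. "NES" for white pixel (r, c).
    rows, cols = len(img), len(img[0])
    code = ""
    if any(img[i][c] for i in range(r)):            code += "N"
    if any(img[r][j] for j in range(c + 1, cols)):  code += "E"
    if any(img[i][c] for i in range(r + 1, rows)):  code += "S"
    if any(img[r][j] for j in range(c)):            code += "W"
    return code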
Fig. 2. Graph Model based on Loci Encoding (nodes labelled by directional codes such as NW, NE, SW, SE)
3
“Hard” Samples Detection
A typical digit recognition system classifies an unknown sample by computing its “distance” or “similarity” to a collection of prototypes established during the training phase. The most popular, the k-nearest neighbor method, assigns the sample to the class represented by the majority of its k nearest samples. This and other traditional methods perform this task on a uniform basis irrespective of the “difficulty” of the investigated sample, whereas it is obvious that not all samples are equally easy to classify. Samples that are far from the “centers” of the class prototypes tend to fall on the boundaries between classes, are more error-prone and hence can be regarded as more “difficult”. A straightforward criterion to detect such samples can be defined as follows: let PGk be the prototype graph set constructed for class k during the training phase and dk be the distance function established for that class. An unknown digit sample uk of class k is considered “difficult”, “hard” or “atypical” if:

dk(uk, PGk) ≥ ρ · max{dk(v, PGk) : v ∈ TR ∧ CLASS(v) = k}

where ρ ∈ (0, 1] is a cut-off threshold and TR is the training table. Alternatively, samples repeatedly misclassified during cross-validation tests in the training phase can as well be considered “difficult”. Since the class identity of all digit samples is known during the training phase, we can detect the “hard” ones beforehand and submit them to the expert for review.
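Operationally the criterion is a one-line test; in this sketch (ours), the distance function d and the prototype sets are assumed to come from the training phase described above.

def is_hard(sample, k, prototypes, d, training_set, rho=0.8):
    # training_set: iterable of (graph, class) pairs; rho in (0, 1] is the cut-off.
    radius = max(d(v, prototypes[k]) for v, cls in training_set if cls == k)
    return d(sample, prototypes[k]) >= rho * radius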
4
Passing Domain Knowledge to Classifiers
Now suppose that at some point during the training phase, we detect a number of samples of class k that are misclassified. The samples are submitted to the expert, who returns not only the correct class identity, but also an explanation of why, and perhaps more importantly, how he arrived at his decision. We will assume that the expert's explanation is expressed as a rule:

[CLASS(u) = k] ≡ ⊗(EFeature1(u), ..., EFeaturen(u))

where EFeaturei represents the expert's perception of some characteristic of the sample u, while the synthesis operator ⊗ represents his perception of some relations between these characteristics. It is assumed that the actual structure of EFeaturei might not be flat, but can be multi-layered, with various sub-concepts at subsequent levels of abstraction. For example, the expert may express his perception of the digit ‘5’ as (see Fig. 3):

[CLASS(u) = ‘5’] ≡ a, b, c, d are parts of u; “Above Right”(a, b); “HStroke”(a); b = Compose(c, d); “VStroke”(c); “WBelly”(d); “Above”(c, d) hold
where Compose is an assembling operator that produces a bigger part from smaller components. The above means that if there is a west-open belly below a vertical stroke and the two have a horizontal stroke above-right in the sample's image, then the sample is a ‘5’.
Fig. 3. Object Perception Provided by Experts
The main challenge here is that the expert's explanation is expressed in his own descriptive language, intrinsically related to his natural perception of images and often heavily based on natural language constructs (a foreign language Lf), while classifiers have a different language designed to, for example, facilitate the computation of physical characteristics of the images (a domestic language Ld). For example, the expert may view sample images as a collection of shapes or strokes (“A ‘6’ is something that has a neck connected with a circular belly”) while the recognition system regards the samples as graphs of Loci-based nodes. The knowledge passing process hence can be considered as an approximation of the expert's concepts by the classifier construction system. It is essential here that the concept matching should not be “crisp”, but expressed by some rough inclusion measures, determining whether something satisfies the concept to a certain degree [6]. For instance, a stroke at 85 degrees to the horizontal can still be regarded as a vertical stroke, though obviously with a degree less than 1.0. The extent of such variations may be provided by the expert (e.g., by providing samples that represent “extreme” instances of a given concept). Let us assume that such an inclusion measure is denoted by Match(p, C) ∈ [0, 1], where p is a pattern (or a set of patterns) encoded in Ld and C is a concept expressed in Lf. An example of a concept inclusion measure would be:

Match(p, C) = |{u ∈ T : Found(p, u) ∧ Fit(C, u)}| / |{u ∈ T : Fit(C, u)}|
where T is a common set of samples used by both the system and the expert to communicate with each other on the nature of the expert's concepts, Found(p, u) means that a pattern p is present in u and Fit(C, u) means that u is regarded by the expert as fitting his concept C.
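Computationally the measure is a simple ratio over the shared sample table T; a minimal sketch (ours), with Found and Fit assumed as black-box predicates supplied by the system and the expert respectively:

def match(pattern, concept, T, found, fit):
    fitting = [u for u in T if fit(concept, u)]
    if not fitting:
        return 0.0   # undefined when no sample fits the concept; 0 by convention here
    hits = [u for u in fitting if found(pattern, u)]
    return len(hits) / len(fitting)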
Our principal goal is, for each expert explanation, to find sets of patterns Pat, Pat1, ..., Patn and a relation ⊗d such that if (∀i : Match(Pati, EFeaturei) ≥ pi) ∧ (Pat = ⊗d(Pat1, ..., Patn)) then Quality(Pat) > α, where the pi (i ∈ {1, .., n}) and α are certain cut-off thresholds, while the Quality measure, intended to verify whether the target pattern Pat fits the expert's concept of digit class k, can be any, or a combination, of the following criteria:

Support_CLASS=k(Pat) = |{u ∈ U : Found(Pat, u) ∧ CLASS(u) = k}|
Match_CLASS=k(Pat) = Support(Pat) / |{u ∈ U : Found(Pat, u)}|
Coverage_CLASS=k(Pat) = Support(Pat) / |{u ∈ U : CLASS(u) = k}|
where U is the training set. In other words, we seek to translate the expert's knowledge into the domestic language so as to generalize the expert's reasoning to the largest possible number of physical digit samples. The requirements on inclusion degrees ensure the stability of the target reasoning scheme, as the target pattern Pat retains its quality regardless of deviations of the input patterns Pati, as long as they still approximate the expert's concepts EFeaturei to degrees at least pi. This may also be described as pattern robustness. Another important aspect of this process is its concept approximation robustness, meaning not only does it ensure that the target pattern Pat will retain its quality with regard to input pattern deviations in inclusion degrees, but it also guarantees that if we have some input patterns Pat′i equally “close” or “similar” to EFeaturei, then the target pattern Pat′ = ⊗d(Pat′1, ..., Pat′n) will meet the same quality requirements as Pat to a satisfactory degree. This leads to an approximation of EFeaturei that is independent from particular patterns Pati, allowing us to construct approximation schemes that focus on inclusion degrees pi rather than on specific input patterns Pati. One can observe that the main problem that arises here is how to establish the interaction between the expert, who reasons in Lf, and the classifier construction system, which uses Ld. Here, once again, the system has to learn (with the expert's help) “what he meant when he said what he said.” More precisely, the system will have to construct the measure Match and the relation ⊗d. In order to learn the measure Match, which essentially means we are trying to learn the expert's concept EFeaturei, we will ask the expert to examine a given set of samples U and provide a decision table (U, d), where d is the expert's decision whether EFeaturei is present in a particular sample from U, for instance, whether a sample has a “WBelly” or not. We then try to select a set of features in the system's domestic language that will approximate the decision d, for example, the number of pixels with the NES Loci code. For example:
u  | WBelly
u1 | yes
u2 | no
.. | ...
un | yes

⇒

u  | #NES | WBelly
u1 | 252  | yes
u2 | 4    | no
.. | ...  | ...
un | 90   | yes
In the above table, #NES is the number of white pixels that are bounded in all directions except to the West. It is assumed that the set U is not too large, to ensure the feasibility of acquiring the expert's answers, which facilitates this feature selection task. Experiments have shown that for popular features such as the presence of a circle or a straight stroke, sometimes it is enough to employ some simple greedy heuristics. For more complex patterns, one can use some efficient evolutionary strategies. Having approximated the concepts EFeaturei, we can try to translate the expert's relation ⊗ into our ⊗d by asking the expert to go through U and provide us with the additional attributes of whether he found the EFeaturei and a decision d stating whether the relation holds. We then replace the attributes corresponding to EFeaturei with the characteristic functions of the domestic feature sets that approximate those concepts and try to add other features, possibly induced from original domestic primitives, in order to approximate the decision d. Again, this task should be resolved by means of adaptive or evolutionary search strategies without too much computing burden. Here is an example of how the concept of a “vertical stroke” “above” a “west-open belly” would be approximated:
u  | VStroke | WBelly | Above
u1 | yes | yes | yes
u2 | yes | no  | no
.. | ... | ... | ...
un | yes | yes | no

⇒

u  | #VS | #NES | Sy < By | Above
u1 | 34  | 252  | yes | yes
u2 | 45  | 4    | no  | no
.. | ... | ...  | ... | ...
un | 40  | 150  | no  | no

⇓

u  | Match(#VS, VStroke) | Match(#NES, WBelly) | Match(Sy < By) | Match(Above)
u1 | 0.85 | 0.95 | (yes, 1.0) | (yes, 0.9)
u2 | 0.95 | 1.0  | (no, 0.1)  | (no, 0.05)
.. | ...  | ...  | ...        | ...
un | 0.90 | 0.70 | (no, 0.3)  | (no, 0.15)
In the above table, #VS is the number of black pixels having the Loci code characterizing a vertical stroke, and Sy < By tells whether the median center of the stroke is placed closer to the upper edge of the image than the median center of the belly. The third table shows the degrees of inclusion of these domestic features in the original expert's concepts “VStroke”, “WBelly” and “Above”, respectively.
It is noteworthy that the concept approximation process should work under a requirement on the quality of the searched global pattern Pat, which should have substantial support among the other samples from the training collection, not examined by the expert. This ensures that the knowledge passed by the expert on a particular example is actually generalized into a more global concept.
5
Classifying Unknown Samples
Once we have established domestic language feature sets and constraint relations approximating the expert's reasoning on a particular type of digit, we have essentially obtained a multi-layered reasoning scheme. An unknown sample can then be checked against this scheme to see whether it bears enough characteristic traits of this digit class. This can be done by decomposing the unknown pattern according to the structure of the reasoning scheme, checking its matching degree at each level and subsequently computing its matching degree at higher levels, up to the root.
Fig. 4. AR-scheme Recognizing New Samples
It can be observed that each pattern and constraint relation set at a particular level of reasoning, called a production (see Fig. 4), determines a cluster of samples matching it. Based on the similarity measures developed during the training phase that correspond to the pattern sets, we can derive the distance of unknown samples to each cluster and, in consequence, develop the inclusion measure of a new sample in the concept approximated by the cluster. Such productions can be composed into approximate reasoning schemes (AR-schemes) under constraints [6] expressing that the quality of the input pattern required by a production in the AR-scheme is lower than that delivered by the production sending such a pattern.
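One possible (hypothetical) rendering of this bottom-up evaluation uses min as the degree combination and a threshold per production; both choices are illustrative assumptions of ours, not the authors' system.

def evaluate(node, sample):
    # node: ("leaf", degree_fn) or ("prod", threshold, children)
    if node[0] == "leaf":
        return node[1](sample)                     # inclusion degree of a basic pattern
    _, threshold, children = node
    degree = min(evaluate(c, sample) for c in children)
    return degree if degree >= threshold else 0.0  # production constraint not met

# Example for digit '5' (degree functions are hypothetical placeholders):
scheme = ("prod", 0.7, [("leaf", lambda s: s["vstroke"]),
                        ("leaf", lambda s: s["wbelly"]),
                        ("leaf", lambda s: s["hstroke"])])
print(evaluate(scheme, {"vstroke": 0.9, "wbelly": 0.8, "hstroke": 0.95}))  # 0.8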
It is also essential to note that the quality requirement imposed while searching for the target patterns ensures that the obtained classifier is stable, i.e., resistant to certain deviations in the input sample. It is enough for the new input to match the basic patterns Pati at the lowest level to a degree greater than the satisfactory threshold pi, and the outcome classification decision will be guaranteed to at least a satisfactory degree of accuracy.
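The sketch below illustrates this thresholded bottom-up evaluation (the tree structure, the thresholds and the min-combination are illustrative assumptions; the excerpt does not fix a particular combination operator):

# Minimal sketch: each production combines the degrees of its input
# patterns; a degree propagates upward only if every input clears the
# production's satisfactory threshold.
def evaluate(node, degrees):
    """node: ('leaf', name) or ('prod', threshold, [children])."""
    if node[0] == "leaf":
        return degrees[node[1]]
    _, threshold, children = node
    child_degrees = [evaluate(c, degrees) for c in children]
    if all(d >= threshold for d in child_degrees):
        # conservative combination: the weakest supporting evidence
        return min(child_degrees)
    return 0.0  # the production does not fire

scheme = ("prod", 0.8, [("leaf", "VStroke"),
                        ("prod", 0.7, [("leaf", "WBelly"),
                                       ("leaf", "Above")])])
print(evaluate(scheme, {"VStroke": 0.85, "WBelly": 0.95, "Above": 0.9}))  # 0.85

Because a degree only propagates when it clears the threshold, small deviations of the input below a basic pattern’s threshold cannot silently change the final decision, which is the stability property discussed above.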
6
Conclusion
A method for incorporating domain knowledge into the design and development of a classification system has been presented. We have demonstrated how approximate reasoning schemes can be used in the process of transferring knowledge from a human expert’s ontology, often expressed in natural language, into computable pattern features. The developed schemes ensure stability and adaptability of the constructed classifiers. We have shown how granular computing, equipped with rough mereology concepts, can be effectively applied to a highly practical field such as OCR and handwritten digit recognition. The developed framework might also be used in classifying other objects including, but not limited to, fingerprints, mugshots and iris scans, as well as for more complex tasks like the WITAS project (http://www.ida.liu.se/ext/witas/).
Acknowledgment. This work has been supported by Grant 8T11C02519 from the State Committee for Scientific Research of the Republic of Poland (KBN) and partially by the Wallenberg Foundation Grant.
References
1. J. Geist, R. A. Wilkinson, S. Janet, P. J. Grother, B. Hammond, N. W. Larsen, R. M. Klear, C. J. C. Burges, R. Creecy, J. J. Hull, T. P. Vogl, and C. L. Wilson. The second census optical character recognition systems conference. NIST Technical Report NISTIR 5452, pages 1–261, 1994.
2. K. Komori, T. Kawatani, K. Ishii, and Y. Iida. A feature concentrated method for character recognition. In Bruce Gilchrist, editor, Information Processing 77, Proceedings of the International Federation for Information Processing Congress 77, pages 29–34, Toronto, Canada, August 8–12, 1977. North Holland.
3. Z.C. Li, C.Y. Suen, and J. Guo. Hierarchical models for analysis and recognition of handwritten characters. Annals of Mathematics and Artificial Intelligence, pages 149–174, 1994.
4. L. Polkowski and A. Skowron. Towards adaptive calculus of granules. In L.A. Zadeh and J. Kacprzyk, editors, Computing with Words in Information/Intelligent Systems, pages 201–227, Heidelberg, 1999. Physica-Verlag.
5. R.J. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, Inc., 1992.
6. A. Skowron. Towards intelligent systems: Calculi of information granules. Bulletin of the International Rough Set Society, 5(1–2):9–30, 2001.
Reasoning Based on Information Changes in Information Maps

Andrzej Skowron¹ and Piotr Synak²

¹ Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
² Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
Abstract. We discuss basic concepts for approximate reasoning about information changes. Any rule for reasoning about information changes specifies how changes of information granules from the rule premise influence changes of information granules from the rule conclusion. Changes in information granules can be measured, e.g., using expressions analogous to derivatives. We illustrate our approach by means of information maps and information granules defined in such maps.
1
Introduction
We discuss basic concepts for reasoning about information changes related, e.g., to changes in time or space [11]. Our approach has roots in rough mereology [12,9,10]. The basic concepts used are information granules, measures of inclusion and closeness of information granules, and rules for reasoning about changes in information granules. The information granules and their relevant changes used in reasoning are assumed to be extracted from the underlying information maps. In [13] we have shown that patterns defined over information maps are relevant in data mining [1,4]. Any map consists of a transition relation on a set of states, i.e., pairs (label, information(label)). Any label describes a context in which the information assigned to the label is relevant. Exemplary information maps can be extracted from decision systems [8,6]. In this case one can take attribute-value vectors as labels. The corresponding information is a subsystem of a given decision system consisting of all objects consistent with the label. Patterns over information maps describe sets of states expressible by means of temporal formulas [3,2]. We investigate the expressibility of information changes in comparison with the changes of attribute-value vectors in a family of descending neighborhoods of a given label e0. We have found an analogy to the derivative of an information function f, mapping a set of labels onto an information set. Any rule for reasoning about information changes specifies how changes of information granules from the rule premise influence changes of information granules from the rule conclusion. Changes in information granules can be measured, e.g., using expressions analogous to derivatives. We illustrate our approach by means of information maps and information granules defined in such maps. The presented approach can be extended to hierarchical information maps.
Fig. 1. Information map.
2
Preliminaries
Let us recall notation [6]. In particular, A = (U, A) denotes an information system [8] with the universe U of objects and the attribute set A. Each attribute a ∈ A is a function a : U → Va, where Va is the value set of a. For a given set of attributes B ⊆ A we define the indiscernibility relation IND(B) on the universe U that splits U into indiscernibility classes, i.e., sets {y ∈ U : a(x) = a(y) for any a ∈ B} for any x ∈ U. Decision tables are denoted by A = (U, A, d) where d is the decision attribute. The decision attribute d defines a partition of the universe U into decision classes. An object x is inconsistent if there exists an object y such that x IND(A) y, but y belongs to a different decision class than x, i.e., d(x) ≠ d(y). The positive region of the decision table A (denoted by POS(A)) is the set of all consistent objects.
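As a small executable illustration of these definitions (the dictionary representation and the toy table are assumptions, not from the paper), the indiscernibility classes and the positive region can be computed as follows:

from collections import defaultdict

def ind_classes(universe, attrs):
    """Group objects by their value vector on attrs (the IND(B) classes)."""
    classes = defaultdict(set)
    for x, row in universe.items():
        classes[tuple(row[a] for a in attrs)].add(x)
    return list(classes.values())

def positive_region(universe, attrs, decision):
    """All consistent objects: IND(A)-classes with a unique decision value."""
    pos = set()
    for cls in ind_classes(universe, attrs):
        decisions = {universe[x][decision] for x in cls}
        if len(decisions) == 1:
            pos |= cls
    return pos

# toy decision table: objects -> attribute values (d is the decision)
U = {"x1": {"a": 0, "b": 1, "d": "yes"},
     "x2": {"a": 0, "b": 1, "d": "no"},   # indiscernible from x1, other decision
     "x3": {"a": 1, "b": 0, "d": "yes"}}
print(positive_region(U, ["a", "b"], "d"))  # {'x3'}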
3
Information Maps
Information maps [13] are usually generated from experimental data, like information systems or decision tables, and are defined by means of some binary (transition) relations on a set of states. Any state consists of an information label and the information extracted from a given data set corresponding to that label. We present examples explaining the meaning of information labels, the information related to such labels, and transition relations (in many cases partial orders) on states. Such structures are basic models over which one can search for relevant patterns for many data mining problems [13]. An information map A is a quadruple (E, ≤, I, f), where E is a finite set of information labels, the transition relation ≤ ⊆ E × E is a binary relation on information labels, I is an information set, and f : E → I is an information function associating the corresponding information with any information label. In Figure 1 we present an example of an information map, where E = {e1, e2, e3, e4, e5}, I = {f(e1), f(e2), f(e3), f(e4), f(e5)}, and ≤ is a partial order on E. A state is any pair (e, f(e)) where e ∈ E. The set {(e, f(e)) : e ∈ E} of all states of A is denoted by SA. The transition relation on information labels can be extended to a relation on states, e.g., in the following way: (e1, i1) ≤ (e2, i2) iff e1 ≤ e2. A path in A is any sequence s0 s1 s2 . . . of states such that (i) si ≤ si+1 and (ii) if si ≤ s ≤ si+1 then s = si or s = si+1, for any i ≥ 0. A property of A
[Figure 2 shows (a) the information map of a small information system with objects x1–x4, attributes a, b, c, d, the labels x = {(a=v)}, y = {(a=v), (c=w)}, z = {(a=v), (d=u)} and their corresponding subtables; and (b) the information map of a temporal information system ordered along the time axis t.]
Fig. 2. Information map of (a) information system; (b) temporal information system.
is any subset of SA. Let F be a set of formulas of a given language. A property ϕ is expressible in F iff ϕ = ‖α‖ for some α ∈ F, where ‖α‖ denotes the semantics of α. We present two examples of information maps – more can be found in [13]. Any information system A = (U, A) defines its information map by (i) the set of labels E = INF(A) = {InfB(x) : x ∈ U, B ⊆ A} where InfB(x) = {(a, a(x)) : a ∈ B}; (ii) the relation ≤ (being a partial order on E) defined by u ≤ v iff u ⊆ v for any u, v ∈ INF(A); (iii) the information set I = {Av : v ∈ INF(A)} where Av = (Uv, Av), Uv = {x ∈ U : a(x) = t for all (a, t) ∈ v}, and the attributes from Av are the attributes from A restricted to Uv; (iv) the information function f mapping INF(A) into I, defined by f(v) = Av for any v ∈ INF(A). In Figure 2(a) three information vectors x = {(a, v)}, y = {(a, v), (c, w)} and z = {(a, v), (d, u)} are shown, satisfying the conditions x ≤ y, x ≤ z. Our second example is related to temporal information systems. A temporal information system [14] is an information system A = ({xt}t∈E⊆N, A) with the universe linearly ordered by xt ≤ xt′ iff t ≤ t′. Patterns in such systems are widely studied in data mining (see, e.g., [7,11]). Any temporal information system A defines in a natural way its information map. The information label set E is the set of all possible time units, and the relation ≤ is the natural order on N restricted to E. The information function f maps any given unit of time t into the information corresponding to the object of U related to t, i.e., f(t) = InfA(xt). In this case the map reflects the temporal order of attribute-value vectors ordered in time. An example of such an information map is presented in Figure 2(b). The transition relation in the first example defines information changes in space (assuming the objects are spatial) while in the second example the transition relation concerns time changes only. A combination of such transition relations may lead to a spatio-temporal information map.
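A minimal sketch of the first construction (the dictionary encoding and the helper names are hypothetical assumptions): labels are information vectors, the order is set inclusion, and f returns the set of objects consistent with a label, standing in for the subtable Av:

from itertools import combinations

def information_map(universe, attrs):
    """Labels: all InfB(x) vectors; order: set inclusion; f: subtables."""
    labels = set()
    for row in universe.values():
        for r in range(1, len(attrs) + 1):
            for subset in combinations(attrs, r):
                labels.add(frozenset((a, row[a]) for a in subset))

    def f(label):  # objects consistent with the label (the universe Uv)
        return {x for x, row in universe.items()
                if all(row[a] == t for a, t in label)}

    order = {(u, v) for u in labels for v in labels if u <= v}  # u subset of v
    return labels, order, f

U = {"x1": {"a": "v", "c": "w"}, "x2": {"a": "v", "c": "x"}}
labels, order, f = information_map(U, ["a", "c"])
print(f(frozenset({("a", "v")})))  # {'x1', 'x2'} (set order may vary)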
4
Exemplary Problem
Information maps can be used to formulate numerous problems relevant for data mining. We present one example related to a specific kind of information maps – maps for information systems and decision tables. For more examples the reader is referred to [13]. Solutions of considered problems can be obtained by searching
for good patterns in a relevant language. Such patterns express (temporal and/or spatial) properties of a given information system. Let us consider an example. Problem. For a given information map A of a given information system (or decision table) A = (U, A), find a minimal element e of E with respect to the partial order ≤, such that the set of subtables S(e) = {f(e′) : e ≤ e′} satisfies given constraints (expressible in a fixed temporal logic). One can choose such constraints in the following way. We look for states such that the set of states reachable from them is sufficiently large and has the following property: any two states s1 = (e1, f(e1)), s2 = (e2, f(e2)) reachable from the state s = (e, f(e)) (i.e., s ≤ s1 and s ≤ s2) consist of decision subtables f(e1), f(e2) with close positive regions. The closeness of positive regions can be defined by means of closeness measures of sets. Other possible choices can be made using entropy or association rules parameterised by thresholds (support and confidence [1]) instead of positive regions. In the new setting one can consider a generalization of association rules [1] by considering implications α ⇒ β where α, β are temporal formulas from a fixed temporal logic interpreted in an information map. The support and confidence can be defined in an analogous way as for standard association rules, taking into account the specificity of states in information maps. For example, one can search for a pattern α of subtables such that if a state s satisfies α then, with certainty defined by the confidence coefficient, this state also has property β (e.g., β means that any path starting from such a state s consists of subtables with sufficiently small entropy). At the same time the set of states satisfying α and β should be sufficiently large to a degree defined by the support coefficient. The temporal structure of association rules depends on the application.
5
Reasoning Based on Information Changes
Discovery of relevant information changes in spatio-temporal information granules is a challenge. We discuss approximate reasoning based on information changes using terminology from granular computing and rough mereology [12,9,10]. In the case of information maps one can talk about different levels of information granules, e.g., elementary granules like labels, as well as about complex granules being labels’ neighborhoods or sets of all labels reachable from a given one by applying the transition relation. In all such cases we would like to perform reasoning based on information changes in such granules. Let F : G → G where G is a set of granules. For a given granule e0 ∈ G we can investigate how the values of the function F change compared to changes of the argument, for a sequence {ek} of granules becoming sufficiently close to e0 as k increases. To compare granules we can use, e.g., a closeness function cl : G × G → [0, 1], so the presented idea may be expressed by the formula:

cl(F(e0), F(ek)) / cl(e0, ek)   (1)
If the limit of (1) exists for k → ∞, we can talk about the derivative of F at the point e0. However, when G is a finite set, such a limit may be trivial and of no practical use. Therefore, we look for the minimal k0 such that for any greater index the value of expression (1) does not change significantly. In the next section we present an example of an application of the above scheme to information maps: e0 is an elementary granule, i.e., a label; {ek}k=1,2,... is a descending family of neighborhoods of e0; F is an information function f. The notion of derivative in the above sense is important for many applications. For example, its existence may be used to approximate information changes on states reachable from two close states.

5.1
Derivatives in Information Maps
Let A = (E, ≤, I, f) be the information map of an information system A = (U, A, d). Labels from E are elementary patterns from INF(A). For a given label e0, we estimate changes of the information function f on a descending sequence of neighborhoods of e0, compared to the change of its argument (label). Assuming a closeness measure cl : E × E → [0, 1] is given, we define the k-th neighborhood of the label e0 by

Nk(e0) = {e : cl(e, e0) ≥ 1 − 1/k}.   (2)

cl(e0, Nk(e0)) is defined by means of closeness degrees between e0 and all labels from Nk(e0). Analogously, we define the neighborhood of the information corresponding to e0 as the set of information corresponding to all labels from Nk(e0):

Nk(f(e0)) = {f(e) : cl(e, e0) ≥ 1 − 1/k}   (3)
Finally, the measure of information change with respect to the argument (label) change within some neighborhood Nk(e0) of the label e0 can be expressed in terms of the closeness functions, i.e., by the following formula:

cl(f(e0), Nk(f(e0))) / cl(e0, Nk(e0))   (4)
Observe that, because the number of labels is finite, there exists l ≥ k such that the corresponding neighborhoods are trivial, i.e., Nl(e0), Nl+1(e0), . . . are one-element sets {e0}. Analogously, the neighborhoods defined by f(e0), i.e., Nl(f(e0)), Nl+1(f(e0)), . . ., are one-element sets {f(e0)}. Thus, the limit (in the classical sense) of (4) for k → ∞ is 1. Instead, we look for the minimal index k0 such that for all neighborhoods, starting from k0, if labels are close enough, so is the corresponding information. Because {Nk(e0)}k=1,2,... is a descending family of neighborhoods of e0 (i.e., N1(e0) ⊇ N2(e0) ⊇ . . .) we can define the following limit, for some threshold parameter δ ≥ 0:

δ-lim_{k→∞} {Nk(e0)}   (5)
The neighborhood Nk0(e0) is a δ-limit of {Nk(e0)}k=1,2,... if and only if k0 is the smallest number such that for each k ≥ k0

| cl(f(e0), Nk(f(e0))) / cl(e0, Nk(e0)) − cl(f(e0), Nk0(f(e0))) / cl(e0, Nk0(e0)) | < δ.   (6)
(f (e )))
The number cl(e00 ,Nkk0 (e0 ))0 can be interpreted as derivative of an information 0 function f in point e0 with precision δ. Such notion can be used, e.g., to measure closeness of states reachable from close sets. 5.2
Examples
Let us consider two examples of possible interpretations of the presented ideas. In the first one, the closeness between labels, and hence the label’s neighborhood, is based on the label’s syntax. By close labels we understand those described by the same set of attributes, assuming the corresponding descriptors are close to each other. To be able to talk about closeness between descriptors we assume they are numerical. Let e1 = {(a, a1), (b, b1), . . .}, e2 = {(a, a2), (b, b2), . . .} and let Attr(e) be the set of all attributes occurring in the label e. Then the closeness between two descriptors based on the same attribute is defined by

cl((a, a1), (a, a2)) = 1 − |a1 − a2| / |Va|   (7)
where |Va| is the length of the interval being the domain (value set) of attribute a. Let Attr(e1) = Attr(e2). We define the closeness between e1 and e2 by

cl(e1, e2) = (1 / card(Attr(e1))) · Σ_{a∈Attr(e1)} cl(e1(a), e2(a))   (8)
where e(a) is the descriptor of label e based on attribute a. In the case when Attr(e1) ≠ Attr(e2) we define the closeness of such labels as 0. The presented definition expresses the idea that close labels are based on the same attributes and have close values. That means that close labels are only those from the same level of the information map, and they form an anti-chain of ≤. Also the neighborhoods are defined in this manner. It is easy to see that the distance based on such a definition of the closeness function, i.e., d(e1, e2) = 1 − cl(e1, e2), is a metric. The closeness of a label e0 and its neighborhood is defined as the average closeness between e0 and all of the labels from Nk(e0):

cl(e0, Nk(e0)) = (1 / card(Nk(e0))) · Σ_{e∈Nk(e0)} cl(e, e0).   (9)
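Equations (7)–(9) translate directly into code; in the sketch below (illustrative only) the attribute ranges |Va| and the toy labels are assumptions:

def cl_descriptor(v1, v2, range_len):
    return 1 - abs(v1 - v2) / range_len                      # equation (7)

def cl_labels(e1, e2, ranges):
    if set(e1) != set(e2):          # different attribute sets: closeness 0
        return 0.0
    return sum(cl_descriptor(e1[a], e2[a], ranges[a])
               for a in e1) / len(e1)                        # equation (8)

def cl_label_neighborhood(e0, neighborhood, ranges):
    return sum(cl_labels(e, e0, ranges)
               for e in neighborhood) / len(neighborhood)    # equation (9)

ranges = {"a": 10.0, "b": 20.0}   # lengths of the attribute domains |Va|
e0 = {"a": 3.0, "b": 5.0}
nbhd = [{"a": 3.5, "b": 6.0}, {"a": 2.0, "b": 5.0}]
print(cl_label_neighborhood(e0, nbhd, ranges))  # 0.95 on this toy data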
Now, let us define the closeness between elements of an information set. It can be expressed in terms of a chosen property of the information systems determined by labels. For example, for a given label e, we can consider the relative size of the positive region POS(Ae), where Ae is the decision table corresponding to e:
cl(f(e), f(e0)) = 1 − | card(POS(Ae)) / card(Ue) − card(POS(Ae0)) / card(Ue0) |   (10)
Thus, by close information systems we understand those having close relative positive regions. The closeness of an information system Ae0 and its neighborhood is defined as the average closeness between Ae0 and all systems from Nk(f(e0)):

cl(f(e0), Nk(f(e0))) = (1 / card(Nk(e0))) · Σ_{e∈Nk(e0)} cl(f(e), f(e0)).   (11)
In the second example the closeness between labels is defined on the syntax level as well (see (8)) – and so is the definition of the label’s neighborhood. However, to define the closeness between a label and its neighborhood we use the labels’ semantics, i.e., the corresponding sets of objects. It is worth noticing that if a label e ≠ e0 belongs to the neighborhood of e0 then the universes of the corresponding information systems Ae and Ae0, respectively, are disjoint.

cl(e0, Nk(e0)) = 1 − sup_{e∈Nk(e0)} |card(Ue) − card(Ue0)| / sup_{e∈Nk(e0)} card(Ue)   (12)
The closeness between the information related to a label e0 and its neighborhood is defined again in terms of the size of the positive region. However, this time we measure the degree of its change when we extend the universe of the information system Ae0 using universes from the neighborhood:

cl(f(e0), Nk(f(e0))) = sup_{e∈Nk(e0)\{e0}} {1 − | card(POS(Ae0 ∪ Ae)) / card(Ue0 ∪ Ue) − card(POS(Ae0)) / card(Ue0) |}   (13)
Observe that the derivative of the information function f at the point e0 is now defined as a quotient in which the numerator expresses the degree of change of the information (value) and the denominator the degree of change of the label (argument).
6
Conclusions and Directions for Further Research
We have discussed foundations for reasoning about information changes. An illustrative example was used to present our approach. The approach can be extended to a more general case assuming changes are measured by means of information granules extracted from underlying relational structure (with many transition relations and functions having spatial, temporal or spatio–temporal nature). Such an approach can be based on hierarchical information maps defined over a given relational structure. Information granules over hierarchical information maps are sets defined by means of formulas specifying properties
of relational structures. More complex information granules and closeness measures between them are defined recursively using a given relational structure and previously defined information granules as well as closeness measures [12]. This more general case will be studied in subsequent papers. Acknowledgements. The research has been supported by the State Committee for Scientific Research of the Republic of Poland (KBN) research grant 8 T11C 025 19. The research of Andrzej Skowron has been also supported by the Wallenberg Foundation grant.
References
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996, 307–328.
2. Bolc, L., Szalas, A. (Eds.): Time and Logic, UCL Press, London, 1995.
3. Clarke, E.M., Emerson, E.A., Sistla, A.P.: Automatic verification of finite state concurrent systems using temporal logic specifications: A practical approach. ACM Transactions on Programming Languages and Systems 8(2), 1986, 244–263.
4. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
5. Gabbay, D.M., Hogger, C.J., Robinson, J.A. (Eds.): Handbook of Logic in Artificial Intelligence and Logic Programming, Epistemic and Temporal Reasoning 4, Oxford University Press, Oxford, 1995.
6. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Pal, S.K., Skowron, A. (Eds.): Rough Fuzzy Hybridization: A New Trend in Decision-Making, Springer-Verlag, Singapore, 1999, 3–98.
7. Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of frequent episodes in event sequences. Report C-1997-15, University of Helsinki, Finland, 1997.
8. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning about Data, Kluwer, 1991.
9. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. International Journal of Approximate Reasoning 15(4), 1996, 333–365.
10. Polkowski, L., Skowron, A.: Towards an adaptive calculus of granules. In: Zadeh, L.A., Kacprzyk, J. (Eds.): Computing with Words in Information/Intelligent Systems 1, Physica-Verlag, Heidelberg, 1999, 201–228.
11. Roddick, J.F., Hornsby, K., Spiliopoulou, M.: YABTSSTDMR – yet another bibliography of temporal, spatial and spatio-temporal data mining research. In: Unnikrishnan, K.P., Uthurusamy, R. (Eds.): SIGKDD Temporal Data Mining Workshop, ACM Press, San Francisco, 2001, 167–175.
12. Skowron, A.: Toward intelligent systems: Calculi of information granules. Bulletin of the International Rough Set Society 5(1–2), 2001, 9–30.
13. Skowron, A., Synak, P.: Patterns in information maps. LNAI 2475, Springer-Verlag, Heidelberg, 2002, 453–460.
14. Synak, P.: Temporal templates and analysis of time related data. LNAI 2005, Springer-Verlag, Heidelberg, 2000, 420–427.
Characteristics of Accuracy and Coverage in Rule Induction

Shusaku Tsumoto

Department of Medical Informatics, Shimane Medical University, School of Medicine, Enya-cho, Izumo City, Shimane 693-8501, Japan
[email protected]
Abstract. Rough set analysis is closely related to accuracy and coverage. However, the formal characteristics of accuracy and coverage for rule induction had rarely been discussed until Tsumoto showed several characteristics of accuracy and coverage. In this paper, the following characteristics of accuracy and coverage are further investigated: (1) The higher the accuracy of a conjunctive formula becomes, the smaller the effect of a further conjunction becomes. (2) Coverage will decrease more rapidly than accuracy. (3) The change of coverage becomes very small when the length of the conjunctive formula becomes larger. (4) The discussions above correspond to those on sensitivity and specificity. (5) When we focus on accurate classification, the classification efficiency, which is the product of sensitivity and specificity, will become lower.
1
Introduction
Rough set based rule induction methods have been applied to knowledge discovery in databases [1,2,6,7]. The empirical results obtained show that they are very powerful and that some important knowledge has been extracted from datasets. Furthermore, Tsumoto argues that the core ideas of rough sets correspond to the reasoning style of medical experts, which makes the results obtained easier for physicians to understand and helps to discover useful knowledge from those results [6,7]. However, the formal characteristics of accuracy and coverage for rule induction had rarely been discussed until Tsumoto showed the following characteristics [9]: (a) accuracy and coverage measure the degree of sufficiency and necessity, respectively. Also, they measure that of the lower approximation and upper approximation. (b) Coverage can be viewed as likelihood. (c) These two measures are related to statistical independence. (d) These two indices have a trade-off relation. (e) When we focus on the conjunction of attribute-value pairs, coverage decreases more than accuracy. In this paper, the following characteristics of accuracy and coverage are further investigated: (1) The higher the accuracy of a conjunctive formula becomes, the smaller the effect of a further conjunction becomes. (2) Coverage will decrease more rapidly than accuracy. (3) The change of coverage becomes very small when the
length of the conjunctive formula becomes larger. (4) When we focus on accurate classification, the classification efficiency will become lower.
2

Preliminaries

2.1

Definition of Accuracy and Coverage
In the subsequent sections, we adopt the following notation, which was introduced by Skowron and Grzymala-Busse [5]. Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a. Then, a decision table is defined as an information system A = (U, A ∪ {d}). The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For each f ∈ F(B, V), fA denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows: (1) if f is of the form [a = v] then fA = {s ∈ U | a(s) = v}; (2) (f ∧ g)A = fA ∩ gA; (f ∨ g)A = fA ∪ gA; (¬f)A = U − fA. By using this framework, accuracy and coverage are defined as follows. Definition 1. Let R and D denote a formula in F(B, V) and the set of objects which belong to a decision class d. Classification accuracy and coverage (true positive rate) for R → d are defined as:
|RA ∩ D| |RA ∩ D| (= P (D|R)), and κR (D) = (= P (R|D)), |RA | |D|
where |A| denotes the cardinality of a set A, αR(D) denotes the classification accuracy of R as to the classification of D, and κR(D) denotes the coverage, or true positive rate, of R to D, respectively.
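As a small executable illustration of Definition 1 (toy sets only, not the author’s code), accuracy and coverage are just two ratios over the same overlap:

def accuracy_and_coverage(R_A, D):
    """R_A: objects satisfying formula R; D: objects of decision class d."""
    overlap = len(R_A & D)
    alpha = overlap / len(R_A)   # = P(D|R), degree of sufficiency
    kappa = overlap / len(D)     # = P(R|D), degree of necessity
    return alpha, kappa

R_A = {1, 2, 3, 4}        # objects matching the rule premise
D = {2, 3, 4, 5, 6}       # objects in the decision class
print(accuracy_and_coverage(R_A, D))  # (0.75, 0.6)

2.2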
Fundamental Characteristics of Accuracy and Coverage
As measures of necessity and sufficiency. One of the most important features of accuracy and coverage for rule induction is that they can be viewed as measures of sufficiency and necessity. That is, αR(D) measures the degree of sufficiency of a proposition R → d, and κR(D) measures the degree of its necessity. Thus, if both measures are 1.0, then R ↔ d (RA = D). The values of these indices provide the degree of these conditions, which can be interpreted as the power of a given proposition. In particular, αR(D) = 1.0 is equivalent to RA ⊆ D and κR(D) = 1.0 is equivalent to D ⊆ RA. Thus,

Theorem 1 (Rough Set Approximation). αR(D) and κR(D) measure the degree of the lower and upper approximation. In other words, both approximations can be characterized by these two indices.

Lower approximation: ∪RA s.t. αR(D) = 1.0
Upper approximation: ∪RA s.t. κR(D) = 1.0
3
Former Results on Accuracy and Coverage
This section shows the results obtained in [9].

3.1
Statistical Dependence
Let P(R) and P(D) be defined as P(R) = |RA| / |U| and P(D) = |D| / |U|, where U denotes the set of all samples. Then, an index of statistical dependence ςR(D) is defined as:
ςR(D) = P(R, D) / (P(R)P(D)) = |U| · |RA ∩ D| / (|RA| |D|),
where P(R, D) denotes the joint probability of R and D (P(R, D) = |RA ∩ D| / |U|). Since the formula P(R, D) = P(R)P(D) is the definition of statistical independence, ςR(D) measures the degree of statistical dependence. That is, if ςR(D) > 1.0, then R and D are dependent; otherwise R and D are independent; in particular, if ςR(D) is equal to 1.0, they are statistically independent.

Theorem 2. Lower approximation and upper approximation give (strong) statistically dependent relations.

Proof. Since αR(D) = 1.0 for the lower approximation, ςR(D) = 1 / P(D) > 1.0. In the same way, for the upper approximation, ςR(D) = 1 / P(R) > 1.0.
Definition 2 (A Sequence of Conjunctive Formulas). Let U be described by n attributes. A conjunctive formula R(i) is defined as R(i) = ∧_{k=1..i} [ak = vk], where the index i is sorted by a given criterion, such as the value of accuracy. Then, the sequence of conjunctions is given as R(i + 1) = R(i) ∧ [ai+1 = vi+1].

Since R(i + 1)A = R(i)A ∩ [ai+1 = vi+1]A, for this sequence the following proposition holds: R(i + 1)A ⊆ R(i)A. Thus, the following theorem is obtained.

Theorem 3. When we consider a sequence of conjunctive formulas such that the value of accuracy increases, the statistical dependence will increase.

Proof. ςR(i+1)(D) = αR(i+1)(D) / P(D) ≥ αR(i)(D) / P(D) = ςR(i)(D)

3.2
Tradeoff between Accuracy and Coverage
First, it is notable that coverage decreases when a sequence of conjunctions of R is considered.

Theorem 4 (Monotonicity of Coverage). Let a sequence of conjunctive formulas R(i) be given with n attributes. Then, κR(i+1)(D) ≤ κR(i)(D).

Then, since accuracy and coverage have the following relation:
κR(D) / αR(D) = P(R) / P(D).   (1)
Since P (R) will decrease with the sequence of conjunction, the following theorem is obtained. Theorem 5. Even if a sequence of conjunction for R is selected such that the value of accuracy increases monotonically, κR (D) will decrease. That is, the decrease of κR (D) is larger than the effect of the increase of αR (D).
4
Accuracy and Coverage of the Sequence of Conjunction
In this section, the characteristics of accuracy and coverage in the sequence of conjunctions are investigated. Let us consider the sequence of conjunctions {R(i), R(i + 1), R(i + 2)} and assume the following relations: |R(i)A| = n, |R(i + 1)A| = n − n1, |R(i + 2)A| = n − n1 − n2, |R(i)A ∩ D| = m, |R(i + 1)A ∩ D| = m − m1, |R(i + 2)A ∩ D| = m − m1 − m2, |D| = d. Then, a simple calculation shows that the difference of accuracy is obtained as:

∆α(i, i + 1) = αR(i+1)(D) − αR(i)(D) = (m − m1)/(n − n1) − m/n = (mn1 − nm1) / (n(n − n1)).   (2)
Thus,

Proposition 1. For a sequence of conjunctions such that the values of accuracy increase monotonically, the following inequality should hold:

mn1 − nm1 > 0, or m/n > m1/n1.
In a similar way, the difference of coverage is obtained as:

∆κ(i, i + 1) = κR(i+1)(D) − κR(i)(D) = (m − m1)/d − m/d = −m1/d < 0.
Then, for the increasing sequence of accuracy,

∆κ(i, i + 1) = −m1/d > −mn1/(dn).
Next, let us consider the secondary difference of the accuracy. From equation (2),

∆α(i + 1, i + 2) = αR(i+2)(D) − αR(i+1)(D) = ((m − m1)n2 − (n − n1)m2) / ((n − n1)(n − n1 − n2)).   (3)

Then, the secondary difference of accuracy is obtained as:

∆²α(i, i + 1, i + 2) = ∆α(i + 1, i + 2) − ∆α(i, i + 1) = (−mn1(n − n1 − n2) − nn2m1) / (n(n − n1)(n − n1 − n2)) < 0

Thus,
Table 1. Two-way contingency table for R(i)

         D       ¬D
R(i)     m       n − m            n
¬R(i)    d − m   N − d − n + m    N − n
         d       N − d            N
Theorem 6. Even if a sequence of conjunction for R is selected such that the value of accuracy increases monotonically, the difference of accuracy will decrease. That is, the increase of the value of accuracy will diminish if the length of conjunctive formula becomes larger.
On the other hand, the secondary difference of coverage is obtained as:

∆²κ(i, i + 1, i + 2) = ∆κ(i + 1, i + 2) − ∆κ(i, i + 1) = (m1 − m2)/d.
Thus, when m1 < m2, the secondary difference of coverage is negative, which means that coverage will decrease rapidly. Conversely, if we take a sequence with m1 > m2 > · · · > mp, where p is the number of attributes, the decrease of coverage will become smaller. Thus,

Theorem 7. If the sequence of conjunctions is selected such that m1 > m2 > · · · > mp, then the decrease of the value of coverage will diminish.
5
Sensitivity and Specificity
5.1
Trade-off between Sensitivity and Specificity
In the medical epidemiological context, accuracy and coverage correspond to positive predictive value and sensitivity when the given conditional and decision attributes are binary. Thus, if the non-binary attributes in an information table are transformed into binary attributes, the discussion of accuracy and coverage can be translated into one of sensitivity and specificity. First, let us consider the two-way contingency table for R(i) shown in Table 1. Its sensitivity Sen(i) and specificity Spec(i) are defined as:

Sen(i) = m/d and Spec(i) = (N − d − n + m) / (N − d).   (4)
Next, the contingency table for R(i + 1) is given in Table 2. Its sensitivity and specificity are given as:

Sen(i + 1) = (m − m1)/d and Spec(i + 1) = (N − d − (n − m) + (n1 − m1)) / (N − d).
Table 2. Two-way contingency table for R(i + 1)

            D            ¬D
R(i + 1)    m − m1       n − n1 − m + m1             n − n1
¬R(i + 1)   d − m + m1   N − d − n + m + n1 − m1     N − n + n1
            d            N − d                       N
It is well known that sensitivity and specificity have a trade-off relation. That is, if we want to have a higher sensitivity, the specificity will become lower. This relation corresponds to the relation between accuracy and coverage. Actually, for the sequence of conjunctions,

∆Sen(i, i + 1) = Sen(i + 1) − Sen(i) = −m1/d (= ∆κ(i, i + 1)) < 0,
∆Spec(i, i + 1) = Spec(i + 1) − Spec(i) = (n1 − m1) / (N − d) > 0
Thus, the following theorem is obtained.

Theorem 8. When an attribute-value pair is added to the sequence for a rule, the sensitivity decreases and the specificity increases: in other words, an additional attribute-value pair will provide additional specificity at the loss of sensitivity.
It is notable that these relations are more general than those of accuracy and coverage. In Section 3, the sign of the difference of accuracy cannot be determined unless the sign of the formula mn1 − nm1 is given. This result can be explained by the representation of accuracy with sensitivity and specificity. From the equations (4), the following relation is obtained:

1 / αR(i)(D) = 1 + ((N − d)(1 − Spec(i))) / (d × Sen(i))
Thus, even if the sensitivity becomes lower and the specificity becomes higher, this alone does not determine whether the accuracy increases or not.

5.2
Secondary Difference
Let us consider the contingency table for R(i + 2) (Table 3). The corresponding sensitivity and specificity are given as:

Sen(i + 2) = (m − m1 − m2)/d and Spec(i + 2) = (N − d − (n − m) + (n1 − m1) + (n2 − m2)) / (N − d).
Table 3. Two-way contingency table for R(i + 2)

            D                 ¬D
R(i + 2)    m − m1 − m2       n − n1 − n2 − m + m1 + m2             n − n1 − n2
¬R(i + 2)   d − m + m1 + m2   N − d − n + m + n1 + n2 − m1 − m2     N − n + n1 + n2
            d                 N − d                                 N
Thus, from the calculation of the difference between i + 2 and i + 1, the following secondary differences are obtained:

∆²Sen(i, i + 1, i + 2) = (m1 − m2)/d,
∆²Spec(i, i + 1, i + 2) = ((n2 − m2) − (n1 − m1)) / (N − d)

Thus:

Theorem 9. The difference of sensitivity will decrease when m1 > m2 > · · · > mp holds, where p is the total number of attributes. On the other hand, the difference of specificity will decrease when n1 − m1 > n2 − m2 > · · · > np − mp holds.
5.3
Diagnostic Efficiency
To measure the degree of the trade-off relation between sensitivity and specificity, the diagnostic efficiency, which is the product of sensitivity and specificity, is frequently used in the medical context. In our context, the diagnostic efficiency for R(i), denoted by eR(i)(D), is given as eR(i)(D) = Sen(i) × Spec(i). From the contingency tables, the diagnostic efficiencies for R(i) and R(i + 1) are derived as:

eR(i)(D) = m{N − d − (n − m)} / (d(N − d)) and
eR(i+1)(D) = (m − m1){N − d − (n − m) + (n1 − m1)} / (d(N − d)).

Thus,

∆e(i, i + 1) = eR(i+1)(D) − eR(i)(D) = −(1/(N − d)){m1(N − d) + m1(n − m) + (m − m1)(n1 − m1)} < 0

Therefore, the diagnostic efficiency will decrease for the sequence of conjunctions, which suggests that the decrease of sensitivity is larger than the increase of specificity.
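The claimed behaviour is easy to check numerically; the sketch below (toy counts only, not from the paper) evaluates sensitivity, specificity and their product before and after adding one conjunct (n shrinks by n1, m by m1):

def sen_spec_efficiency(m, n, d, N):
    sen = m / d                            # equation (4)
    spec = (N - d - n + m) / (N - d)
    return sen, spec, sen * spec

# hypothetical contingency counts for R(i), then R(i+1)
m, n, d, N = 40, 60, 50, 200
m1, n1 = 5, 15
print(sen_spec_efficiency(m, n, d, N))            # (0.8, 0.867, 0.693)
print(sen_spec_efficiency(m - m1, n - n1, d, N))  # (0.7, 0.933, 0.653)
# sensitivity falls, specificity rises, and their product decreases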
6
Conclusion
In this paper, the following characteristics of accuracy and coverage have been investigated: (1) Even if a sequence of conjunctions for R is selected such that the value of accuracy increases monotonically, the difference of accuracy will decrease. (2) If the sequence of conjunctions is selected such that m1 > m2 > · · · > mp, then the decrease of the value of coverage will diminish. (3) Coverage is equivalent to sensitivity, and accuracy is described as a function of sensitivity and specificity. Thus, if sensitivity becomes lower, specificity will increase, although this does not imply that accuracy increases. (4) The differences of sensitivity and specificity will diminish if certain conditions hold. (5) When we focus on accurate classification, the classification efficiency, which is the product of sensitivity and specificity, will become lower. This paper is a second formal study on accuracy and coverage. More formal analysis will appear in future work.

Acknowledgments. This work was supported by the Grant-in-Aid for Scientific Research (13131208) on Priority Areas (No. 759) “Implementation of Active Mining in the Era of Information Flood” by the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References
1. Polkowski, L., Skowron, A. (Eds.): Rough Sets and Knowledge Discovery 1, Physica-Verlag, Heidelberg, 1998.
2. Polkowski, L., Skowron, A. (Eds.): Rough Sets and Knowledge Discovery 2, Physica-Verlag, Heidelberg, 1998.
3. Pawlak, Z.: Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.
4. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.
5. Skowron, A., Grzymala-Busse, J.: From rough set theory to evidence theory. In: Yager, R., Fedrizzi, M., Kacprzyk, J. (Eds.): Advances in the Dempster-Shafer Theory of Evidence, John Wiley & Sons, New York, 1994, 193–236.
6. Tsumoto, S.: Knowledge discovery in clinical databases and evaluation of discovered knowledge in outpatient clinic. Information Sciences 124, 2000, 125–137.
7. Tsumoto, S.: Automated discovery of positive and negative knowledge in clinical databases based on rough set model. IEEE EMB Magazine, 2000, 56–62.
8. Tsumoto, S.: Statistical extension of rough set rule induction. In: Proceedings of SPIE: Data Mining and Knowledge Discovery: Theory, Tools, and Technology III, 2001.
9. Tsumoto, S.: Accuracy and coverage in statistical induction. In: Proceedings of Rough Sets and Current Trends in Computing (RSCTC 2002), 2002.
10. Yao, Y.Y., Zhong, N.: An analysis of quantitative measures associated with rules. In: Zhong, N., Zhou, L. (Eds.): Methodologies for Knowledge Discovery and Data Mining, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNAI 1574, Springer, Berlin, 1999, 479–488.
11. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46, 1993, 39–59.
Interpretation of Rough Neural Networks as Emergent Model

Yasser Hassan and Eiichiro Tazaki

Toin University of Yokohama, Japan
Abstract. The need for more effective methods to generate and maintain global non-functional properties suggests an approach analogous to the way natural processes generate emergent properties. An emergent model allows the constraints of the task to be represented more naturally and permits only pertinent task-specific knowledge to emerge in the course of solving the problem. The paper describes some basics of emergent phenomena and their implementation in rough hybrid systems.
1
Hybrid Rough Sets Systems
To discuss general rough hybrid systems, we consider a system composed of a set of agents Ag = {ag1, . . . , agp}. Any agent from Ag is equipped with an information system ISag = (Uag, Cag), where Uag is a set of objects and Cag is a set of attributes. The decision table of the agent ag is a pair DTag = (Uag, Cag ∪ Dag). The lower and upper approximations of a set of objects X with respect to the condition attributes of DTag describe the vagueness in understanding X using agents from the set Ag. The agents differ in their learned behavior, and in their consequent experience and performance. Let L(ag) be a function which determines the set of immediate neighbor agents for each agent ag. The function L can be defined as:

L(ag) = {agi : agi ∈ Ag and agi is an immediate neighbor of ag}

Function L provides the selection technique in the hybrid rough system. The communication between agents can be defined as a mapping M(ag, L(ag)): for x ∈ Uag, the value M(ag, L(ag))(x) ∈ UL(ag). According to the function M, each agent of the hybrid system can send information to or receive information from other agents. Information can only be transmitted through sequences of immediate-neighbor communications. The undirected communications and the absence of complete information permit hybrid rough set systems that sometimes, but not always, succeed in satisfying their goals. One of the proposals for rough hybrid systems is a decomposition scheme of an agent into simpler parts. From an abstract point of view, this approach is about establishing, for any two objects (agent and sub-agent), a degree to which one of them is a “part” of the other. We will use the notion of a rough
inclusion function [5], which gives for any two entities of discourse a degree to which one of them is a part of the other. Rough inclusion is an extension of the rough membership function defined by Pawlak. Rough inclusion is parameterized by a real parameter r in the interval [0, 1]. Therefore, the predicate X µr Y means that X is a part of Y in degree r. We define rough inclusion µr by letting µr(X, Y) = |X ∩ Y| / |X| in case X ≠ φ, and µr(φ, Y) = 1. Rough inclusion satisfies the following conditions: µ(X, X) = 1 for any object X; µ(X, Y) = 1 implies that µ(Z, Y) ≥ µ(Z, X) for any triple X, Y, Z; and there is η such that µ(η, X) = 1 for any object X. Let us define the relation part(µ) on the set of agents Ag using the rough inclusion µ as follows:

X part(µ) Y ⇔ µ(X, Y) = 1 and µ(Y, X) < 1
The relation part(µ) is a non-reflexive and transitive relation on the set Ag. The formula X part(µ) Y reads: X is a part of Y. It satisfies the axioms: X part(µ) Y ⇒ non(Y part(µ) X) for any pair X, Y; and X part(µ) Y and Y part(µ) Z ⇒ X part(µ) Z for any triple X, Y, Z.
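A direct sketch of these definitions over finite sets (Python is used purely for illustration; the toy sets are assumptions):

def mu(X, Y):
    """Rough inclusion: the degree to which X is a part of Y."""
    if not X:
        return 1.0   # mu_r(empty set, Y) = 1 by definition
    return len(X & Y) / len(X)

def proper_part(X, Y):
    """X part(mu) Y: X is fully included in Y, but not conversely."""
    return mu(X, Y) == 1.0 and mu(Y, X) < 1.0

X, Y = {1, 2}, {1, 2, 3}
print(mu(X, Y), proper_part(X, Y), proper_part(Y, X))  # 1.0 True False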
2
Hybrid Rough Sets and Neural Networks
This section is an attempt to present an approach aimed at connecting rough set theory with artificial neural networks.

2.1
The Reduction of Input Features
Let us define what is called a Weighted Information System, WIS = (U, C, W), where U is the set of objects, C is the set of attributes, and W is the set of weights related to the attributes of C and connected with each object. The weighted decision table takes the form WDT = (U, C ∪ D, W). The values of the set of weights W are determined according to the interactions between objects, and the interactions between objects and their environment. If object p1 interacts with object p2, then the weight value for attribute i related to object p1 is modified according to the formula:

∆wi = α if do1 = do2; ∆wi = −α if do1 ≠ do2

where α is determined as α = τ if i(p1) = i(p2), α = −τ if i(p1) ≠ i(p2), and τ is a constant. The weighted information system is a specific information table and, therefore, all concepts of rough set analysis can be adapted to it. The algorithm for a reduct of attributes based on the idea of the weighted decision table can take the following form:

Input: Decision table DT = (U, C ∪ D)
Output: Reduct of attributes R
– Convert the decision table DT to the weighted decision table WDT by adding the distinguished set W, whose values are associated with the attribute values and the interactions between objects.
– While (there exists an object that has not yet interacted) do
  * Interact object i with object j, i ∈ {1, 2, . . . , n}, j ∈ {1, 2, . . . , n}
  * For (a = 1 to n): modify the weight value associated with attribute a and object i
1- R = φ
2- For (a = 1 to n): if (Σ_{j=1..n} waj > 0) then R = R ∪ {a}

Finally, after we determine the set of reducts for the input features, we construct the model of rough neural networks by removing from the input vector the attributes which were not included in any reduct set (a hedged code sketch of this procedure follows below).
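A hedged sketch of this weighted-reduct procedure (the interaction schedule, the constant τ and the stopping rule are simplified assumptions, since the excerpt leaves them partly informal):

def weighted_reduct(objects, attrs, decision, tau=1.0, rounds=3):
    """objects: list of dicts; keep attributes with positive total weight."""
    # one weight per (object, attribute) pair, all starting at zero
    w = [[0.0 for _ in attrs] for _ in objects]
    for _ in range(rounds):                 # simplified interaction schedule
        for i, p1 in enumerate(objects):
            for p2 in objects:
                if p1 is p2:
                    continue
                same_d = p1[decision] == p2[decision]
                for k, a in enumerate(attrs):
                    same_v = p1[a] == p2[a]
                    # reward attributes whose (dis)agreement is consistent
                    # with the decision; penalise the others (paper's rule:
                    # dw = +alpha iff decisions agree, alpha = +tau iff values agree)
                    w[i][k] += tau if same_v == same_d else -tau
    return [a for k, a in enumerate(attrs)
            if sum(w[i][k] for i in range(len(objects))) > 0]

data = [{"a": 0, "b": 1, "d": "yes"},
        {"a": 0, "b": 0, "d": "no"},
        {"a": 1, "b": 1, "d": "yes"}]
print(weighted_reduct(data, ["a", "b"], "d"))  # ['b'] on this toy table

2.2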
Rough Neuron
The outputs of a rough neuron r are calculated using the following formulas, where g stands for any transfer function [4]:

outputr̄ = max(g(inputr̄), g(inputr)) and outputr = min(g(inputr̄), g(inputr))

where r̄ and r denote the upper and lower parts of the rough neuron, respectively. The learning process for the network is based on the back-propagation algorithm, in which the weights are adjusted according to the formula

wji(new) = wji(old) + α · erri · f(inputi)

where f is the derivative of the sigmoid function, α is the learning coefficient, which remains constant during learning, and erri is the error for neuron i.
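A minimal sketch of the forward pass of one rough neuron (the sigmoid transfer function and the paired lower/upper inputs are standard choices, not prescribed by the excerpt):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rough_neuron(input_lower, input_upper, g=sigmoid):
    """Paired lower/upper neurons exchange outputs through min/max."""
    upper_out = max(g(input_upper), g(input_lower))
    lower_out = min(g(input_upper), g(input_lower))
    return lower_out, upper_out

print(rough_neuron(input_lower=-0.5, input_upper=1.2))

2.3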
Guidelines for Emergent Phenomena
In this system, the interaction of the dynamic representation and the non-positional interpretation provides some innate emergent properties that assist in the acquisition of solutions. This hybrid system exhibits an emergent model on several levels. First, the determination of the redundant features emerges from the interactions between the input features. Second, in the learning process itself, the ability to recognize the pattern set (embodied in the connection topology and weights) emerges from the interactions of agents (neurons and links). Third, once the net is trained, the appropriate pattern at the output layer emerges from the interactions between agents (neurons and links) in the static network. A rough neural network needs to be used in an emergent way while performing its roles; therefore, a duplication mechanism is required. We define a duplicate operator that an agent can use to duplicate itself. In the same manner, a removal operator can be defined. In our hybrid system, the adaptation process is local to each agent and no global control exists. Therefore, each agent has the ability to produce a new agent, and it can remove itself from the system under local control only. To define local control for each agent, we assign a fitness value as Fag = α1 · V + α2 · B, where Fag is the fitness function for an agent ag ∈ Ag, α1 and α2 are parameters, V is the average of the input and output values for the agent (neuron) ag ∈ Ag, and B measures how much the agent ag is “better” than the other agents in Ag. The value of V is determined according to how the weight vector values that connect to this neuron are modified after some number of iterations, the highest value being the best one. Depending on the value of the fitness function Fag, the agent or neuron ag can be split into two neurons using the duplicate operator, i.e., it produces another neuron with exactly the same interconnections in the network, inheriting its parent neuron’s attributes. If a neuron does not
form the correct interconnections with other neurons, or it is redundant in the network, then it will die.
3
Experiment
We report the results of experiments on a medical dataset. The data were obtained from the Department of Urology and Pathology, Kitasato University School of Medicine, Japan [1]. The dataset contains 178 patients and consists of eight condition attributes and four decision attributes. In our experiment, the whole dataset was partitioned randomly into a training set (n = 89) and a testing set (n = 89).
Fig. 1. The rate of classification error for three models of neural networks: Model1: Standard Neural Network, Model2: Rough Neural Network, and Model3: Proposed Model.
Table 1. The results of the three models of neural networks.

The model                    Error rate (Training)  Error rate (Testing)
Standard Neural Network      0.1609                 0.1899
Rough Neural Network         0.1649                 0.1742
Full Rough Neural Network    0.0082                 0.0787
Figure 1 shows the rate of classification error for the three models of neural networks. From the results, we observe that the maximum errors are almost the same for the three models. For model 1 and model 2, the maximum error is 0.2176, and for model 3 it is 0.1547. The convergence of model 1 starts after 940 iterations, while the convergence of model 2 starts at time = 540. For our proposed model, the convergence happens after 970 iterations.
4
Conclusion
This paper thus provides a way in which the development of complex intelligent behaviors might involve evolutionary processes, learning processes, agent/environment interaction, and representation development. The learning method provided by rough sets forms a bridge between the neural network paradigms on the one hand and the representationalist paradigm on the other.
References
[1] Egawa, S., Suyama, K., Yoichi, A., Matsumoto, K., Tsukayama, C., Kuwao, S., Baba, S.: A study of pretreatment nomograms to predict pathological stage and biochemical recurrence after radical prostatectomy for clinically resectable prostate cancer in Japanese men. Jpn. J. Clin. Oncol. 31(2), 2001, 74–81.
[2] Gotts, N.: Emergent phenomena in large sparse random arrays of Conway’s ‘Game of Life’. International Journal of System Science 31(7), 2000, 873–894.
[3] Hassan, Y., Tazaki, E.: Emergent phenomena in cellular automation modeling. Kybernetes: The International Journal of Systems and Cybernetics 31(9/10), 2002.
[4] Lingras, P.: Rough neural networks. In: Proceedings of the Sixth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU’96), 1996, 1445–1450.
[5] Polkowski, L., Tsumoto, S., Lin, T.Y.: Rough Set Methods and Applications, Physica-Verlag, 2000.
Using Fuzzy Dependency-Guided Attribute Grouping in Feature Selection

Richard Jensen and Qiang Shen

Centre for Intelligent Systems and their Applications, School of Informatics, The University of Edinburgh
{richjens,qiangs}@dai.ed.ac.uk
Abstract. Feature selection has become a vital step in many machine learning techniques due to their inability to handle high dimensional descriptions of input features. This paper demonstrates the applicability of fuzzy-rough attribute reduction and fuzzy dependencies to the problem of learning classifiers, resulting in simpler rules with little loss in classification accuracy.
1
Introduction
The task of gathering information and extracting general knowledge from it is known to be the most difficult part of creating a knowledge-based system. The present work aims to induce low-dimensionality rulesets from historical descriptions of domain features which are often of high dimensionality. In particular, a recent fuzzy rule induction algorithm (RIA), as first reported in [2], is taken to act as the starting point for this. In order to speed up the RIA and reduce rule complexity, a preprocessing step is required. Fuzzy-rough attribute reduction (FRAR) with fuzzified dependency is introduced here to address this issue. This step reduces the dimensionality of potentially very large feature sets while minimising the loss of information needed for rule induction.
2
Fuzzy-Rough Attribute Reduction
Rough Set Attribute Reduction (RSAR), described in [6], can only operate effectively on datasets containing discrete values. As most datasets contain real-valued attributes, it is necessary to perform a discretization step beforehand. However, membership degrees of attribute values to fuzzy sets are not exploited in the process of RSAR. By using fuzzy-rough sets [3], it is possible to use this information to better guide attribute selection. Let I = (U, A) be an information system, where U is a non-empty set of finite objects (the universe); A is a non-empty finite set of attributes such that a : U → Va for every a ∈ A; Va is the value set for attribute a. In a decision system, A = {C ∪ D} where C is the set of conditional attributes and D is the set of decision attributes.
The fuzzy lower approximation is defined here as:

µPX(x) = sup_{F∈U/P} min(µF(x), inf_{y∈U} max{1 − µF(y), µX(y)})   (1)
The tuple <PX, PX>, consisting of the fuzzy lower and upper approximations, is called a fuzzy-rough set. FRAR builds on the notion of the fuzzy lower approximation to enable reduction of datasets containing real-valued attributes [4]. The membership of an object x ∈ U to the fuzzy positive region can be defined by

µPOS_P(D)(x) = sup_{X∈U/D} µPX(x)   (2)
Using the definition of the fuzzy positive region, the new dependency function can be defined as follows:

γ′P(D) = |µPOS_P(D)(x)| / |U| = (Σ_{x∈U} µPOS_P(D)(x)) / |U|   (3)
More details of the fuzzy-rough attribute reduction and the fuzzy-rough QuickReduct algorithm can be found in [4]. This employs the new dependency function γ to choose which attribute to add to the current list of selected features (termed a reduct candidate) in the same way as the original QuickReduct process [6]. The algorithm terminates when the addition of any remaining attribute does not increase the dependency (such a criterion could be used with the original QuickReduct algorithm).
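To make equations (1)–(3) concrete, the sketch below (fuzzy partitions given as membership dictionaries; all values are invented for illustration) computes the fuzzy positive region and the dependency degree:

def mu_lower(P_classes, X, x, universe):
    """Fuzzy lower approximation membership (equation (1))."""
    return max(min(F[x], min(max(1 - F[y], X[y]) for y in universe))
               for F in P_classes)

def fuzzy_dependency(P_classes, D_classes, universe):
    """gamma'_P(D): mean membership to the fuzzy positive region, (2)-(3)."""
    pos = {x: max(mu_lower(P_classes, X, x, universe) for X in D_classes)
           for x in universe}
    return sum(pos.values()) / len(universe)

U = ["o1", "o2"]
# fuzzy equivalence classes for the conditional attributes P ...
P_classes = [{"o1": 0.8, "o2": 0.2}, {"o1": 0.2, "o2": 0.8}]
# ... and fuzzy decision classes D
D_classes = [{"o1": 0.9, "o2": 0.1}, {"o1": 0.1, "o2": 0.9}]
print(fuzzy_dependency(P_classes, D_classes, U))  # 0.8 on this toy data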
3
Fuzzy Dependency
By its definition, the degree of dependency measure (whether using crisp or fuzzy-rough sets) always lies in the range [0,1], with 0 indicating no dependency and 1 indicating total dependency. By fuzzifying the output values of the dependency function it is hoped that the problem of noise and non-generality can be successfully tackled. In addition to this, attributes may be grouped at stages in the selection process depending on their dependency label, speeding up the reduct search. The goal of RSAR and FRAR is to find a (possibly minimal) subset of the conditional attributes for which the degree of dependency is at a maximum (typically the value 1). In the case of fuzzy equivalence classes, where an element of uncertainty is introduced, the maximum degree of dependency may be substantially less than this. In fact, the maximum dependency for different datasets may be quite different due to differing levels of uncertainty. The maximum for dataset A may be 0.9 whereas for dataset B the maximum may be only 0.2. Given a degree of dependency of 0.19, for dataset A this is quite a small value but for dataset B this is quite large, so a certain way of scaling the dependency
value depending on the dataset is required. The following is one potential way of achieving this for a subset P of all conditional attributes C:

γ′P(D) = (γP(D) − γlow(D)) / (γhigh(D) − γlow(D))
(1)  R ← {}, γbest ← 0, γprev ← 0
(2)  do
(3)    Cands ← {}, γprev ← γbest, γhigh ← 0, γlow ← 1
(4)    ∀x ∈ (C − R)
(5)      T ← R ∪ {x}
(6)      Cands ← Cands ∪ (x, γT(D))
(7)      if γT(D) > γhigh then γhigh ← γT(D)
(8)      else if γT(D) < γlow then γlow ← γT(D)
(9)    Cands ← scale(Cands, γhigh, γlow)
(10)   R ← R ∪ selectFeatures(Cands)
(11)   γbest ← γR(D)
(12) until γbest = γprev
Fig. 1. The new fuzzy-rough QuickReduct algorithm with fuzzy dependencies
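A compact Python rendering of the loop in Fig. 1 may help; the dependency function gamma, the grouping threshold group_cut and all identifiers are our assumptions, and the highest-group selection shown is just one of the strategies discussed below.

# A sketch of the selection loop of Fig. 1, assuming a dependency
# function gamma(subset) is available (e.g. the fuzzy-rough dependency
# sketched above). Names and the grouping threshold are illustrative.
def scale(cands, g_high, g_low):
    # Rescale raw dependencies into [0, 1] as described in the text.
    span = (g_high - g_low) or 1.0
    return {a: (g - g_low) / span for a, g in cands.items()}

def fuzzy_quickreduct(attrs, gamma, group_cut=0.8):
    R, g_best, g_prev = set(), 0.0, -1.0
    while g_best != g_prev:
        g_prev = g_best
        cands = {a: gamma(R | {a}) for a in attrs - R}
        if not cands:
            break
        scaled = scale(cands, max(cands.values()), min(cands.values()))
        # Group strategy: take every attribute in the highest-dependency group.
        R |= {a for a, s in scaled.items() if s >= group_cut}
        g_best = gamma(R)
    return R

# Usage: fuzzy_quickreduct({"a", "b", "c"}, gamma) for any dependency gamma.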
The new fuzzy-rough QuickReduct algorithm (FQR) which employs scaling and fuzzy dependencies can be seen in figure 1. In using fuzzy degrees of dependency, a variety of selection strategies may be used. Indeed, it is possible to change strategy at any stage of the attribute selection process. The main distinction to make in the set of possible strategies is whether features are chosen individually or in groups. In order to evaluate the utility of the present work and to illustrate its domain-independence, a challenging test dataset was chosen, namely the Water Treatment Plant Database [1]. The dataset consists of historical data charted over 521 days, with 38 different input features measured daily. Each day is classified into one of thirteen categories depending on the operational status of the plant. However, these can be collapsed into just two or three categories (i.e. Normal and Faulty, or OK, Good and Faulty) for plant monitoring purposes as many classifications reflect similar performance. It is likely that not all of the 38 input features are required to determine the status of the plant, hence the dimensionality reduction step. However, choosing the most informative features is a difficult task as there will be many dependencies between subsets of features.
4 Results
In all experiments here, FQR employs the strategy whereby all attributes belonging to the highest dependency group are selected at each stage. For the 2-category problem, the FRAR feature selector returns 10 features out of the original 38, whereas FQR returns 12.

Fig. 2. Classification accuracies for the 2-category dataset.

Figure 2 compares the classification accuracies of the reduced and unreduced datasets on both the training and testing data. The best results for FQR were obtained in the range 0.85 to 0.88, producing a classification accuracy of 82.6% on the training set and 83.9% on the test data. For FRAR, the best accuracies were 83.3% (training) and 83.9% (testing). Compare this with the optimum for the unreduced approach, which gave an accuracy of 78.5% for the training data and 83.9% for the test data. As can be seen, both the FRAR and FQR results are almost always better than the unreduced accuracies over the tolerance range.
The 3-category dataset is a more challenging problem, reflected in the overall lower classification accuracies produced. Both fuzzy-rough methods, FQR and FRAR, choose 11 of the original 38 features (but not the same features). The results of these approaches can be seen in Figure 3. Again, it can be seen that both FQR and FRAR outperform the unreduced approach on the whole. The best classification accuracy obtained for FQR was 72.1% on the training data and 74.8% on the test data. For FRAR the best results were 70.0% (training) and 71.8% (testing). In this case, FQR has found a better reduction of the data. For the unreduced approach, the best accuracy obtained was 64.4% on the training data and 64.1% on the test data.
Fig. 3. Classification accuracies for the 3-category dataset.
5 Conclusion
This paper has investigated the use of FRAR and FQR as pre-processors to fuzzy rule induction. It has demonstrated the potential benefits of using fuzzy dependencies and attribute grouping in the search
for reducts. Not only are the runtimes of the induction and classification processes improved by this step (which for some systems are important factors), but the resulting rules are less complex. Further investigations into FQR include optimization of the dependency fuzzification and selection strategy comparisons.
References

1. C.L. Blake and C.J. Merz. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science. http://www.ics.uci.edu/~mlearn/
2. S. Chen, S.L. Lee and C. Lee. A new method for generating fuzzy rules from numerical data for handling classification problems. AAI, 15(7):645-664, 2001.
3. D. Dubois and H. Prade. Putting rough sets and fuzzy sets together. In Intelligent Decision Support, pp. 203-232. Kluwer Academic Publishers, Dordrecht, 1992.
4. R. Jensen and Q. Shen. Fuzzy-rough sets for descriptive dimensionality reduction. Proceedings of the 11th International Conference on Fuzzy Systems, pp. 29-34, 2002.
5. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht, 1991.
6. Q. Shen and A. Chouchoulas. A modular approach to generating fuzzy rules with reduced attributes for the monitoring of complex systems. EAAI, 13(3):263-278, 2000.
Conjugate Information Systems: Learning Cognitive Concepts in Rough Set Theory

Maria Semeniuk-Polkowska¹ and Lech Polkowski²

¹ Chair of Formal Linguistics, Warsaw University, Browarna 8/12, 00991 Warsaw, Poland
² Polish-Japanese Institute of Information Technology, Koszykowa 86, 02008 Warsaw, Poland
{Lech.Polkowski, polkow}@pjwstk.edu.pl
Abstract. The problem of concept learning [1] via rough sets has recently been discussed in the literature, e.g. cf. [6], [3], [7]. To study this problem formally, we introduce the notion of a conjugate information system, cf. [3].
Keywords: rough sets, information systems, conjugate information systems, learning of cognitive concepts.
1 Introduction
All notions relevant to rough sets may be found in [2], [5] or [4]. An information system A is defined here as a triple (U, A, h), where U is a finite set of objects and A is a finite set of attributes. A mapping h : U × A → V, with h(u, a) = a(u), is an A-value assignment, where V = ∪{V_a : a ∈ A} is the attribute value set. An information system A_d = (U, A ∪ {d}, h) with d ∉ A is called a decision system; we regard d as a decision dependent on A.
2 Conjugate Information Systems
By a conjugate information system we understand a triple C = ({A_i = (U_i, A_i, h_i) : i ∈ I}, {F_a^d : a ∈ A}, i_0) consisting of a family of information systems {A_i = (U_i, A_i, h_i) : i ∈ I} such that for some finite sets U, A we have U_i = U, A_i = A for i ∈ I, a family of decision systems {F_a^d : a ∈ A}, and a singled-out index i_0 ∈ I. Thus the difference between information systems A_i, A_j with i ≠ j ∈ I lies in the functional assignments h_i, h_j. The information system corresponding to i_0 is said to be the tutor system. For each a ∈ A, the decision system F_a^d is a triple (U, Feat_a ∪ {a}, h_a), where U is the C-universe, Feat_a is the set of a-features, each f ∈ Feat_a being a binary attribute, a is the decision attribute of the system F_a^d, and h_a is the value assignment. We assume that each system F_a^d is reduced in the sense
that for each value v_a of the attribute a there exists a unique vector of feature values f*(u) = (f(u) : f ∈ Feat_a) with h_a(f*(u), a) = v_a; then (h_a^{-1}(v_a))_f is f(u). The system F_a^d will be regarded as the system belonging to the tutor, while its realization by an agent i ∈ I will be denoted F_a^d(i) = (U, Feat_a, h_a^i).

2.1 A Metric on a Conjugate System
We introduce a distance function dist on C. To this end, we first introduce, for an object x ∈ U, the set DIS_{i,j}(x) = {a ∈ A : h_i(a, x) ≠ h_j(a, x)} of attributes discerned on x between A_i and A_j. Then we let

dist(A_i, A_j) = max_x |DIS_{i,j}(x)| / |A|,

where for a set M the symbol |M| denotes the number of elements of M. Learning may start with the pupil(s) closest to the tutor.
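A small Python sketch of this metric follows; the table layout ({object: {attribute: value}}) and the toy data are our assumptions, not part of the original paper.

# A minimal sketch of the distance dist(A_i, A_j) between two agents'
# information systems, given as {object: {attribute: value}}.
def dist(h_i, h_j, attributes):
    # max over objects of |DIS_{i,j}(x)| / |A| (Section 2.1).
    worst = 0
    for x in h_i:
        dis = sum(1 for a in attributes if h_i[x][a] != h_j[x][a])
        worst = max(worst, dis)
    return worst / len(attributes)

# Toy example: the two systems differ only on attribute b for object x1.
A0 = {"x1": {"a": 1, "b": 2}, "x2": {"a": 0, "b": 2}}
A1 = {"x1": {"a": 1, "b": 3}, "x2": {"a": 0, "b": 2}}
print(dist(A0, A1, ["a", "b"]))  # 0.5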
2.2 Learning Attribute Values from Features
In the learning stage, each agent A_j (represented by the corresponding information system) learns to assign values of attributes in A from features in the decision system F_a^d. First, in the training stage, each agent learns to assign correct values to features in the sets Feat_a = {f_{k,a} : k = 1, 2, ..., n_a}. The measure of learning quality is the assurance-level function m_j(k, a); for each triple j, k, a it is defined as m_j(k, a) = pos(j, k, a) / ex(j, k, a), where ex(j, k, a) is the number of examples for learning f_{k,a} and pos(j, k, a) is the number of examples classified positively by the agent A_j. We order the features in F_a^d according to decreasing values of m_j(k, a); the resulting linear order is denoted ρ_j^a, and the system F_a^d with values assigned by the agent A_j is denoted by the symbol F_a^d(j). Similarly, for each attribute a ∈ A and index j, we evaluate in the training session the value M(j, a) = Pos(j, a) / ex(j, a), where Pos(j, a) is the number of positively evaluated values of a and ex(j, a) is the number of examples; ex(j, a) = Σ_k ex(j, k, a). From these data we extract the following parameters: a_B^j = argmax_{a ∈ B} M(j, a), f_C^{j,a} = argmin_{k ∈ C} m_j(k, a). We also set a distance function φ_a(v, w) on the values of the attribute a, for each a ∈ A and v, w ∈ V_a, as estimated in the system of type F_a^d, by letting φ_a(v, w) = |DIS_a(v, w)| / |Feat_a|, where DIS_a(v, w) = {f ∈ Feat_a : (h_a^{-1}(v))_f ≠ (h_a^{-1}(w))_f}. Thus φ_a(v, w) expresses the distance at the pair v, w, measured via the number of differently classified features in the rows defined by v and w, respectively.
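The two quality measures can be illustrated with a short Python sketch; the counts and feature vectors below are hypothetical, and only the formulas m_j(k, a) = pos/ex and φ_a(v, w) = |DIS_a(v, w)| / |Feat_a| are taken from the text.

# A small sketch of the learning-quality measures of Section 2.2.
def assurance(pos, ex):
    # m_j(k, a) = pos(j, k, a) / ex(j, k, a).
    return pos / ex

def phi_a(v_features, w_features):
    # phi_a(v, w): fraction of features classified differently for v and w.
    dis = sum(1 for f in v_features if v_features[f] != w_features[f])
    return dis / len(v_features)

# Hypothetical feature vectors h_a^{-1}(v) for two values of an attribute a:
v = {"f1": "+", "f2": "+", "f3": "+"}
w = {"f1": "+", "f2": "+", "f3": "-"}
print(assurance(18, 20), phi_a(v, w))  # 0.9 and 1/3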
3 Maximin Learning of Attribute Values
We now address the problem of learning the correct evaluation of attribute values from the tutor. Objects x ∈ U are sent to each agent A_i, i ∈ I, one by one.
Step 1. Assignment of attribute values based on training examples. At this stage, the distances dist(A_i, A_{i_0}) from the agents A_i to the tutor A_{i_0} are calculated.
Step 2. The feedback information passed from the tutor to the agent A_i is the following: Inf_i = (r, Error-set_i = {a_{i_1}, ..., a_{i_r}}, Error-vector_i = [s_{i_1}, ..., s_{i_r}]), where r is the distance from i to the tutor, a_{i_j} is the j-th misclassified attribute, and for j ∈ {1, ..., r} the value s_{i_j} is the distance φ_{a_{i_j}}(v_j, w_j), where v_j is the correct (tutor) value of the attribute a_{i_j} and w_j is the value assigned to a_{i_j} by the agent A_i.
Step 3. The given agent A_i begins with the attribute a ∈ Error-set_i for which the value of the assurance-level function is maximal (possibly selected at random from the attributes with this property). For the attribute a, the value s = s_a is given, meaning that s × 100 percent of the features have been assigned incorrect values by A_i in the process of determining the value of a.
Step 4. The features in Feat_a are now ordered into a binary (sub)tree T_a(i) according to decreasing values of the assurance-level function m_i(k, a), i.e. by ρ_i^a: starting with the feature f giving the minimal value of the function m_i(k, a), the agent i goes up the tree T_a(i), changing the value at subsequent nodes. If the value of φ remains unchanged after the change at a node, the error counter remains unchanged; otherwise its value is decremented/incremented by one.
Step 5. When the error counter reaches the value 0, stop and go to the next attribute.
Remark. In real conditions, each stage of Step 4 is accompanied by an exchange of communiqués between the tutor and the pupil in order to explain the reason for acceptance or not of the changed value.
4 A Hand Example
Our example concerns grading essays written by students in French, cf. [3]. Grading is done on the basis of three attributes: a1: grammar, a2: structure, and a3: creativity. We present below tables showing the tutor decision systems F_a for a = a1, a2, a3.

Example 1. Decision systems F_{a1}, F_{a2}, F_{a3}.

Table 1. Decision systems F_{a1}, F_{a2}, F_{a3}: for each attribute a ∈ {a1, a2, a3} and each grade value a = 3, 2, 1, the +/− values of the three binary features f¹_a, f²_a, f³_a.
where f¹_{a1} takes the value +/− when the percentage of declension errors is ≥/< 20; f²_{a1} is +/− when the percentage of conjugation errors is ≥/< 20; and f³_{a1} is +/− when the percentage of syntax errors is ≥/< 20. f¹_{a2} takes the value +/− when the structure is judged rich/not rich, f²_{a2} is +/− when the structure is judged medium/not
medium, and f³_{a2} is +/− when the structure is judged weak/not weak. f¹_{a3} takes the value +/− when the lexicon is judged rich/not rich, f²_{a3} is +/− when the source usage is judged extensive/not extensive, and f³_{a3} is +/− when the analysis is judged deep/not deep.
Consider a pupil A_1 and a testing information system with U = {t1, t2, t3}, A = {a1, a2, a3}, which is completed with the following value assignments.

Example 2. Information systems A_0 of the tutor and A_1 of the pupil.

Table 2. Information systems of the tutor and of the pupil.

Tutor A_0:           Pupil A_1:
t   a1  a2  a3       t   a1  a2  a3
t1  1   2   1        t1  1   2   2
t2  1   1   1        t2  1   1   2
t3  3   2   3        t3  3   2   3
The distance dist(A_0, A_1) is equal to 1, as DIS_{0,1}(t1) = {a3} = DIS_{0,1}(t2) and DIS_{0,1}(t3) = ∅. Thus, the pupil misclassified the attribute a3 due to a faulty selection of feature values: in the case of t1, the selection by the tutor is +,+,+ and by the pupil +,+,−. The distance φ_{a3,1} is equal to 1, and the information sent to the pupil in the case of t1 is Inf_1 = (1, {a3}, (1)). Assuming the values of the assurance-level function m(1, k, a3) are such that the minimal value is attained at f³_{a3}, the pupil starts with f³_{a3} and error counter = 1, and changing the value at that node reduces the error to 0. This procedure is repeated with t2, etc.
References

1. T.M. Mitchell. Machine Learning. McGraw Hill, Boston, 1997.
2. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht, 1992.
3. M. Semeniuk-Polkowska. Seminar Notes. Applications of Rough Set Theory, Fasc. I, II, III. Warsaw University, 2000-2002.
4. L. Polkowski. Rough Sets. Mathematical Foundations. Physica/Springer, Heidelberg, 2002.
5. L. Polkowski and A. Skowron. Rough Sets in Knowledge Discovery, vols. 1, 2. Physica/Springer, Heidelberg, 1998.
6. V. Dubois and M. Quafafou. Concept learning with approximations: rough version spaces. LNAI 2475, 239-246, Springer, Berlin, 2002.
7. E. Stępień. A study of functional aspects of a public library by rough set techniques. PhD Thesis, Warsaw University, Department of Library and Information Sciences. M. Semeniuk-Polkowska, supervisor, 2002.
A Rule Induction Method of Plant Disease Description Based on Rough Sets

Ai-Ping Li, Gui-Ping Liao, and Quan-Yuan Wu

613#, School of Computer, National University of Defense Technology, Changsha 410073, P.R. China
[email protected]
Abstract. Knowledge acquisition is the bottleneck in developing expert systems; acquiring plant disease knowledge using traditional methods usually takes a long time. To address this problem, this paper describes the relations between rough set theory and rule-based descriptions of plant diseases. The exclusive rules, inclusive rules and disease images of rapeseed diseases are then built on the basis of the PDES diagnosis model, and the definition of probabilistic rules is put forward. Finally, the paper presents the rule-based automated induction reasoning method, including exhaustive search, a post-processing procedure, estimation by statistical tests, and the bootstrap and resampling methods. We also introduce the automated induction of rule-based descriptions, which is used in our plant disease diagnostic expert system. The experimental results show that this method induces diagnostic rules correctly.
1 Introduction

In this paper, we discuss the rule induction method used in our plant disease diagnostic expert system (for short, PDES). In detail, we illustrate this process using the example of knowledge acquisition for rapeseed plant diseases. We not only discuss the relations between rough set theory and the rule-based description of diseases, which consists of characterizing rules, discriminating rules, and the collection of observations, but also discuss the induction of such knowledge from rapeseed disease databases. The implemented results show that rough set theory gives a very suitable framework for representing processes of uncertain knowledge extraction and for inducing disease descriptions correctly. In the PDES system, we apply a diagnosing model proposed by Matsumura et al. [2], composed of the following three kinds of reasoning processes: exclusive reasoning, inclusive reasoning, and reasoning about complications. Tsumoto [4] applied this model to a neurological disease diagnostic expert system and obtained fairly good results. Based on this diagnostic model, we need to acquire exclusive rules, inclusive rules, and disease images.
2 Rough Sets and PDES Rules

Table 1 is a small example of a database, which records rapeseed plants affected by disease. The classification of training samples D can be viewed as a search for the best set [x]_R supported by the relation R. In this way, we can define the characteristics of classification in the set-theoretic framework. In the subsequent sections, we use the above notations to describe PDES rules and an induction algorithm.

Table 1. Part of the database of rapeseed diseases

Id  Age       Loc     Nature           Tenor           Speckle        Process                   ...  Class
1   seedling  root    White floccular  Black bacteria  rufous spot    Whiten, tissue rot        ...  Sclerotinia sclerotiorum
2   Grown     leaf    White floccular  Black bacteria  cyan wheel     Whiten, speckle spread    ...  Sclerotinia sclerotiorum
3   Grown     leaf    White floccular  brown mold      fuscous wheel  Darken, not spread        ...  Alternaria
4   Grown     stem    White floccular  Black bacteria  beige stain    Whiten, spread till root  ...  Sclerotinia sclerotiorum
5   Grown     stem    White floccular  Black bulkload  beige erose    Whiten, spread to root    ...  Phoma lingam
6   Grown     flower  White floccular  Black bacteria  beige stain    Whiten, tissue rot        ...  Sclerotinia sclerotiorum
2.1 Definition of Probabilistic Rules

Definition 1 (Probabilistic Rules). Let R_f be an equivalence relation specified by some assignment function f, and let D denote a set whose elements belong to a class d, i.e. the positive examples in the whole set of training samples (the universe) U. Finally, let |D| denote the cardinality of D. A probabilistic rule of D is defined as a quadruple ⟨R_f →^{α,κ} d, α_{R_f}(D), κ_{R_f}(D)⟩, where R_f →^{α,κ} d satisfies the following conditions:

(1) [x]_{R_f} ∩ D ≠ ∅,
(2) α_{R_f}(D) = |[x]_{R_f} ∩ D| / |[x]_{R_f}|,
(3) κ_{R_f}(D) = |[x]_{R_f} ∩ D| / |D|.

In the above definition, α corresponds to the accuracy measure: if α of a rule is equal to 0.9, then the accuracy is also equal to 0.9. On the other hand, κ is a statistical measure of what proportion of D is covered by the rule, that is, a coverage or true positive rate: when κ is equal to 0.5, half of the members of the class belong to the set whose members satisfy that equivalence relation. For example, let us consider the rule [nature = white floccular] → Sclerotinia sclerotiorum. Since [x]_{[nature = white floccular]} = {1, 2, 3, 4, 5, 6} and D = {1, 2, 4, 6}, we have α_{[nature = white floccular]}(D) = |{1, 2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 0.67 and κ_{[nature = white floccular]}(D) = |{1, 2, 4, 6}| / |{1, 2, 4, 6}| = 1. Thus, if a rapeseed plant whose nature of symptom is white floccular is observed, Sclerotinia sclerotiorum is suspected with accuracy 0.67, and this rule covers 100% of the cases.
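A minimal Python sketch of Definition 1, reproducing the worked example above (the set encoding of the equivalence class and of D is our own):

# Accuracy and coverage of a probabilistic rule (Definition 1).
def alpha_kappa(eq_class, D):
    # alpha = |[x] n D| / |[x]|, kappa = |[x] n D| / |D|.
    hit = eq_class & D
    return len(hit) / len(eq_class), len(hit) / len(D)

eq = {1, 2, 3, 4, 5, 6}      # [x]_{nature = white floccular}
D = {1, 2, 4, 6}             # cases of Sclerotinia sclerotiorum
print(alpha_kappa(eq, D))    # (0.666..., 1.0)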
2.2 PDES Diagnostic Rules

Using these notations, PDES diagnostic rules are described in the following way.

(1) Exclusive rules: R →^{α,κ} d s.t. R = ∧_i R_i and κ_{R_i}(D) = 1.0.
In the above example, the relation R for "Sclerotinia sclerotiorum" is described as: ([age = seedling] ∨ [age = grown]) ∧ ([location = root or leafstalk] ∨ [location = leaf] ∨ [location = stem] ∨ [location = flower or fruit]) ∧ [nature = white floccular] ∧ [tenor = black bacteria gill] ∧ ([speckle = rufous spot] ∨ [speckle = cyan wheel] ∨ [speckle = beige stain]) ∧ ([process = whiten, tissue rot] ∨ [process = whiten, speckle spread] ∨ [process = whiten, spread till root]).

(2) Inclusive rules: R →^{α,κ} d s.t. R = ∨_i R_i = ∨_i ∧_j ∨_k [a_j = v_k], α_{R_i}(D) > δ_α, and κ_{R_i}(D) > δ_κ.
In the above example, the simplest relation R for "Sclerotinia sclerotiorum" is described as: [nature = white floccular] ∨ [tenor = black bacteria gill] ∨ [speckle = beige stain] ∨ [process = whiten, tissue rot]. However, the induction of inclusive rules gives us two problems. First, SI and CI are overfitted to the training samples. Second, the above rule is only one of many rules that can be induced from the training samples; therefore, some of them should be selected from the primarily induced rules under some preference criterion.

(3) Disease image: R →^{α,κ} d s.t. R = ∨_i R_i = ∨_i [a_i = v_j], α_{R_i}(D) > 0 (κ_{R_i}(D) > 0).
3 Induction of Rules

An induction algorithm of PDES rules consists of two procedures. One is an exhaustive search procedure that induces the exclusive rule and the disease image through all the attribute-value pairs, and the other is a post-processing procedure that induces inclusive rules through the combinations of all the attribute-value pairs.
3.1 Exhaustive Search

Let D denote the training samples of the target class d, or the positive examples. In the above example of Table 1, let d be "Sclerotinia sclerotiorum", and let [nature = white floccular] be selected as [a_i = v_j]. Since the intersection [x]_{[nature = white floccular]} ∩ D ({1, 2, 4, 6}) is not equal to ∅, this pair is included in the disease image. However, since α_{[nature = white floccular]}(D) = 0.67, this pair is not included in the inclusive rule. Finally, since D ⊂ [x]_{[nature = white floccular]} (= {1, 2, 3, 4, 5, 6}), this pair is also included in the exclusive rule. When all the attribute-value pairs have been examined, not only the exclusive rule and disease image shown in the above section, but also the candidates of inclusive rules are derived. The latter are used as inputs of the second procedure.

3.2 Post-processing Procedure

Because the definition of inclusive rules is a little weak, many inclusive rules can be obtained. In the above example, the equivalence relation [age = seedling] satisfies [x]_{[age = seedling]} ⊆ D, so it is also one of the inclusive rules of "Sclerotinia sclerotiorum", although the SI of that rule is equal to 1/4. In order to suppress the induction of such rules, which have low classificatory power, only equivalence relations whose SI is larger than 0.5 are selected. For example, since the above relation [age = seedling] falls below this precision, it is eliminated from the candidates of inclusive rules. Furthermore, RPDES minimizes the number of attributes so as not to include attributes that add no classificatory power, called dependent variables.

3.3 Cross-Validation and the Bootstrap Method

The cross-validation method for error estimation is performed as follows: first, the whole set of training samples L is split into V blocks: {L_1, L_2, ..., L_V}. Secondly, repeat V times the procedure in which rules are induced from the training samples L − L_i (i = 1, ..., V) and the error rate err_i of the rules is examined using L_i as test samples. Finally, the whole error rate err is derived by averaging err_i over i, that is,
err = (1/V) Σ_{i=1}^{V} err_i
(this method is called V-fold cross-validation). On the other hand, the bootstrap method is executed as follows: first, an empirical probability distribution (F_n) is generated from the original training samples [1]. Secondly, the Monte Carlo method is applied and training samples are randomly drawn using F_n. Thirdly, rules are induced using the new training samples. Finally, these results are tested against the original training samples and statistical measures, such as the error rate, are calculated. These four steps are iterated a finite number of times. Empirically, it has been shown that about 200 repetitions are sufficient for estimation [1].
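For concreteness, a Python sketch of the V-fold procedure follows; induce() and error_rate() are placeholders standing in for the rule inducer and its evaluation, which are not specified here.

# A minimal sketch of V-fold cross-validation as described above.
import random

def cross_validate(samples, V, induce, error_rate):
    # Split into V blocks, train on L - L_i, test on L_i, average err_i.
    samples = list(samples)
    random.shuffle(samples)
    blocks = [samples[i::V] for i in range(V)]
    errs = []
    for i in range(V):
        train = [s for j, b in enumerate(blocks) if j != i for s in b]
        rules = induce(train)                 # placeholder rule inducer
        errs.append(error_rate(rules, blocks[i]))
    return sum(errs) / V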
4 Experimental Results and Discussion

We evaluate RPDES on the datasets of the PDES domain, which consist of 1464 samples, 51 classes, and 16 attributes. In this experiment, δ_α and δ_κ are set to 0.75 and 0.5, respectively. Experimental results are shown in Table 2.

Table 2. Experimental Results

Methods  ER-A   IR-A   DI-A
RPDES    92.1%  84.0%  89.7%
Expert   97.0%  92.0%  94.0%
CART     —      83.3%  —
AQ15     —      84.7%  —
R-CV     71.6%  77.2%  81.5%
BS       95.6%  89.4%  92.1%

ER-A: Exclusive Rule Accuracy; IR-A: Inclusive Rule Accuracy; DI-A: Disease Image Accuracy.
In this paper, we introduce an induction method of disease description from plant disease databases. The experimental results not only show that rough set theory gives a very suitable framework to represent processes of uncertain knowledge extraction, but also that this method induces diagnostic rules correctly. However, this method does not incorporate all the reasoning processes, such as gathering rules of similar diseases (aggregation), and interpretation of characterization using domain knowledge (interpretation), which will be our future work.
References

[1] Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
[2] Matsumura, Y., Matunaga, T., Hata, Y., et al.: Consultation system for diagnoses of headache and facial pain: RHINOS. Med. Inform. 11 (1986) 145-157.
[3] Pawlak, Z.: Rough Sets. Dordrecht: Kluwer Academic Publishers, 1991.
[4] Tsumoto, S., Tanaka, H.: Induction of medical expert system rules based on rough sets and resampling methods. Proceedings of the Eighteenth Annual Symposium on Computer Applications in Medical Care (Journal of the AMIA 1, supplement), pp. 1066-1070, 1999.
Rough Set Data Analysis Algorithms for Incomplete Information Systems

K.S. Chin¹, Jiye Liang², and Chuangyin Dang¹

¹ Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong
² Department of Computer Science, Shanxi University, Taiyuan, 030006, China
[email protected]

Abstract. The rough set theory is a relatively new soft computing tool for dealing with vagueness and uncertainty in databases. To apply this theory, it is important to associate it with effective computational methods. In this paper, we focus on the development of algorithms for incomplete information systems and on their time and space complexity. In particular, using a measure of attribute significance defined by us, we present a heuristic algorithm for computing the minimal reduct; the time complexity of this algorithm is O(|A|³|U|²), and its space complexity is O(|A||U|). The minimal reduct algorithm is very useful for knowledge discovery in databases.
1 Introduction
The rough set theory as proposed in [1, 2] provides a formal tool for dealing with imprecise or incomplete information. This approach seems fundamentally important to artificial intelligence and cognitive science. In this paper, we extend the rough set data analysis algorithms of [3] to the incomplete information systems of [4]. Basic computational methods of rough set data analysis are given for incomplete information systems, and their time and space complexity are analyzed. The proposed minimal reduct algorithm is very useful for knowledge discovery in databases.
2 Incomplete Information Systems

Let S = (U, A) be an incomplete information system [4, 5]; we denote a null value (i.e., a missing attribute value) by ∗. Let P ⊆ A. We define the tolerance relation: SIM(P) = {(u, v) ∈ U × U | ∀a ∈ P, a(u) = a(v) or a(u) = ∗ or a(v) = ∗}. Let S_P(u) denote the object set {v ∈ U | (u, v) ∈ SIM(P)}. S_P(u) is the maximal set of objects which are possibly indistinguishable from u by P. Let U/SIM(P) denote the classification, which is the family set {S_P(u) | u ∈ U}. Any element of U/SIM(P) will be called a tolerance class or a granule of information.
3 Computing the Lower Approximation and Upper Approximation
Let S = (U, A) be an incomplete information system, W ⊆ U, and a ∈ A. For a classification U/SIM({a}), R_a W = {u ∈ U | S_a(u) ⊆ W} is called the lower approximation of W from U/SIM({a}), and R̄_a W = {u ∈ U | S_a(u) ∩ W ≠ ∅} is called the upper approximation of W from U/SIM({a}). We now present an algorithm for computing the lower approximation.

Algorithm L. Let S = (U, A) be an incomplete information system, let U/SIM({a}) = {S_a(u_1), S_a(u_2), ..., S_a(u_|U|)}, and let W ⊆ U. This algorithm gives the lower approximation R_a W = {u ∈ U | S_a(u) ⊆ W} of W from U/SIM({a}).
L1. Input U/SIM({a}).
L2. Set ∅ → L.
L3. For i = 1 to |U| do: if S_a(u_i) ⊆ W, then L ∪ {u_i} → L.
L4. Output L.

The time complexity of Algorithm L is O(|U|), and its space complexity is O(1). Similarly, we can design an algorithm to compute the upper approximation as follows.

Algorithm H. Let S = (U, A) be an incomplete information system, let U/SIM({a}) = {S_a(u_1), S_a(u_2), ..., S_a(u_|U|)}, and let W ⊆ U. This algorithm gives the upper approximation R̄_a W = {u ∈ U | S_a(u) ∩ W ≠ ∅} of W from U/SIM({a}).
H1. Input U/SIM({a}).
H2. Set ∅ → H.
H3. For i = 1 to |U| do: if S_a(u_i) ∩ W ≠ ∅, then H ∪ {u_i} → H.
H4. Output H.

The time complexity of Algorithm H is O(|U|), and its space complexity is O(1).
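Algorithms L and H translate almost verbatim into Python; the sketch below assumes the tolerance classes {u: S_a(u)} have already been computed, e.g. as in the previous sketch.

# Direct transcriptions of Algorithms L and H.
def lower_approx(classes, W):
    # Algorithm L: R_a W = {u : S_a(u) subset of W}.
    return {u for u, S in classes.items() if S <= W}

def upper_approx(classes, W):
    # Algorithm H: upper R_a W = {u : S_a(u) meets W}.
    return {u for u, S in classes.items() if S & W}

classes = {"u1": {"u1", "u2"}, "u2": {"u1", "u2"}, "u3": {"u3"}}
W = {"u1", "u3"}
print(lower_approx(classes, W), upper_approx(classes, W))
# {'u3'} and {'u1', 'u2', 'u3'}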
4 Significance and Core
Definition 4.1 Let S = (U, A) be an incomplete information system. Let X be a non-empty subset of A: ∅ ⊂ X ⊆ A. Given an attribute x ∈ X, we say
that x is significant in X if U/SIM(X) ⊂ U/SIM(X − {x}), and that x is not significant (non-significant) in X if U/SIM(X) = U/SIM(X − {x}). In the following we introduce a quantitative measure of significance.
Definition 4.2. Let X be a non-empty subset of A: ∅ ⊂ X ⊆ A. Given an attribute x ∈ X, we define the significance of x in X as

sig_{X−{x}}(x) = Σ_{i=1}^{|U|} (|S_{X−{x}}(u_i)| − |S_X(u_i)|) / (|U| × |U|).
The overall time complexity for computing a significance is O(|X||U|²), and the space complexity is O(|X||U|).
Definition 4.3. Let X be a non-empty subset of A: ∅ ⊂ X ⊆ A. The set of attributes x ∈ X which are significant in X is called the core of X, denoted by C_X. That is, C_X = {x ∈ X | sig_{X−{x}}(x) > 0}. Also, we define C_∅ = ∅.

Algorithm C. Let S = (U, A) be an incomplete information system and let X be a non-empty subset of A: ∅ ⊂ X ⊆ A. This algorithm obtains the core C_X of X.
C1. Input S = (U, A), X.
C2. Set ∅ → C_X.
C3. For every x ∈ X, compute SIM({x}).
C4. For i = 1 to |X| do: compute sig_{X−{x_i}}(x_i); if sig_{X−{x_i}}(x_i) > 0, then C_X ∪ {x_i} → C_X.
C5. Output C_X.
The time complexity of Algorithm C is O(|X||U|² + |X|²|U|), and its space complexity is O(|X||U|).
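A self-contained Python sketch of Definition 4.2 and Algorithm C follows (the data layout, {object: {attribute: value}} with "*" for null values, is the same illustrative one used earlier).

# Significance (Definition 4.2) and core (Algorithm C).
def tolerance_classes(table, P):
    sim = lambda u, v: all(table[u][a] == table[v][a]
                           or "*" in (table[u][a], table[v][a]) for a in P)
    return {u: {v for v in table if sim(u, v)} for u in table}

def significance(table, X, x):
    # sig_{X-{x}}(x) = sum_i (|S_{X-{x}}(u_i)| - |S_X(u_i)|) / |U|^2.
    full = tolerance_classes(table, X)
    dropped = tolerance_classes(table, [a for a in X if a != x])
    n = len(table)
    return sum(len(dropped[u]) - len(full[u]) for u in table) / (n * n)

def core(table, X):
    # Algorithm C: attributes whose significance in X is positive.
    return {x for x in X if significance(table, X, x) > 0}

T = {"u1": {"a": 1, "b": 1}, "u2": {"a": 1, "b": 2}, "u3": {"a": 2, "b": 2}}
print(core(T, ["a", "b"]))  # both attributes are significant here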
5 Reducts
Definition 5.1. Let S = (U, A) be an incomplete information system. A subset A₀ of A is said to be a reduct of A if A₀ satisfies: (1) U/SIM(A₀) = U/SIM(A), i.e., A₀ ↔ A; and (2) if A′ ⊂ A₀, then U/SIM(A₀) ≠ U/SIM(A′), i.e., if A′ ⊂ A₀, then A′ is not equivalent to A.
From this definition, the time complexity of finding all reducts is exponential. First, we need to consider all |2^A| = 2^{|A|} subsets of A, and for every subset A₀ we need to compute U/SIM(A₀). The time complexity of computing U/SIM(A₀) for one subset A₀ is O(|A||U|²), so the total cost is O(2^{|A|}|A||U|²). We have the following relationship between reducts and the core.
Theorem 5.1. Let S = (U, A) be an incomplete information system. Then C_A = ∩_{i=1}^{s} A_i⁰, where A_1⁰, A_2⁰, ..., A_s⁰ are all the reducts of A.
6 Minimal Reduct
Definition 6.1. Let S = (U, A) be an incomplete information system and C ⊆ A. We define the significance of a ∈ A − C with respect to C as sig_C(a) = sig_{(C∪{a})−{a}}(a).

Algorithm M. Let S = (U, A) be an incomplete information system. Since the core is the common part of all reducts, it can be used as the starting point for computing reducts; the significance of attributes is used to select the attributes to be added to the core. This algorithm finds an approximately minimal reduct.
M1. Input S = (U, A).
M2. Compute U/SIM(A) and C_A = {a ∈ A | sig_{A−{a}}(a) > 0}. Set C_A → C.
M3. Compute U/SIM(C).
M4. While U/SIM(C) ≠ U/SIM(A) do:
  (1) Compute sig_C(a) for every a ∈ A − C.
  (2) Choose a′ ∈ A − C such that sig_C(a′) = max{sig_C(a) | a ∈ A − C}.
  (3) Set C ∪ {a′} → C, and compute U/SIM(C).
Endwhile
M5. Set C′ = C − C_A and |C′| → N. For i = 1 to N do:
  (1) Remove the i-th attribute a_i from C′.
  (2) Compute U/SIM(C′ ∪ C_A).
  (3) If U/SIM(C′ ∪ C_A) ≠ U/SIM(A), then C′ ∪ {a_i} → C′.
Endfor
M6. Set C′ ∪ C_A → C. Output C.
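Under the same assumed data layout, Algorithm M can be sketched as follows, reusing the tolerance_classes, significance and core helpers from the previous sketch.

# A sketch of Algorithm M: grow the core by the most significant
# attribute until U/SIM(C) = U/SIM(A), then prune redundant additions.
def minimal_reduct(table, A):
    classes = lambda P: tolerance_classes(table, list(P))
    target = classes(A)
    C = core(table, A)                        # M2: start from the core
    while classes(C) != target:               # M4: add the best attribute
        best = max(set(A) - C,
                   key=lambda a: significance(table, list(C) + [a], a))
        C = C | {best}
    for a in sorted(C - core(table, A)):      # M5: drop redundant additions
        if classes(C - {a}) == target:
            C = C - {a}
    return C                                  # M6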
7 Conclusions
In this paper, we have extended the rough set data analysis algorithms of [3] to the incomplete information systems of [4]. The time and space complexity of the algorithms have been analyzed. In particular, by using the measure of attribute significance defined above, we have presented a heuristic algorithm for computing
the minimal reduct; the time complexity of this algorithm is O(|A|³|U|²), and its space complexity is O(|A||U|). The importance of the minimal reduct is due to its potential for speeding up the learning process and improving the quality of classification.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 60275019) and the Natural Science Foundation of Shanxi, China.
References

1. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
2. Pawlak, Z., Grzymala-Busse, J.W., Slowinski, R., Ziarko, W.: Rough Sets. Comm. ACM 38 (1995) 89-95
3. Guan, J.W., Bell, D.A.: Rough Computational Methods for Information Systems. Artificial Intelligence 105 (1998) 77-103
4. Kryszkiewicz, M.: Rough Set Approach to Incomplete Information Systems. Information Sciences 112 (1998) 39-49
5. Kryszkiewicz, M.: Rules in Incomplete Information Systems. Information Sciences 113 (1999) 271-292
Inconsistency Classification and Discernibility-Matrix-Based Approaches for Computing an Attribute Core

Dongyi Ye and Zhaojiong Chen

Department of Computer Science and Technology, Fuzhou University, Fuzhou, 350002, P.R. China
[email protected]
Abstract. In this paper, we firstly introduce a concept of inconsistency classification based on which we draw a qualitative conclusion that the approach by Hu and Cercone for computing an attribute core based on Skowron’s discernibility matrix is correct for both consistent and partially inconsistent decision tables, but may fail to work for entirely inconsistent ones. Secondly, we improve the work of Zhi and Miao concerning the computation of core attributes by defining a new binary discernibility matrix. Finally, as another application of inconsistency classification, we show that an attribute core from the algebra view is equivalent to that from the information view not only for consistent but also for partially inconsistent decision tables.
1 Introduction

The attribute core of a decision table plays an important role in the theory of rough sets (see Pawlak and Slowinski [1]). Hu and Cercone [2] once proposed a method for computing core attributes based on Skowron's discernibility matrix, which has been used in many later works. Unfortunately, their approach may lead to false results for some inconsistent decision tables (see Ye and Chen [3]). Hence, it is interesting to ask for what kind of decision tables this approach is valid. In this paper, we first revisit Hu's approach in [2] and its improved version by Ye and Chen [3]. By introducing a concept of inconsistency classification, under which an inconsistent decision table can be further classified as either partially inconsistent or entirely inconsistent, we draw the qualitative conclusion that the approach of Hu and Cercone [2] is correct for both consistent and partially inconsistent decision tables, but may fail to work for entirely inconsistent ones. Secondly, we improve the work in [4] on the computation of core attributes by presenting a new binary discernibility matrix. Finally, as another application of inconsistency classification, we show that the concept of an attribute core from the algebra view is equivalent to that from the information view not only for consistent but also for partially inconsistent decision tables.
2 Inconsistency Classification and Core Attributes

Let us consider a decision table represented as a quadruple

L = (U, A, V, F_q),    (1)

where U = {x_1, ..., x_n} is a non-empty finite set of objects called the universe of discourse, A = C ∪ D is the union of a non-empty finite set of condition attributes C and a non-empty finite set of decision attributes D, V = ∪ V_a is the union of the domains V_a of the attributes a belonging to A, and F_q : U × A → V is an information function assigning attribute values to objects belonging to U, namely, F_q(x, a) ∈ V_a for every a ∈ A and x ∈ U.
When it comes to computing the attribute core of a decision table defined from the algebra view, a discernibility-based matrix is often employed. For instance, Hu and Cercone [2] developed a method for computing the attribute core of a decision table based on Skowron's discernibility matrix, defined as M = {m_ij}_{n×n}, where

m_ij = {a ∈ C : F_q(x_i, a) ≠ F_q(x_j, a)} if x_i, x_j ∈ U and F_q(x_i, D) ≠ F_q(x_j, D); m_ij = 0 otherwise.    (2)

Unfortunately, this method may lead to false results for some inconsistent decision tables. Ye and Chen [3] presented a modified discernibility-matrix-based method leading to the correct computation of an attribute core in any case. Their matrix is defined as M′ = {m′_ij}_{n×n}, where

m′_ij = m_ij if min{d(x_i), d(x_j)} = 1; m′_ij = Φ otherwise,    (3)

where m_ij is as defined in (2) and d(x_i) = card{F_q(y, D) : y ∈ [x_i]_C}. The condition min{d(x_i), d(x_j)} = 1 is of crucial importance for ensuring the correct generation of an attribute core in the presence of inconsistency. With this modified discernibility matrix, Ye and Chen [3] proved the following result.

Theorem 1. a ∈ Core(C) if and only if there exists some m′_ij such that m′_ij = {a}.

However, the role that the above-mentioned condition can play was not fully discussed in [3], and a qualitative answer to the question of when the approach of Hu and Cercone [2] is correct was not given either. We now proceed to deal with this problem. Actually, we can use the condition to classify a decision table as consistent, partially inconsistent, or entirely inconsistent. Obviously, if d(x_i) = 1 for all
x_i ∈ U, then the table under consideration is consistent or deterministic; otherwise it is inconsistent. Now, by "partially inconsistent" we mean a situation where there exists an object, say x_k ∈ U, for which d(x_k) > 1, but min{d(x_i), d(x_j)} = 1 for any pair of objects. By "entirely inconsistent" we mean a situation where there exists at least one pair of objects {x_i, x_j} with
min{d(x_i), d(x_j)} > 1. In other words, we have defined in the foregoing a concept of inconsistency classification, under which an inconsistent decision table can be further categorized as either partially inconsistent or entirely inconsistent. Now, it is easy to check that if a decision table is either consistent or partially inconsistent, then the two matrices M and M′ as defined in (2) and (3) are identical. Hence, we can say in a qualitative way that Hu's method is correct for both consistent and partially inconsistent decision tables, but may fail to work for entirely inconsistent ones, as can be seen from the counterexample in [3].
We now turn to examine a recent work by Zhi and Miao [4], in which a binary discernibility matrix due to Felix and Ushio [5] was used to calculate the attribute core of a decision table. Denoted by M_T = {m((i, j), k)}, the matrix is defined as follows: M_T has a total of m columns, each column of M_T corresponding to a condition attribute and each row to a pair of objects (x_i, x_j) satisfying F_q(x_i, D) ≠ F_q(x_j, D). The entry of M_T at position ((i, j), k) reads as follows:

m((i, j), k) = 1 if F(x_i, C_k) ≠ F(x_j, C_k); m((i, j), k) = 0 otherwise.    (4)

The following is a conclusion given in [4] for identifying core attributes: let M_T be the binary discernibility matrix defined in (4); if there exists a row in which only one entry takes the value 1, then the condition attribute corresponding to the column where this entry is located belongs to the attribute core Core(C). Unfortunately, this conclusion is not always true, and the counterexample presented in [3] is applicable for this purpose, implying that the approach in [4] is, like Hu's approach, not valid for entirely inconsistent tables. Hence, we can follow the same idea as in our previous work [3] to improve the result of [4]. First, we need to modify the definition of the matrix M_T.

Def. 1. The modified binary discernibility matrix of a decision table is defined as M_T* = {m*((i, j), k)} with

m*((i, j), k) = m((i, j), k) if min{d(x_i), d(x_j)} = 1; m*((i, j), k) = 0 otherwise,    (5)

where {m((i, j), k)} is defined by (4).
Let M_T* be a modified binary discernibility matrix as defined in (5). If there exists a row of M_T* in which only one entry takes the value 1, then the condition attribute corresponding to the column where this entry is located is called an M_T*-based key attribute, and the row an M_T*-based key row.
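A Python sketch of the modified matrix (5) and the key-attribute test of Theorem 2 below follows; the decision-table encoding ({object: (condition dictionary, decision)}) is our assumption, and the matrix is scanned pairwise rather than stored.

# Key attributes from the modified binary discernibility matrix M_T*.
def d_value(x, table, C):
    # d(x): number of decision values seen in x's condition class [x]_C.
    row = lambda u: tuple(table[u][0][a] for a in C)
    return len({table[u][1] for u in table if row(u) == row(x)})

def key_attributes(table, C):
    found = set()
    for x in table:
        for y in table:
            if table[x][1] == table[y][1]:
                continue                     # same decision: no row in M_T
            if min(d_value(x, table, C), d_value(y, table, C)) != 1:
                continue                     # the extra condition of (5)
            row = [a for a in C if table[x][0][a] != table[y][0][a]]
            if len(row) == 1:                # a key row: a single 1 entry
                found.add(row[0])
    return found

T = {"x1": ({"a": 0, "b": 0}, 0), "x2": ({"a": 1, "b": 0}, 1)}
print(key_attributes(T, ["a", "b"]))  # {'a'}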
2.
Given
Key (C ) ={set of all
M T*
a
modified
binary
discernibility
matrix M T* .
Let
-based key attributes }, then Core(C ) = Key (C ) .
Proof. Let a = C_{i_1} ∈ Key(C). By definition, there exists at least one row of elements in M_T*, say m*((i, j), k), k = 1, 2, ..., m, such that m*((i, j), i_1) is the unique element taking the value 1, and min{d(x_i), d(x_j)} = 1. It is easy to verify that m′_ij = {a}. By Theorem 1, a ∈ Core(C), i.e., Key(C) ⊆ Core(C). The converse inclusion can be proved similarly by using Theorem 1.
3 Core Attributes from the Algebra View and the Information View

It is known that an attribute core in the algebra view is equivalent to that in the information view for consistent decision tables (see Wang [6]). Moreover, Wang analyzed the relation between the attribute cores in both views for inconsistent decision tables. Given an inconsistent decision table, he showed that Core_A(C) = Core_D(C) ⊆ Core_Q(C) ⊆ Core_H(C), where Core_A(C) denotes the attribute core in the algebra view, Core_Q(C) is the attribute core in the information view, Core_D(C) denotes the attribute core computed by Ye's method [3] based on Theorem 1, and Core_H(C) refers to the attribute core computed by Hu's method [2]. Actually, this conclusion can be made more precise for a class of inconsistent decision tables. We state our result in the following.

Theorem 3. Core_A(C) = Core_Q(C) for any partially inconsistent decision table.

Proof. Note that if a decision table is partially inconsistent, then the two matrices M and M′ as defined in (2) and (3) are identical, and so are Core_H(C) and Core_D(C). It then follows from Wang's result above that Core_A(C) = Core_Q(C).
4 Conclusion

In this paper, we have introduced a concept of inconsistency classification which allows us to analyze when Hu's approach [2] is correct. We have also presented a new binary discernibility matrix improving the work in [4]. Moreover, we have shown the equivalence between the core attributes defined from the algebra and the information viewpoints for partially inconsistent decision tables.
Acknowledgements. This paper was partially supported by the National Science Foundation of China (No. 70071005) and the Key Scientific Research Foundation of the State Education Ministry of China (No. 00185).
References

1. Pawlak, Z., Slowinski, K.: Rough Set Approach to Multi-attribute Decision Analysis. European Journal of Operational Research 72 (1994) 443-459
2. Hu, X., Cercone, N.: Learning in Relational Databases: a Rough Set Approach. Computational Intelligence 2 (1995) 323-337
3. Ye, D.Y., Chen, Z.J.: A New Discernibility Matrix and the Computation of a Core. Acta Electronica Sinica 30 (2002) 1086-1088
4. Zhi, T.Y., Miao, D.Q.: The Binary Discernibility Matrix's Transformation and High Efficiency Attributes Reduction Algorithm's Conformation. Computer Science 29 (2002) 140-142
5. Felix, R., Ushio, T.: Rough Sets Based Machine Learning Using a Binary Discernibility Matrix. IPMM '99 (1999) 299-305
6. Wang, G.Y.: Attribute Core of Decision Table. Lecture Notes in Artificial Intelligence, Vol. 2475. Springer-Verlag, Berlin Heidelberg New York (2002) 213-217
Multi-knowledge Extraction and Application

QingXiang Wu¹ and David Bell²

¹ Faculty of Informatics, University of Ulster at Magee, Londonderry, BT48 7JL, N. Ireland, UK
[email protected]
² Department of Computer Science, Queens University, Belfast, UK
[email protected]
Abstract. Rough set theory provides approaches to finding a reduct (informally, an identifying set of attributes) from a decision system or a training set. In this paper, an algorithm for finding multiple reducts is developed. The algorithm has been used to find the multiple reducts in data sets from the UCI Machine Learning Repository. The experiments show that many databases in the real world have multiple reducts. Using the multiple reducts, multi-knowledge is defined and an approach for its extraction is presented. It is shown that a robot with multi-knowledge has the ability to identify a changing environment. Multi-knowledge can be applied in many areas of machine learning and data mining.
1 Introduction

Conventional approaches to knowledge discovery always try to find a good reduct or to select a set of features [1,2,3,4,5] and then to extract knowledge based on this reduct or feature set. This knowledge is called a single body of knowledge because it is based on only one reduct or feature set. However, if a manager cannot collect complete values for the feature set in some urgent business case, it is difficult to make a decision with this knowledge. The same problem is encountered by a mobile robot identifying its environment when some sensors for detecting the feature set are blotted out or damaged. In another example, if some features of the environment are changed, a robot may make a wrong decision. This paper shows that multi-knowledge extracted from a data set can cope with such cases.
2 Algorithm for Finding Multiple Reducts

2.1 Decision System

Let ⟨U, A ∪ D⟩ represent a decision system, where U = {u1, u2, ..., ui, ..., u_|U|} is a finite non-empty set, called an instance space or universe, and where ui is called an instance in U. A = {a1, a2, a3, ..., ai, ..., a_|A|}, also a finite non-empty set, is a set of
attributes of the instances, where ai is an attribute of a given instance. D is a non-empty set of decision attributes, and A ∩ D = ∅. For every a ∈ A there is a domain, represented by Va, and there is a mapping a(u) : U → Va from U into the domain Va, where a(u) represents the value of attribute a for instance u and is a value in the set Va:

Va = a(U) = {a(u) : u ∈ U} for a ∈ A.    (1)

For a decision system, the domain of the decision attribute is represented by

Vd = d(U) = {d(u) : u ∈ U} for d ∈ D.    (2)

2.2 Rough Set Concepts

Let ⟨U, A ∪ D⟩ represent a decision system and define an object relation R = U × U. An indiscernibility relation [6] is represented by IND(A) = {(ui, uj) ∈ R : a(ui) = a(uj), ∀a ∈ A}. This means that ui and uj are indistinguishable with respect to the attribute set A. The indiscernibility relation partitions U into equivalence classes U/IND(A), each element in U/IND(A) being a set of objects that are indiscernible with respect to A. For simplicity of notation, U/A will be used instead of U/IND(A). Let B ⊆ A and X ⊆ U. The B-lower approximation set of X is defined as

APR⁻_B(X) = ∪{Y ∈ U/B : Y ⊆ X}.    (3)

The B-upper approximation is defined as

APR⁺_B(X) = ∪{Y ∈ U/B : Y ∩ X ≠ ∅}.    (4)

The D-positive region on the equivalence classes U/B is the set of all instances in the B-lower approximation sets for the decision space Vd:

POS_B(Vd) = ∪_{X ∈ U/D} APR⁻_B(X),    (5)

where APR⁻_B(X) denotes the lower approximation of the set X with respect to the equivalence classes U/B. In order to describe the property of the attribute subset B, the certain-accuracy of B is defined as

ζ(B) = |POS_B(Vd)| / |U|,    (6)

where ζ(B) ∈ [0, 1]. If ζ(B) = 1, all the instances can be certainly assigned to the decision space Vd. If 1 > ζ(B) ≥ 0, the value of ζ(B) represents the certain-accuracy of the decision system, i.e. the percentage of instances in the system that can be certainly assigned to the decision space by means of the partition U/B. In order to describe the importance of an attribute a for decision making, the significance of the attribute [7,8], denoted by SIG_a, is defined as

SIG_a = ζ(A) − ζ(A − {a}),    (7)

where A is the attribute set of the decision system, a ∈ A, and ζ(A) is the certain-accuracy defined in equation (6). If SIG_a = 0, attribute a is insignificant for the decision.
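Equations (6) and (7) can be sketched directly in Python; the table encoding ({instance: (attribute dictionary, decision)}) is our illustrative assumption.

# Certain-accuracy (6) and attribute significance (7) for a complete table.
def certain_accuracy(table, B):
    # zeta(B) = |POS_B(V_d)| / |U| (equations 3, 5, 6).
    key = lambda u: tuple(table[u][0][a] for a in B)
    pos = 0
    for u in table:
        block = [v for v in table if key(v) == key(u)]
        if len({table[v][1] for v in block}) == 1:   # [u]_B within one class
            pos += 1
    return pos / len(table)

def sig(table, A, a):
    # SIG_a = zeta(A) - zeta(A - {a}).
    return certain_accuracy(table, A) - certain_accuracy(
        table, [b for b in A if b != a])

T = {1: ({"a": 0}, "yes"), 2: ({"a": 0}, "no"), 3: ({"a": 1}, "yes")}
print(certain_accuracy(T, ["a"]))  # 1/3: only instance 3 is certain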
2.3 Reducts and Multi-reducts for the Decision Space

Let B ⊆ A. If

∃B such that ζ(B) = ζ(A) ∧ ζ(B − {b}) < ζ(A) ∀b ∈ B,    (8)

B is called a reduct for the decision space Vd. In other words, a reduct is an attribute subset that has the same certain-accuracy as the attribute set A. By dropping all insignificant attributes in A, the set of remaining attributes is a reduct. The formal algorithm is as follows.

A2.1 Algorithm for finding a reduct.
1 For i = 1 to |A|
2   SIG_{ai} = ζ(A) − ζ(A − {ai})
3   If SIG_{ai} = 0 then A = A − {ai}
4 End for
5 RED(A) = A

Usually, there are many reducts in a decision system. It is obvious that different reducts can be obtained by changing the order of the attributes in the sequence A, and many strategies can be applied to do so. A simple strategy is proposed as follows. Firstly, a reduct is found by means of A2.1. In order to change the order of attributes in the sequence A, the first attribute of A is moved to the end of the sequence. Then the new sequence A is fed to A2.1 to find a new reduct. This procedure can be repeated |A| times, so that at most |A| reducts are found. The formal algorithm is as follows.

A2.2 Algorithm for finding multi-reducts.
1 For i = 1 to |A|
2   RED_i = find a new reduct (A2.1); if no reduct, exit for
3   If RED_i ⊄ RED then RED = RED + RED_i
4   Move the first attribute to the end of A (i.e. change the order of attributes in A)
5 End for
6 Output: reducts RED = {RED_1, RED_2, ..., RED_N}
7 Output: number of reducts ← |RED|

Applying the algorithm to UCI data sets, the following numbers of reducts are found: Breast-cancer has 19 reducts; Bridges, 10; Crx, 60; Heart, 109; Iris, 4.
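A Python sketch of A2.1 and A2.2 follows, reusing certain_accuracy and sig from the previous sketch; note that the exact-zero test on the significance is kept from the algorithm and would need a tolerance for noisy data.

# A2.1: drop every attribute whose significance is zero.
def find_reduct(table, A):
    A = list(A)
    for a in list(A):
        if sig(table, A, a) == 0:
            A.remove(a)
    return frozenset(A)

# A2.2: rotate the attribute order and re-run A2.1 to collect reducts.
def find_multi_reducts(table, A):
    A, reducts = list(A), set()
    for _ in range(len(A)):
        reducts.add(find_reduct(table, A))
        A = A[1:] + A[:1]          # move the first attribute to the end
    return reducts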
3 Multi-knowledge and Application
Definition. Given a decision system ⟨U, A ∪ D⟩, multi-knowledge is defined as

Φ = {φ_B | B ∈ RED},    (9)

where φ_B is a mapping from the condition space to the decision space and RED is a set of reducts of the decision system.
For example, Table 1 is regarded as an environmental feature decision system for a mobile robot that has passed through room 1 to room 6.

Table 1. Room Feature Decision System

U  Area a1  GroundColor a2  WallColor a3    CeilingHeight a4  Lightness a5  Room d
1  9        Yellow          Yellow          2.6               1             1
2  12       Yellow          White           2.6               2             2
3  9        Blue            White (Orange)  2.6               1             3
4  13       Gray            White           2.8               2             4
5  15       Gray            Yellow          3.0               3             5
6  15       Yellow          White           3.0               3             6
By means of algorithm A2.2, the multi-reducts RED can be found as follows: RED = {{a1, a2}, {a1, a3}, {a2, a5}, {a2, a4}, {a3, a4, a5}}. From these 5 reducts, 5 single bodies of knowledge can be obtained. For example, let the reduct be B = {a1, a3}. The existing condition vector space is {(9, Y), (12, W), (9, W), (13, Y), (15, Y), (15, W)}. Extracting rules in this condition vector space from Table 1 and generalizing them, the single body of knowledge is given by

φ(a1, a3) = Room 1 if (a1, a3) = (9, Y); Room 2 if a1 = 12; Room 3 if (a1, a3) = (9, W); Room 4 if a1 = 13; Room 5 if (a1, a3) = (15, Y); Room 6 if (a1, a3) = (15, W); Unknown for other cases.

φ(a1, a2), φ(a2, a4), φ(a2, a5), and φ(a3, a4, a5) can be obtained by analogy, and this set of knowledge is called the multi-knowledge. If only φ(a1, a3) is applied, φ(a1, a3) answers "unknown" when the WallColor of room 3 has been changed to orange (i.e. (a1, a3) = (9, O)) and the robot enters room 3. The same problem is encountered by other single-reduct-based approaches such as decision trees and neural networks. In order to merge these decisions, the decision support degree, denoted by Sup(di), is defined as

Sup(di) = P(di | di = φ_B) for φ_B ∈ Φ,    (10)

where di ∈ Vd is a decision in the decision space and φ_B is a single body of knowledge among the multi-knowledge Φ. The final decision is made by

d_Final = argmax_{di ∈ Vd} Sup(di).    (11)
When the robot enters room 3 after the room's WallColor has been changed, φ(a1, a3) and φ(a3, a4, a5) answer "unknown", while φ(a1, a2), φ(a2, a4) and φ(a2, a5) give the answer "in room 3". According to the rule, the final decision is "in room 3". It is obvious that multi-knowledge can cope with a changing environment much better than any single body of knowledge.
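The merging rule of (10)-(11) can be read as majority voting over the bodies of knowledge that do not answer "unknown"; the following Python sketch is our reading of it, with hypothetical bodies for the room example.

# Merging the answers of several bodies of knowledge (eqs. 10-11).
from collections import Counter

def final_decision(bodies, observation):
    votes = Counter(phi(observation) for phi in bodies)
    votes.pop(None, None)                  # ignore "unknown" answers
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical bodies for the changed-wall-colour scenario:
phi13 = lambda o: None                     # (9, Orange) is unseen
phi345 = lambda o: None
phi12 = lambda o: "room3"
phi24 = lambda o: "room3"
phi25 = lambda o: "room3"
print(final_decision([phi13, phi12, phi345, phi24, phi25], None))  # room3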
4 Conclusion
Multi-reducts exist in many databases in the real world. The multi-knowledge defined from multi-reducts generalizes well in the knowledge representation domain. For example, a combination [9] of multi-knowledge and a Bayes classifier has been used successfully to improve classification accuracy on data sets from the UCI Machine Learning Repository. In this paper, it has been shown that a robot with multi-knowledge copes with a changing environment much better than a single-body representation approach. Multi-knowledge can be extracted from an information system, and more reliable decisions can be made with multi-knowledge than with a single body of knowledge. Many important properties are not tackled in this paper due to page limits, such as how many features can be changed while the multi-knowledge still guarantees correct answers. These issues will be tackled in other publications.
References

1. Zhong, N., Dong, J.: Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems, vol. 16, Kluwer Academic Publishers, The Netherlands (2001) 199-214
2. Polkowski, L., Tsumoto, S., Lin, T.Y.: Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems. Physica-Verlag, A Springer-Verlag Company (2000)
3. Bell, D., Wang, H.: A formalism for relevance and its application in feature subset selection. Machine Learning, vol. 41, Kluwer Academic Publishers, The Netherlands (2000) 175-195
4. Kohavi, R., Frasca, B.: Useful feature subsets and rough set reducts. International Workshop on Rough Sets and Soft Computing (RSSC) (1994)
5. Lin, T.Y., Cercone, N. (eds.): Rough Sets and Data Mining: Analysis for Imprecise Data. Kluwer Academic, Boston, Mass.; London (1997)
6. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11 (1982) 341-356
7. Bell, D.A., Guan, J.W.: Computational methods for rough classification and discovery. Journal of the American Society for Information Science, Special Topic Issue on Data Mining (1997)
8. Guan, J.W., Bell, D.A.: Rough Computational Methods for Information Systems. Artificial Intelligence 105 (1998) 77-103
9. Wu, Q.X., Bell, D.A., McGinnity, M., Guo, G.: Decision making based on multi-knowledge representation. Proceedings of the ICDM02 Workshop on the Foundation of Data Mining and Discovery, IEEE International Conference on Data Mining (2002)
10. Wu, Q.X., Bell, D.A., et al.: Rough computational methods on reducing cost of computation in Markov localization for mobile robots. Proceedings of the 4th World Congress on Intelligent Control and Automation, Shanghai, IEEE (2002) 1226-1233
Multi-rough Sets Based on Multi-contexts of Attributes
Rolly Intan and Masao Mukaidono
Meiji University, Kawasaki-shi, Kanagawa-ken, Japan
Petra Christian University, Surabaya, Indonesia 60236
Abstract. A rough set deals with crisp granularity of objects given a data table, called an information system, as a pair I = (U, A), where U is a universal set of objects and A is a non-empty finite set of attributes. We may consider A as a set of contexts of attributes, where Ai ∈ A is a set of attributes regarded as a context or background. Consequently, if there are n contexts in A, where A = {A1, . . . , An}, it provides n partitions. A given set of objects, X ⊆ U, may then be represented as n pairs of lower and upper approximations, denoted as the multi-rough sets of X. Some properties and operations are proposed and examined.
1 Introduction
A rough set may be viewed as a generalization of a crisp set: a given crisp set of objects is represented by two subsets derived from a partition on the universal set of objects [2,1]. The two subsets are called a lower approximation and an upper approximation. The partition of objects is generated using a data table called an information system. Formally, an information system is defined by a pair I = (U, A), where U is a universal set of objects and A is a non-empty set of attributes such that for every a ∈ A, a : U → Va. The set Va is the value set of attribute a. In real applications, depending on the context, a given object may have different values of attributes. In other words, we may represent sets of attributes based on different contexts, where they may provide different values for a given object. A context can be viewed as a background or situation in which we need to group some attributes into a subset of attributes and consider that subset as a context. For example, let us consider humans as a universal set of objects. Every person (object) might be characterized by some sets of attributes corresponding to some contexts, such as his or her status as a student, employee, family member, club member, etc. Still using the example of humans as objects, especially for fuzzy data or perception-based data, a set of attributes such as height, weight and age might have different values for a given object depending on the viewpoints (contexts) of Americans, Japanese and so on. Related to rough sets, every context as a subset of attributes provides a partition of objects. Consequently, n contexts (n subsets of attributes) provide n partitions. A given set of objects, X, may then be represented as n pairs of lower and upper approximations, denoted as the multi-rough sets of X. Related to the multi-rough sets, some properties and operations are proposed and examined.
2 Multi-contexts Information System
Multi-rough sets are proposed based on multi-contexts, where every context is considered as a subset of attributes. Partitions of multi-contexts are generated from a multi-contexts information system. Formally, the multi-contexts information system is defined by a pair I = (U, A), where U is a universal set of objects and A is a non-empty set of contexts such that A = {A1, . . . , An}. Ai ∈ A is a set of attributes regarded as a context. Every attribute a ∈ Ai is associated with a set Va of its values, called the domain of a. It is NOT necessary that i ≠ j ⇒ Ai ∩ Aj = ∅. As given in the previous example, attributes such as height and weight might belong to different contexts (i.e., American and Japanese) in which they may provide different values concerning a given object. Therefore, for x ∈ U, a(x)_i ∈ Va is denoted as the value of attribute a for object x in the context Ai. An indiscernibility relation (equivalence relation) is then defined in terms of context Ai by, for x, y ∈ U: R_{Ai}(x, y) ⇔ a(x)_i = a(y)_i, a(x)_i, a(y)_i ∈ Va, ∀a ∈ Ai. The equivalence class of x ∈ U in the context Ai is given by [x]_{Ai} = {y ∈ U | R_{Ai}(x, y)}. It should be verified that for i ≠ j, ∃x ∈ U, [x]_{Ai} ≠ [x]_{Aj}; otherwise Ai and Aj are redundant in the sense of providing identical partitions. After eliminating all redundant contexts, the number of contexts in relation to the number of objects satisfies the following: for |U| = m, |A| ≤ B(m), where $B(m) = \sum_{i=0}^{m-1} C(m-1, i) \times B(i)$, B(0) = 1, and C(n, k) = n!/(k!(n − k)!) is the number of combinations of size k from n elements. From the set of contexts A, a set of partitions of the universal objects is derived and given by {U/A1, . . . , U/An}, where U/Ai, as a partition of the universe based on context Ai, contains all equivalence classes [x]_{Ai}, x ∈ U.
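As a sketch of how the n partitions arise in practice (our own illustration, with hypothetical data), each context simply groups objects by their tuple of attribute values in that context; two contexts are redundant exactly when this grouping coincides:

```python
# Computing the partition U/A_i induced by a context A_i; illustrative only.
from collections import defaultdict

def partition(universe, values, context):
    """values: dict (object, attribute) -> value in this context."""
    classes = defaultdict(set)
    for x in universe:
        classes[tuple(values[(x, a)] for a in context)].add(x)
    return sorted(map(frozenset, classes.values()), key=sorted)

U = ["x1", "x2", "x3"]
vals = {("x1", "height"): "tall", ("x2", "height"): "tall",
        ("x3", "height"): "short"}
print(partition(U, vals, ("height",)))
# [frozenset({'x1', 'x2'}), frozenset({'x3'})]
# Contexts A_i and A_j are redundant iff their partitions compare equal.
```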
3 Multi-rough Sets
Multi-rough sets are defined as an approximate representation of a given crisp set of objects on a set of partitions provided by multi-contexts, in which each element of the multi-rough sets is a pair of lower and upper approximations corresponding to a given context. Formally, multi-rough sets are defined as follows.
Definition 1. Let U be a non-empty universal set of objects. R_{Ai} and U/R_{Ai} are the equivalence relation and partition with respect to the set of attributes in the context Ai. For X ⊆ U, corresponding to a set of contexts A = {A1, A2, . . . , An}, the multi-rough sets X of X are defined by
$$\mathbf{X} = \{(\underline{apr}(X_1), \overline{apr}(X_1)), (\underline{apr}(X_2), \overline{apr}(X_2)), \ldots, (\underline{apr}(X_n), \overline{apr}(X_n))\}.$$
Thus, an element $(\underline{apr}(X_i), \overline{apr}(X_i))$ of the multi-rough sets is a pair of sets, the lower and upper approximations in terms of context Ai. Similarly to rough sets, they are defined by $\underline{apr}(X_i) = \{u \in U \mid [u]_{A_i} \subseteq X\} = \bigcup\{[u]_{A_i} \in U/A_i \mid [u]_{A_i} \subseteq X\}$ and $\overline{apr}(X_i) = \{u \in U \mid [u]_{A_i} \cap X \ne \emptyset\} = \bigcup\{[u]_{A_i} \in U/A_i \mid [u]_{A_i} \cap X \ne \emptyset\}$, respectively. Some basic relations and operations are given and defined as follows. For multi-rough sets X and Y on U:
inclusion: X ⊆ Y ⇔ $(\underline{apr}(X_i) \subseteq \underline{apr}(Y_i), \overline{apr}(X_i) \subseteq \overline{apr}(Y_i))$, ∀i ∈ Nn,
equality: X = Y ⇔ $(\underline{apr}(X_i) = \underline{apr}(Y_i), \overline{apr}(X_i) = \overline{apr}(Y_i))$, ∀i ∈ Nn,
union: X ∪ Y = $\{(\underline{apr}(X_i) \cup \underline{apr}(Y_i), \overline{apr}(X_i) \cup \overline{apr}(Y_i)) \mid \forall i \in N_n\}$,
intersection: X ∩ Y = $\{(\underline{apr}(X_i) \cap \underline{apr}(Y_i), \overline{apr}(X_i) \cap \overline{apr}(Y_i)) \mid \forall i \in N_n\}$,
sum-union: X ⊕ Y = {(M, N) | (M, N) ∈ X or (M, N) ∈ Y},
sum-intersection: X ⊗ Y = {(M, N) | (M, N) ∈ X and (M, N) ∈ Y},
where Nn means the natural numbers less than or equal to n. When union and intersection are applied over all pair elements of the multi-rough sets X, we have:
$$\Gamma(X) = \bigcup_i \overline{apr}(X_i), \quad \Upsilon(X) = \bigcap_i \overline{apr}(X_i), \quad \Phi(X) = \bigcup_i \underline{apr}(X_i), \quad \Psi(X) = \bigcap_i \underline{apr}(X_i), \quad \forall i \in N_n,$$
where $\overline{apr}(X) = \{(\Gamma(X), \Upsilon(X))\}$ and $\underline{apr}(X) = \{(\Phi(X), \Psi(X))\}$ are defined as the summary multi-rough sets, which have only one pair element. Their relationship is easily verified:
Ψ(X) ⊆ Φ(X) ⊆ X ⊆ Υ(X) ⊆ Γ(X),
where we may consider the pair (Φ(X), Υ(X)) as a finer approximation and the pair (Γ(X), Ψ(X)) as a worse approximation of X ⊆ U. From the definition of the summary multi-rough sets, the following properties are satisfied:
(1) X ⊆ Y ⇔ [Ψ(X) ⊆ Ψ(Y), Φ(X) ⊆ Φ(Y), Υ(X) ⊆ Υ(Y), Γ(X) ⊆ Γ(Y)],
(2) Ψ(X) = ¬Γ(¬X), Φ(X) = ¬Υ(¬X), Υ(X) = ¬Φ(¬X), Γ(X) = ¬Ψ(¬X),
(3) Ψ(U) = Φ(U) = Υ(U) = Γ(U) = U, Ψ(∅) = Φ(∅) = Υ(∅) = Γ(∅) = ∅,
(4) Ψ(X ∩ Y) = Ψ(X) ∩ Ψ(Y), Φ(X ∩ Y) = Φ(X) ∩ Φ(Y), Υ(X ∩ Y) ≤ Υ(X) ∩ Υ(Y), Γ(X ∩ Y) ≤ Γ(X) ∩ Γ(Y),
(5) Ψ(X ∪ Y) ≥ Ψ(X) + Ψ(Y) − Ψ(X ∩ Y), Φ(X ∪ Y) ≥ Φ(X) + Φ(Y) − Φ(X ∩ Y), Υ(X ∪ Y) ≤ Υ(X) + Υ(Y) − Υ(X ∩ Y), Γ(X ∪ Y) ≤ Γ(X) + Γ(Y) − Γ(X ∩ Y).
Special consideration is given to the following two characteristics of a context.
1. Ai is called total ignorance (τ) if ∀x ∈ U, [x]_τ = U. Therefore, ∀X ⊂ U, X ≠ ∅ ⇒ $\underline{apr}(X_\tau) = \emptyset$, $\overline{apr}(X_\tau) = U$.
2. Ai is called identity (ι) if ∀x ∈ U, [x]_ι = {x}. Therefore, ∀X ⊆ U ⇒ $\underline{apr}(X_\iota) = \overline{apr}(X_\iota) = X$.
Obviously, in relation to the union and intersection operations, we have the following properties: ∀Ai ∈ A, X ⊆ U,
- Union: X ≠ ∅ ⇒ $\overline{apr}(X_i) \cup \overline{apr}(X_\tau) = U$; $\underline{apr}(X_i) \cup \underline{apr}(X_\tau) = \underline{apr}(X_i)$, $\overline{apr}(X_i) \cup \overline{apr}(X_\iota) = \overline{apr}(X_i)$, $\underline{apr}(X_i) \cup \underline{apr}(X_\iota) = X$.
- Intersection: $\overline{apr}(X_i) \cap \overline{apr}(X_\tau) = \overline{apr}(X_i)$, $\underline{apr}(X_i) \cap \underline{apr}(X_\tau) = \emptyset$, $\overline{apr}(X_i) \cap \overline{apr}(X_\iota) = X$, $\underline{apr}(X_i) \cap \underline{apr}(X_\iota) = \underline{apr}(X_i)$.
From these relations, τ is the identity context for the union operation on lower approximations as well as for the intersection operation on upper approximations. On the other hand, ι is the identity context for the union operation on upper approximations as well as for the intersection operation on lower approximations. Furthermore, in order to characterize multi-rough sets, two count functions are defined as follows:
Definition 2. η_X : U → Nn and σ_X : U → Nn are two functions that characterize multi-rough sets by counting the total number of copies of a given element of U in the upper and lower sides of the multi-rough sets X, respectively.
Related to the summary rough sets, the two count functions η and σ provide some properties, such as, for X, Y ⊆ U, |A| = n:
1. η_X(y) ≥ σ_X(y), ∀y ∈ U,
2. σ_X(y) > 0 ⇒ y ∈ X,
3. y ∈ X ⇒ η_X(y) = n,
4. y ∈ Υ(X) ⇔ η_X(y) = n,
5. y ∈ Ψ(X) ⇔ σ_X(y) = n,
6. η_X(y) > 0 ⇔ y ∈ Γ(X),
7. σ_X(y) > 0 ⇔ y ∈ Φ(X),
8. X ⊆ Y ⇒ η_X(y) ≤ η_Y(y), σ_X(y) ≤ σ_Y(y), ∀y ∈ U,
9. X = Y ⇒ η_X(y) = η_Y(y), σ_X(y) = σ_Y(y), ∀y ∈ U,
10. η_{X∪Y}(y) = η_X(y) + η_Y(y) − η_{X∩Y}(y),
11. σ_{X∪Y}(y) = σ_X(y) + σ_Y(y) − σ_{X∩Y}(y),
12. η_{X⊕Y}(y) = η_X(y) + η_Y(y) − η_{X⊗Y}(y),
13. σ_{X⊕Y}(y) = σ_X(y) + σ_Y(y) − σ_{X⊗Y}(y).
Simply, by dividing the count functions by the total number of contexts (|A| = n), we define two membership functions, µ_X(y) : U → [0, 1] and ν_X(y) : U → [0, 1], by µ_X(y) = η_X(y)/n and ν_X(y) = σ_X(y)/n, where µ_X(y) and ν_X(y) represent the membership values of y in the upper and lower multi-sets of X, respectively. Actually, µ and ν are nothing but another representation of the count functions. However, we may consider the pair (ν_X(y), µ_X(y)) as an interval membership function of y ∈ U in the presence of multi-contexts of attributes. Similarly, by changing n to 1 in Properties 3–5, µ and ν have exactly the same properties as given by η and σ, respectively.
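A small sketch (ours, with made-up per-context approximations) of the count functions and the induced interval membership:

```python
# Count functions eta, sigma and interval membership (nu, mu); illustrative.
def counts(y, lowers, uppers):
    """lowers/uppers: lists of n per-context lower/upper approximations."""
    eta = sum(y in up for up in uppers)     # copies of y on the upper side
    sigma = sum(y in lo for lo in lowers)   # copies of y on the lower side
    n = len(lowers)
    return eta, sigma, sigma / n, eta / n   # (eta, sigma, nu, mu)

lowers = [{"u1"}, {"u1", "u2"}]
uppers = [{"u1", "u2"}, {"u1", "u2", "u3"}]
print(counts("u2", lowers, uppers))         # (2, 1, 0.5, 1.0)
```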
4 Conclusion
This paper proposed multi-rough sets based on multi-contexts of attributes. Basic operations and some properties were examined. Two count functions, as well as their properties, were defined and examined to characterize multi-rough sets. In future work, we need to apply and implement the concept of multi-rough sets in real-world applications.
Acknowledgment. The authors sincerely thank Prof. Yiyu Yao for the discussion and his comments on the draft of this paper.
References
1. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: A Tutorial. In: S.K. Pal and A. Skowron (eds.), Rough Fuzzy Hybridization (1999) 3–98
2. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11 (1982) 341–356
Approaches to Approximation Reducts in Inconsistent Decision Tables
Ju-Sheng Mi, Wei-Zhi Wu, and Wen-Xiu Zhang
Institute for Information and System Sciences, Faculty of Science, Xi’an Jiaotong University, Xi’an, Shaan’xi, 710049, P. R. China
College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei, 050016, P. R. China
Information College, Zhejiang Ocean University, Zhoushan, Zhejiang, 316004, P. R. China
[email protected]
Abstract. In this paper, two new concepts of lower approximation reduction and upper approximation reduction are introduced. A lower approximation reduction is the smallest attribute subset that preserves the lower approximations of all decision classes, and an upper approximation reduction is the smallest attribute subset that preserves the upper approximations of all decision classes. For an inconsistent DT, an upper approximation consistent set must be a lower approximation consistent set, but the converse is not true; for a consistent DT, they are equivalent. After giving their equivalent definitions, we examine the judgement theorem and discernibility matrices associated with the two reducts, from which we obtain approaches to knowledge reduction in inconsistent decision tables.
1 Introduction
The theory of rough sets (RST) [1] is a new mathematical approach to dealing with inexact, uncertain or vague knowledge. It has recently received wide attention in research areas spanning both real-life applications and the theory itself. One fundamental aspect of RST involves a search for particular subsets of condition attributes which provide the same information for classification purposes as the full set of available attributes. Such subsets are called attribute reducts. Many types of knowledge reduction have been proposed in the area of rough sets [2–9], each aimed at some basic requirement, such as preserving a consistent classification. This paper introduces two new kinds of reduction, named lower approximation reduction and upper approximation reduction, in inconsistent decision tables. They are the smallest attribute subsets (in the sense of set inclusion) that preserve the lower (respectively, upper) approximations of all decision classes. The lower and upper assignment reducts are also introduced, which are equivalent to the lower approximation reduction and
Supported by the Natural Science Foundation of China (10271039)
upper approximation reduction, respectively. Finally, we examine the judgement theorem and discernibility matrices associated with the two reducts, from which we can obtain approaches to knowledge reduction in inconsistent decision tables.
2 Information Systems and Rough Sets
An information system is a pair (U, A), where U is a non-empty, finite set of objects called the universe and A is a non-empty, finite set of attributes, such that a : U → Va for any a ∈ A, where Va is called the domain of a. Each non-empty subset B ⊆ A determines an indiscernibility relation R_B = {(x, y) ∈ U × U : a(x) = a(y), ∀a ∈ B}. R_B partitions U into equivalence classes U/R_B = {[x]_B : x ∈ U}. For X ⊆ U, one can characterize X by a pair of lower and upper approximations: $\underline{R}_B(X) = \{x \in U : [x]_B \subseteq X\}$; $\overline{R}_B(X) = \{x \in U : [x]_B \cap X \ne \emptyset\}$. The pair $(\underline{R}_B(X), \overline{R}_B(X))$ is referred to as the Pawlak rough set of X w.r.t. B. A decision table (DT) is an information system (U, A ∪ {d}), where d ∉ A. A is called the condition attribute set, while d is called the decision attribute. If R_A ⊆ R_{d}, then (U, A ∪ {d}) is consistent; otherwise it is inconsistent.
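These definitions translate directly into code; the following minimal sketch (ours, not from the paper) computes the pair of approximations from the B-indiscernibility classes:

```python
# Pawlak lower/upper approximations of X w.r.t. attribute subset B; a sketch.
from collections import defaultdict

def approximations(universe, b_value, X):
    """b_value: function mapping an object to its tuple of values on B."""
    classes = defaultdict(set)
    for x in universe:
        classes[b_value(x)].add(x)
    lower = {x for x in universe if classes[b_value(x)] <= X}
    upper = {x for x in universe if classes[b_value(x)] & X}
    return lower, upper

U = {1, 2, 3, 4}
b_value = lambda x: x % 2               # two classes: odd {1,3}, even {2,4}
print(approximations(U, b_value, {1, 2, 3}))   # ({1, 3}, {1, 2, 3, 4})
```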
3 Concepts of Approximation Reducts
Let (U, A ∪ {d}) be a DT, B ⊆ A. Denote U/R_{d} = {D1, · · · , Dr}. The lower and upper approximation distribution functions are defined as follows:
$$\underline{B}(d) = (\underline{R}_B(D_1), \cdots, \underline{R}_B(D_r)), \qquad \overline{B}(d) = (\overline{R}_B(D_1), \cdots, \overline{R}_B(D_r)).$$
Definition 3.1. Let A = (U, A ∪ {d}) be a DT, B ⊆ A.
(1) If $\underline{B}(d) = \underline{A}(d)$, we say that B is a lower approximation consistent set of A. If B is a lower approximation consistent set and no proper subset of B is lower approximation consistent, then B is called a lower approximation reduct of A.
(2) If $\overline{B}(d) = \overline{A}(d)$, we say that B is an upper approximation consistent set of A. If B is an upper approximation consistent set and no proper subset of B is upper approximation consistent, then B is referred to as an upper approximation reduct of A.
It is easy to prove that an upper approximation consistent set must be a lower approximation consistent set, but the converse is not true for an inconsistent DT (see Example 4.1). For a consistent DT, we can prove the following result.
Theorem 3.1. Let (U, A ∪ {d}) be a consistent DT, B ⊆ A. Then B is a lower approximation consistent set iff B is an upper approximation consistent set.
Theorem 3.2.
Let (U, A ∪ {d}) be a DT, B ⊆ A. For x ∈ U, denote
$$\sigma_B(x) = \begin{cases} D_j, & \text{if } \exists j \le r \text{ such that } x \in \underline{R}_B(D_j), \\ \emptyset, & \text{otherwise,} \end{cases} \qquad \delta_B(x) = \{D_j : x \in \overline{R}_B(D_j)\}.$$
Then
(1) B is a lower approximation consistent set iff σ_B(x) = σ_A(x), ∀x ∈ U.
(2) B is an upper approximation consistent set iff δ_B(x) = δ_A(x), ∀x ∈ U.
4 Approaches to Approximation Reducts
Theorem 4.1. Let (U, A ∪ {d}) be a DT, B ⊆ A; then
(1) B is a lower approximation consistent set iff for all x, y ∈ U such that σ_A(x) ≠ σ_A(y), [x]_B ∩ [y]_B = ∅.
(2) B is an upper approximation consistent set iff for all x, y ∈ U such that δ_A(x) ≠ δ_A(y), [x]_B ∩ [y]_B = ∅.
Proof. (1) “⇒” If [x]_B ∩ [y]_B ≠ ∅, then [x]_B = [y]_B. By Theorem 3.2 we have σ_B(x) = σ_B(y). The assumption that B is a lower approximation consistent set implies σ_B(x) = σ_A(x) and σ_B(y) = σ_A(y). Therefore σ_A(x) = σ_A(y).
“⇐” For any x ∈ U, since R_A ⊆ R_B, we have [x]_B = ∪{[y]_A : y ∈ [x]_B}. If σ_A(x) = ∅, then by [x]_A ⊆ [x]_B we have σ_B(x) = ∅. If σ_A(x) = D_j for some j ≤ r, then [x]_A ⊆ D_j. ∀y ∈ [x]_B, since [x]_B = [y]_B, we conclude from the assumption that σ_A(x) = σ_A(y). Thus [y]_A ⊆ D_j, and therefore y ∈ D_j. Thus [x]_B ⊆ D_j, that is, σ_B(x) = D_j. Therefore σ_A(x) = σ_B(x), ∀x ∈ U. By Theorem 3.2 we conclude that B is a lower approximation consistent set.
(2) The proof is similar to that of (1).
Theorem 4.1 provides an approach for judging whether a subset of attributes is consistent. We can further obtain practical approaches to knowledge reduction in inconsistent systems. We first give the following notions.
Definition 4.1.
Let (U, A ∪ {d}) be a DT, U/RA = {C1 , · · · , Cm }. We denote
$$D_1^* = \{([x]_A, [y]_A) : \sigma_A(x) \ne \sigma_A(y)\}, \qquad D_2^* = \{([x]_A, [y]_A) : \delta_A(x) \ne \delta_A(y)\},$$
where ([x]_A, [y]_A) is identified with ([y]_A, [x]_A) in D_l*, l = 1, 2. Denote by a_k(C_i) the value of a_k w.r.t. the objects in C_i. Let
$$D_l(C_i, C_j) = \begin{cases} \{a_k \in A : a_k(C_i) \ne a_k(C_j)\}, & (C_i, C_j) \in D_l^*, \\ A, & (C_i, C_j) \notin D_l^*, \end{cases} \quad (l = 1, 2);$$
then D_l(C_i, C_j), (l = 1, 2), are called the lower and upper approximation discernibility attribute sets, respectively, and D_l = (D_l(C_i, C_j), i, j ≤ m), (l = 1, 2), are called the lower and upper approximation discernibility matrices, respectively.
Theorem 4.2. Let (U, A ∪ {d}) be a DT, B ⊆ A; then
(1) B is a lower approximation consistent set iff B ∩ D_1(C_i, C_j) ≠ ∅ for all (C_i, C_j) ∈ D_1*.
(2) B is an upper approximation consistent set iff B ∩ D_2(C_i, C_j) ≠ ∅ for all (C_i, C_j) ∈ D_2*.
Proof. (1) Suppose B is a lower approximation consistent set. ∀(C_i, C_j) ∈ D_1*, we can find x, y ∈ U such that C_i = [x]_A and C_j = [y]_A; then σ_A(x) ≠ σ_A(y).
We obtain by Theorem 4.1 (1) that [x]_B ∩ [y]_B = ∅. Thus there exists a_k ∈ B such that a_k(x) ≠ a_k(y), i.e., a_k(C_i) ≠ a_k(C_j), which implies a_k ∈ D_1(C_i, C_j), and then B ∩ D_1(C_i, C_j) ≠ ∅. Conversely, if there exists (C_i, C_j) ∈ D_1* such that B ∩ D_1(C_i, C_j) = ∅, we can select x, y ∈ U satisfying C_i = [x]_A, C_j = [y]_A. It should be noted that σ_A(x) ≠ σ_A(y); then for any a_k ∈ B, we have a_k ∉ D_1(C_i, C_j), and therefore a_k(C_i) = a_k(C_j). Consequently a_k(x) = a_k(y) for all a_k ∈ B, which implies [x]_B = [y]_B. Thus by Theorem 4.1 (1) we conclude that B is not a lower approximation consistent set.
(2) It is similar to the proof of (1).
Theorem 4.2 provides an approach to approximation reduction in inconsistent systems. It also applies to consistent DTs.
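The hitting-set reading of Theorem 4.2 can be sketched as follows (our illustration; σ_A is supplied per A-class, and the data are hypothetical):

```python
# Theorem 4.2 test: B is lower approximation consistent iff B hits every
# discernibility set D_1(C_i, C_j) with (C_i, C_j) in D_1*; a sketch.
from itertools import combinations

def is_lower_consistent(B, classes, value, sigma):
    """classes: list of A-classes; value(a, C): value of attribute a on
    class C; sigma: dict C -> sigma_A of C (None models the empty set)."""
    for Ci, Cj in combinations(classes, 2):
        if sigma[Ci] != sigma[Cj]:                       # (Ci, Cj) in D_1*
            if not any(value(a, Ci) != value(a, Cj) for a in B):
                return False                             # B misses D_1(Ci, Cj)
    return True

vals = {("a", "C1"): 0, ("b", "C1"): 0,
        ("a", "C2"): 0, ("b", "C2"): 1,
        ("a", "C3"): 1, ("b", "C3"): 1}
sigma = {"C1": "D1", "C2": None, "C3": "D2"}
value = lambda a, C: vals[(a, C)]
print(is_lower_consistent({"a"}, ["C1", "C2", "C3"], value, sigma))       # False
print(is_lower_consistent({"a", "b"}, ["C1", "C2", "C3"], value, sigma))  # True
```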
5 Conclusions
Attribute reduction is needed to simplify a decision table, and many types of attribute reduction have been proposed based on rough set theory. This paper has introduced new kinds of knowledge reduction, named lower and upper approximation reducts, which preserve the lower and upper approximations of all decision classes, respectively. The judgement theorem and discernibility matrices associated with the two kinds of reduction are obtained. Thus we have provided new approaches to knowledge reduction in inconsistent systems. Though the information systems we discussed here are inconsistent, they are all complete. Since incomplete information systems are more complicated than complete ones, further research on knowledge reduction for different requirements in incomplete information systems is needed.
References
1. Pawlak, Z.: Rough sets. International Journal of Computer and Information Science 11 (1982) 341–356
2. Zhang, W.-X., Wu, W.-Z., Liang, J.-Y., Li, D.-Y.: Theory and Method of Rough Sets. Science Press, Beijing (2001)
3. Slezak, D.: Searching for dynamic reducts in inconsistent decision tables. In: Proceedings of IPMU’98, Paris, France, Vol. 2 (1998) 1362–1369
4. Slezak, D.: Approximate reducts in decision tables. In: Proceedings of IPMU’96, Granada, Spain, Vol. 3 (1996) 1159–1164
5. Zhang, W.-X., Mi, J.-S., Wu, W.-Z.: Approaches to knowledge reducts in inconsistent systems. Chinese Journal of Computers 25 (2003) (in press)
6. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46 (1993) 39–59
7. Beynon, M.: Reducts within the variable precision rough sets model: A further investigation. European Journal of Operational Research 134 (2001) 592–605
8. Kryszkiewicz, M.: Comparative study of alternative types of knowledge reduction in inconsistent systems. International Journal of Intelligent Systems 16 (2001) 105–120
9. Mi, J.-S., Zhang, W.-X., Wu, W.-Z.: Optimal decision rules based on inclusion degree theory. In: IEEE Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, Beijing, China (2002) 1223–1226
Degree of Dependency and Quality of Classification in the Extended Variable Precision Rough Sets Model
Malcolm J. Beynon
Cardiff Business School, Cardiff University, Colum Drive, Cardiff, CF10 3EU, Wales, UK
[email protected]
Abstract. In this paper an investigation of the use of the degree of dependency and quality of classification measures in the extended variable precision rough sets model is undertaken. The use of (l, u)-graphs enables these measures to aid in the classification of objects to a number of categories for a choice of l and u values, and in the selection of a (l, u)-reduct.
1 Introduction
Katzberg and Ziarko [3] introduced the extended variable precision rough sets model (denoted VPRSl,u) to incorporate asymmetric bounds (l and u) on the levels of misclassification of an object to a decision class or its complement. The (l, u)-graph was introduced [1] to elucidate the levels of degree of dependency and quality of classification in VPRSl,u. In this paper, the (l, u)-graphs enable the identification of combinations of l and u which allow equitable dependency of the decision classes. An iterative approach to identifying a (l, u)-reduct is explored. An example problem with three decision classes is considered.
Within a decision table there exists a set of objects (U), each characterized and classified by sets of condition (C) and decision (D) attributes, respectively. Using C and D, certain equivalence classes E(C) and E(D) are constructed, from which three approximation regions are utilised. In VPRSl,u, with Z ⊆ U and P ⊆ C, the u-positive and l-negative regions are defined by:
$$POS_u(Z) = \bigcup\{X_i \in E(P) : \Pr(Z \mid X_i) \ge u\}, \qquad NEG_l(Z) = \bigcup\{X_i \in E(P) : \Pr(Z \mid X_i) \le l\},$$
where in each region Pr(Z | X_i) is a conditional probability estimate of Z given X_i ∈ E(P). A member of E(P) is in the (l, u)-boundary region, defined by
$$BNR_{l,u}(Z) = \bigcup\{X_i \in E(P) : l < \Pr(Z \mid X_i) < u\}.$$
That is, for X_i ∈ E(P), an X_i ∈ BNR_{l,u}(Z) cannot be classified to Z or the complement of Z with an acceptable level of misclassification. Katzberg and Ziarko [3] introduced the (l, u)-degree of dependency, (l, u)-DoD (δ^{l,u}(P, D_j)), which represents the proportion of those objects in U which can be uniquely classified with not lower than
1 − l likelihood of occurrence of D_j or not lower than u likelihood of occurrence of its complement U − D_j. A formulaic expression for the (l, u)-DoD is given by
$$\delta^{l,u}(P, D_j) = \frac{\mathrm{card}(POS_{1-l}(D_j)) + \mathrm{card}(POS_u(U - D_j))}{\mathrm{card}(U)}.$$
With respect to all the objects in U, the (l, u)-quality of classification, (l, u)-QoC (γ^{l,u}(P, D)), represents the proportion of objects which are classified to single decision classes D_j and is given by
$$\gamma^{l,u}(P, D) = \frac{\mathrm{card}\big(\bigcup_{D_j \in E(D)} POS_u(D_j)\big)}{\mathrm{card}(U)}.$$
The γ^{l,u}(P, D) measure, subject to the l and u values, does not include those objects which cannot be included in the respective u-positive regions. This definition is analogous to previous definitions of quality of classification [4]. In the next section these (l, u)-DoD and (l, u)-QoC measures are utilised on a subset of the well-known wine data set described in [2]. To aid in the choice of l and u, (l, u)-graphs are constructed, following [1].
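As a minimal sketch (ours, not Beynon's code) of how the two measures are evaluated for given l and u, with hypothetical classes:

```python
# (l, u)-DoD and (l, u)-QoC from condition classes E(P); illustrative sketch.
def pos(classes, Z, u):
    """u-positive region: union of classes X_i with Pr(Z | X_i) >= u."""
    region = set()
    for X in classes:
        if len(X & Z) / len(X) >= u:
            region |= X
    return region

def dod(classes, U, Z, l, u):
    return (len(pos(classes, Z, 1 - l)) + len(pos(classes, U - Z, u))) / len(U)

def qoc(classes, U, decision_classes, u):
    covered = set()
    for D in decision_classes:
        covered |= pos(classes, D, u)
    return len(covered) / len(U)

U = set(range(10))
E_P = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9}]      # hypothetical E(P)
D1 = {0, 1, 2, 4}
print(dod(E_P, U, D1, l=0.2, u=0.8))            # 0.3
print(qoc(E_P, U, [D1, U - D1], u=0.8))         # 0.3
```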
0 1 0 0.8571
11/40 0.75 19/40 0.6667 u
l
0.6667 0.75 0.85711 0 1 1 13/40 0.8889 0.8571 1 32/40 20/40 25/40 1 28/40 21/40 33/40 1
31/40
v
12/40
0.4167 l 0.8571 0.8571 1 1 24/40 1 31/40 1
28/40
33/40
v
u 0.4167 1
0.2222 1
δ (C,D1)
δ (C,D2)
l ,u
l ,u
0 0 0 0.25 0.2857 l 0.8889 0.9167 1 0 1 1 1 0.9167 4/40 19/40 0.9167 1 0.8889 0.8889 28/40 0.8571 16/40 0.75 1 12/40 31/40 0.6667 24/40 25/40 v u u δ0.6,0.9(C,D3 ) 0.4167 33/40
0.2857 0.25 32/40 0
v
0.2857 0.25 0.2222
1 1
l
δl,u(C,D3)
0
Fig. 1. (l, u)-DoD graphs for the decision classes D1, D2 and D3 and when equitable
1
2 (l, u)-DoD and (l, u)-QoC Graphs
The (l, u)-DoD measure δ^{l,u}(C, D_j) is a value associated with each decision class in a problem. With three decision classes in the wine problem, three associated (l, u)-DoD graphs are constructed; see Fig. 1. Each (l, u)-DoD graph consists of regions with different levels of the δ^{l,u}(C, D_j) value. Without any knowledge of the dominance of one decision attribute over another, no degree of dependency for one decision class should be considered more favourable. As such, regions where the levels of degree of dependency are equal indicate a choice of l and u which favours no one decision class; see Fig. 1 (bottom right).
The next elucidation concerns the quality of classification measure: in Fig. 2 the (l, u)-QoC graph using C is shown. Importantly, the boundary lines partitioning the different regions are all horizontal, since only the u value affects the level of classification of condition classes to single decision classes. Katzberg and Ziarko [3] utilised the (l, u)-DoD measure to aid in the identification of the associated (l, u)-reducts; here the quality of classification measure γ^{l,u}(C, D) is used.
[Fig. 2. (l, u)-QoC graph with C = {c1, c2, c3}.]
In Fig. 3, a series of (l, u)-QoC graphs is shown for certain subsets of the condition attributes c1, c2 and c3. In each of the two (l, u)-QoC graphs in Fig. 3, the shaded regions show the combinations of l and u values for which the subset of condition attributes is a (l, u)-reduct for this wine problem. The respective PoAQoC(P) values (from [1]) for the (l, u)-QoC graphs which would have been constructed from all possible subsets of C are given in Table 1.
Table 1. PoAQoC(P) values for P ⊆ C
Subset P     {c1}    {c2}    {c3}    {c1, c2}  {c1, c3}  {c2, c3}
PoAQoC(P)    0.2304  0.5463  0.2722  0.6694    0.3403    0.5724
[Fig. 3. (l, u)-QoC graphs for subsets of C: {c2} and {c1, c2}.]
The PoAQoC(P) values in Table 1 can be used iteratively to identify possible (l, u)-reducts to be selected. Starting with a singleton P ⊆ C, {c2} has the largest PoAQoC({c2}) value, and for two attributes {c1, c2} with PoAQoC({c1, c2}) = 0.6694. Then, depending on the specific size of (l, u)-reduct required or the level of PoAQoC(·) achieved, a single (l, u)-reduct would be identified.
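A sketch of this iterative selection (ours), driven directly by the published PoAQoC values from Table 1 (the PoAQoC measure itself is defined in [1]):

```python
# Greedy growth of candidate (l, u)-reducts by PoAQoC; illustrative sketch.
poa = {frozenset({"c1"}): 0.2304, frozenset({"c2"}): 0.5463,
       frozenset({"c3"}): 0.2722, frozenset({"c1", "c2"}): 0.6694,
       frozenset({"c1", "c3"}): 0.3403, frozenset({"c2", "c3"}): 0.5724}

def grow(poa, size):
    """At each cardinality keep the best superset of the previous pick."""
    best = None
    for k in range(1, size + 1):
        layer = [s for s in poa
                 if len(s) == k and (best is None or best < s)]
        best = max(layer, key=poa.get)
    return best, poa[best]

print(grow(poa, 2))   # (frozenset({'c1', 'c2'}), 0.6694)
```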
3 Conclusion
This paper has further investigated the notion of (l, u)-graphs in the extended variable precision rough sets model (VPRSl,u). In particular, the (l, u)-graphs for the degree of dependency and quality of classification measures are exposited. Using a part of the wine data set, which classifies objects to one of three decision classes, the notion of equitable dependency is highlighted. The selection of a (l, u)-reduct is also considered.
References
1. Beynon, M.: Investigating the choice of l and u values in the extended variable precision rough sets model. In: Rough Sets and Current Trends in Computing, RSCTC 2002 (2002) 61–68
2. Beynon, M.: Introduction and elucidation of quality of sagacity in the extended VPRS model. In: RSKD 2003, Warsaw, Poland (2003)
3. Katzberg, J.D., Ziarko, W.: Variable precision extension of rough sets. Fundamenta Informaticae 27 (1996) 155–168
4. Ziarko, W.: Variable precision rough sets model. Journal of Computer and System Sciences 46 (1993) 39–59
Approximate Reducts of an Information System
Tien-Fang Kuo and Yasutoshi Yajima
Department of Industrial Engineering and Management, Tokyo Institute of Technology, 2-12-1, O-okayama, Meguro-ku, Tokyo 152-8552, Japan
[email protected]
Abstract. Rough sets are a tool for data mining. From an information system, we find reducts to generate decision rules for classification. However, when an information system contains noise, reducts may become meaningless or inappropriate for classification. In this paper, we propose some indices for finding approximate reducts. To obtain indices that measure subsets of attributes, we introduce the contingency matrix, based on the number of objects in each class of the information system. The main advantage of using the contingency matrix is that it has some good properties for finding approximate reducts.
1 Introduction
Rough set theory deals with a data set, called an information system, composed of a set of objects U and a set of attributes A. Some information systems can be designed as decision tables if A is divided into two disjoint sets: a set of condition attributes C and a set of decision attributes D. From a decision table, decision rules can be derived. Let C* = {X1, · · · , Xm} and D* = {Y1, · · · , Yn} be the families of equivalence classes of U with respect to C and D, respectively. We can extract relative reducts to find decision rules for classification. However, there are some restrictions on finding relative reducts. For example, if a database contains noise, taking the noise into account leads to contradictory derivations, and if there are many patterns of inconsistent objects, the representation given by a relative reduct becomes meaningless. Ziarko et al. developed the β-reduct in the variable precision rough set model [4]. Ohrn et al. [3] developed the r-approximate reduct via hitting sets; however, its time complexity is high. In this paper we develop several indices to measure C and define approximate reducts with these indices. To compute these indices with lower time complexity, we introduce the contingency matrix, which helps to construct efficient algorithms related to the computation of the indices we use.
2 The Contingency Matrix
To define approximate reducts, we have to develop some indices to measure C. To compute these indices easily and quickly, we transfer
the decision table to a contingency matrix according to the number of objects in each X_i ∈ C*. The contingency matrix M(C) is

            X1    X2   ...   Xm
      Y1   x11   x12   ...   x1m
       :     :     :    .     :
      Yn   xn1   xn2   ...   xnm
where the element x_ij is the number of objects in X_j ∩ Y_i, and the numbers of columns and rows are m and n, respectively. Also, the time for generating a contingency matrix is O(|U| log |U|). From the contingency matrix, we can treat the columns and rows as vectors when formulating the indices we develop. Here, we denote by M_j(C) and M_i(C) the jth row and the ith column vector of M(C), respectively, and by N_i(C) and p_i(C) the total number of objects and the precision rate in the ith column. For each subset of C, the number of objects in each row is fixed, and we denote it by N_j.
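A small sketch (ours) of building M(C) from a decision table; grouping with a counter replaces the sort-based O(|U| log |U|) construction:

```python
# Building the contingency matrix M(C); rows are decision classes Y_j,
# columns are condition classes X_i. Illustrative sketch with toy data.
from collections import Counter

def contingency(objects, cond_value, dec_value):
    """cond_value/dec_value: functions object -> class label."""
    counts = Counter((dec_value(o), cond_value(o)) for o in objects)
    rows = sorted({d for d, _ in counts})
    cols = sorted({c for _, c in counts})
    return [[counts[(d, c)] for c in cols] for d in rows], rows, cols

objs = ["o1", "o1b", "o2", "o3"]
cond = {"o1": "X1", "o1b": "X1", "o2": "X2", "o3": "X2"}.get
dec = {"o1": "Y1", "o1b": "Y1", "o2": "Y1", "o3": "Y2"}.get
M, rows, cols = contingency(objs, cond, dec)
print(M)   # [[2, 1], [0, 1]]
```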
3 Indices from Row and Column Vectors
We develop some indices from the contingency matrix to measure a set of attributes. From the row vectors, we get the number of objects in each class of D* and can compute the number of distinguishable pairs: choosing two objects from different classes of D* as a pair, we can ask whether the two objects are distinguished by their values on the condition attributes. We therefore define the index γ(C) from the row vectors to measure C. From the column vectors, we get the number of objects in each class of C* and can compute the precision rate within it. We adopt the node impurity measures for two-class classification in classification trees to measure each class of C*, defining the indices δ(C) and ε(C), computed from the column vectors by the Gini index and entropy, respectively:
$$\gamma(C) = \sum_{1 \le i < j \le n} \frac{M_i(C)\, M_j(C)^T}{N_i \cdot N_j} \cdot \frac{N_i \cdot N_j}{\sum_{1 \le k < l \le n} N_k \cdot N_l},$$
$$\delta(C) = \sum_{i=1}^{m} \frac{N_i(C)}{|U|}\, p_i (1 - p_i), \qquad \varepsilon(C) = -\sum_{i=1}^{m} \frac{N_i(C)}{|U|}\, p_i \log p_i.$$
Ohrn et al. developed the r-approximate reduct via hitting sets, and we find that the index γ is the same as the parameter r.
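The following sketch computes the three raw indices from M(C); since the printed formula for γ is partly illegible, the pairwise-indistinguishability form below is our reading of it rather than the authors' verbatim definition:

```python
# gamma, delta (Gini) and epsilon (entropy) from a contingency matrix; a sketch.
from itertools import combinations
from math import log

def indices(M):
    """M: contingency matrix as a list of rows (one row per decision class)."""
    n = len(M)
    row_tot = [sum(r) for r in M]
    col_tot = [sum(c) for c in zip(*M)]
    total = sum(row_tot)
    # gamma: weighted fraction of indistinguishable cross-class pairs
    pair_norm = sum(row_tot[k] * row_tot[l]
                    for k, l in combinations(range(n), 2))
    gamma = sum(sum(a * b for a, b in zip(M[i], M[j])) / pair_norm
                for i, j in combinations(range(n), 2))
    # delta (Gini) and epsilon (entropy) from column precision rates p_i
    delta = epsilon = 0.0
    for i, N in enumerate(col_tot):
        p = max(row[i] for row in M) / N
        delta += (N / total) * p * (1 - p)
        epsilon += -(N / total) * p * log(p)
    return gamma, delta, epsilon

print(indices([[2, 1], [0, 1]]))   # (0.333..., 0.125, 0.173...)
```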
4 Approximate Reducts
We can compute indices from the contingency matrix to measure C. Using some monotonic mapping functions, we can scale these indices into the interval [0, 1]. After scaling, we may use the scaled indices to define approximate reducts. For
example, C′ ⊂ C is a γ_{0.9}-approximate reduct of C iff the scaled γ(C′) ≥ 0.9 and, for each C″ ⊂ C′, the scaled γ(C″) < 0.9. Moreover, if these indices are monotonic, we can find an approximate reduct by elimination. An index I is monotonic if C^{(n)} ⊂ · · · ⊂ C′ ⊂ C implies I(C^{(n)}) ≥ · · · ≥ I(C′) ≥ I(C).
Proposition 1. The indices γ, δ, and ε are monotonic.
Proof. If C′ ⊂ C, each column vector of M(C′) is the sum of some column vectors of M(C). To simplify the proof, we consider a contingency matrix with two decision classes and reduce the summation step to the sum of two column vectors. Therefore, we assume that for C′ ⊂ C there are two column vectors (a, c)^T and (b, d)^T in M(C) and a column vector (a+b, c+d)^T in M(C′). For the index γ, it is obvious that ac + bd ≤ (a + b)(c + d); therefore γ(C′) ≥ γ(C). For the indices δ and ε, the functions of the Gini index (p(1 − p)) and entropy (−p log p) are concave and decreasing on the relevant range. The precision rates of the two columns of M(C) are p_1 = max(a, c)/(a + c) and p_2 = max(b, d)/(b + d), and the precision rate of the column of M(C′) is max(a + b, c + d)/(a + b + c + d). Then
$$\varepsilon(C) = \frac{a+b+c+d}{|U|}\left(\frac{a+c}{a+b+c+d}(-p_1 \log p_1) + \frac{b+d}{a+b+c+d}(-p_2 \log p_2)\right)$$
$$\le \frac{a+b+c+d}{|U|}\left(-\frac{\max(a,c)+\max(b,d)}{a+b+c+d}\log\frac{\max(a,c)+\max(b,d)}{a+b+c+d}\right)$$
$$\le \frac{a+b+c+d}{|U|}\left(-\frac{\max(a+b,c+d)}{a+b+c+d}\log\frac{\max(a+b,c+d)}{a+b+c+d}\right) = \varepsilon(C').$$
Here the first inequality is by concavity, and the second by monotonic decrease. The index δ can be proved in the same way. Q.E.D.

Table 1. Data set
objects c1 c3 c4 c5 c8 d count   objects c1 c3 c4 c5 c8 d count
o1      1  1  1  1  0  2  3      o8      0  0  1  1  0  2  1
o2      1  1  0  1  0  2  4      o9      1  1  1  1  0  3  5
o3      1  1  1  1  1  2  3      o10     1  1  0  1  0  3  2
o4      1  0  0  1  0  2  1      o11     1  1  1  0  0  3  8
o5      0  1  1  0  0  2  1      o12     1  1  0  0  1  3  1
o6      0  1  0  0  0  2  1      o13     1  1  0  0  0  3  4
o7      0  1  0  1  0  2  5      o14     1  1  1  0  1  3  1

5 Experiments
To further illustrate the findings in this study, we consider an example. M. Beynon discretized a wine data set from the UCI Machine Learning Repository through median clustering using the Lance-Williams flexible method [1]. From this data set, 40 objects of decision classes 2 and 3 were selected. Table 1 is the data set we use, and Table 2 shows the scaled indices. By choosing an appropriate threshold, we can select an approximate reduct.
Table 2. Scaled indices
attributes   γ      δ      ε        attributes      γ      δ      ε
c1        0.4468 0.4093 0.5465      c1,c3,c4     0.7633 0.4794 0.6105
c3        0.1117 0.0889 0.2103      c1,c3,c5     0.8750 0.8663 0.8756
c4        0.5585 0.0931 0.2766      c1,c3,c8     0.6914 0.5424 0.6724
c5        0.6702 0.4847 0.6500      c1,c4,c5     0.9547 0.8604 0.8745
c8        0.2367 0.0171 0.1466      c1,c4,c8     0.8696 0.6013 0.6995
c1,c3     0.5027 0.4747 0.6083      c1,c5,c8     0.9121 0.9229 0.9017
c1,c4     0.7447 0.4246 0.5536      c3,c4,c5     0.8962 0.5748 0.6898
c1,c5     0.8563 0.8428 0.8656      c3,c4,c8     0.7553 0.3592 0.5271
c1,c8     0.6409 0.4670 0.6057      c3,c5,c8     0.7739 0.5689 0.6843
c3,c4     0.6143 0.1790 0.3473      c4,c5,c8     0.9229 0.6514 0.7558
c3,c5     0.7074 0.5118 0.6083      c1,c3,c4,c5  0.9601 0.8746 0.8801
c3,c8     0.3378 0.1090 0.2509      c1,c3,c4,c8  0.8856 0.6496 0.7528
c4,c5     0.8777 0.5418 0.6757      c1,c3,c5,c8  0.9308 0.9570 0.9181
c4,c8     0.7048 0.2585 0.4497      c1,c4,c5,c8  0.9946 0.9859 0.9944
c5,c8     0.7366 0.5330 0.6690      c3,c4,c5,c8  0.9414 0.7067 0.8120

6 Conclusion
We introduced the contingency matrix for obtaining measures of C. Once this matrix is obtained, indices can be computed from its row and column vectors, and by comparison with a chosen threshold we may define approximate reducts. The advantage of using the contingency matrix is that we can prove these indices are monotonic, so approximate reducts can be obtained by the method of elimination. By taking the precision rate into account, the approximate reduct is more reasonable and appropriate for representing the original set.
References
1. Beynon, M.: Reducts within the variable precision rough sets model: a further investigation. European Journal of Operational Research 134 (2001) 592–605
2. Cios, K., Pedrycz, W., Swiniarski, R.: Rough sets. In: Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers, 27–72
3. Vinterbo, S., Ohrn, A.: Minimal approximate hitting sets and rule templates. International Journal of Approximate Reasoning 25 (2000) 123–143
4. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46 (1993) 39–59
A Rough Set Methodology to Support Learner Self-Assessment in Web-Based Distance Education
Hongyan Geng and Brien Maguire
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
{hgeng, rbm}@cs.uregina.ca
Abstract. With the prevalence and explosive growth of distance education via the World Wide Web, many efforts are dedicated to making distance education more effective. We present a rough set model to provide an instrument for learner self-assessment when taking courses delivered via the World Wide Web. The rough-set-based inductive learning algorithm generates definite and probabilistic (general) rules, which are used to provide feedback to learners.
1 Introduction
With the growing popularity of the World Wide Web, Web-based education is becoming very popular owing to its flexibility in time and distance. It overcomes various problems associated with traditional central-classroom teaching, such as the limitations imposed by distance, time scheduling, class size, cost, and individual learning speed. It is a natural extension of the increased use of the Web to host course notes, tests and examples for on-campus teaching, and this approach has become very prominent recently. By definition [1,3], distance education takes place when instructors and students are separated by physical distance. The communication between them occurs through technologies such as audio and video teleconferencing, personal computers, email, fax, and multimedia systems, all of which are used to bridge the instructional gap [3]. For minimal cost, the Web is expanding educational opportunities for many who are unable to attend on-campus classes due to limited time, physical distance or physical disabilities. However, Web-based education does face a number of deficiencies as a consequence of the distance form of delivery. Lack of contact and immediate feedback between instructors and online students is one of the main problems in distance education [2]. In this paper we describe how to use rough set theory to generate rules to assess student performance. The model we propose tells students and teachers which sections of the course material are most important in terms of passing the final exam and provides self-assessment for learners taking on-line courses. With the availability of such automatic advising systems, the task of distance education learners will gradually be eased.
2 Rough Set Framework for Generating Default Rules
Rough set theory was proposed by Z. Pawlak [7] in the early eighties. Rough set methodology is concerned with the classificatory analysis of imprecise, uncertain or incomplete information or knowledge expressed in terms of data acquired from experience. The primary notions of the theory of rough sets are the approximation space and the lower and upper approximations of a set [4,6,7]. The main theme of rough set analysis is discernibility. Among the many rough set methodologies, the discernibility matrix can be used to obtain a minimal set of rules by searching for differences between objects. A discernibility matrix is a matrix in which the entries are the attributes which discern the classes of the corresponding row and column objects. The relative discernibility function computes the minimal sets of attributes that are necessary to distinguish one class from the other object classes defined by a set of attributes [6].
2.1 Default Rule Generation
It is very important to be able to handle inconsistencies in the data. From such data we may still be able to extract considerable information of interest, especially knowledge that reflects the most common or normal situation. As a result, default reasoning is used to generate rules that cover the most general characteristics in the data when problems are related to incompleteness and inconsistency in an information system. It has been demonstrated that such rules are more useful, since they are less susceptible to noise when they are applied to other input data. We briefly define here the framework for the generation of default rules, a process to represent and reason in nondeterministic information systems. An information system (IS) is an ordered pair A = (U, A), where U is a nonempty finite set of objects (the universe) and A is a nonempty finite set of elements called attributes. For a set of attributes B ⊆ A, IND(B) is an indiscernibility relation in the IS defined as IND(B) = {(x, y) ∈ U² | a(x) = a(y) for every a ∈ B}. The attributes in A are classified into disjoint sets of condition attributes C and decision attributes D. Consider the general rule schema Des(Ei, C) → Des(Xj, D), where Ei is a class over the set of condition attributes and Xj is a class over the set of decision attributes. Des(Ei, C) is called a class description. There is a map from the condition classes Ei into the decision classes Xj. To find whether the rule can be accepted as a default relation between class Ei ∈ U/IND(C) and Xj ∈ U/IND(D), we test the value of the membership function against the threshold µtr by evaluating µC(Ei, Xj) ≥ µtr [6]. The rules generated in the process depend upon the setting of µtr.
2.2 Finding More General Patterns
Given any deterministic or nondeterministic system, we are often interested in finding more general patterns in the data, making the rules simpler in structure.
We do this by forming unions of the classes induced by the condition attributes C. As a result, sets may be constructed to cover more class objects with simpler descriptions. The core of the approach is the idea of creating indeterminacy in information systems and generating rules that cover the majority of the cases by selecting projections over the condition attributes, thus allowing certain attributes to be excluded from consideration [6]. By doing this, we effectively join equivalence classes over the condition attributes, classes which may otherwise be mapped into different decisions. By recursively invoking the above rule generation procedure for each case of condition attributes projection, we can effectively join equivalence classes over the condition attributes, and therefore generate simplified rules that cover the majority of the cases.
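As a sketch of the acceptance test described in Sect. 2.1 (our own illustration; the classes here are hypothetical), a rule is kept as a default rule when its rough membership reaches µtr:

```python
# Default-rule acceptance: keep Des(E_i, C) -> Des(X_j, D) when the rough
# membership |E_i ∩ X_j| / |E_i| reaches the threshold mu_tr. A sketch.
def default_rules(cond_classes, dec_classes, mu_tr):
    rules = []
    for Ei in cond_classes:
        for Xj in dec_classes:
            mu = len(Ei & Xj) / len(Ei)
            if mu >= mu_tr:
                rules.append((Ei, Xj, mu))
    return rules

E = [frozenset({1, 2, 3, 4}), frozenset({5, 6})]
X = [frozenset({1, 2, 3, 5}), frozenset({4, 6})]
for Ei, Xj, mu in default_rules(E, X, mu_tr=0.7):
    print(sorted(Ei), "->", sorted(Xj), f"(mu = {mu:.2f})")
```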
3 Using Rough Sets for Learner Self-Assessment
In applying the rough set approach to the problem of decision generation, several algorithms are designed to extract information from a set of primitive data by building propositional rules that cover the available input knowledge. In the current study, a rough set model is adapted to support learner self-assessment in distance education courses. There are several advantages to using our adapted rough set methodology in comparison with the traditional decision tree method. First, it searches for all minimal rules, which often generates shorter rules compared to the rules in a minimal cover or a decision tree. Secondly, it offers better prediction for unforeseen cases which are not present in the training data, even when the data are inconsistent and incomplete. By analyzing student grade information, the rules behind the information tables can be derived. These rules can be applied either to identify the particular sections of the course material which students fail to fully understand and which therefore lead to failures in the final exam [5], or to help students assess their expected performance in the final exam based on the grades they have obtained so far on assignments, tests, or exams. We use an information table of student grades recorded from an introductory Computer Science class (CS100) at the University of Regina in which students use the Web for their course materials. We represent the objects that have the same values for the attributes by their equivalence class. We apply the rough set methodology to the information table to find the rules behind the data. After calculating the discernibility matrix and relative discernibility function for the information table, based on the set of prime implicants of the relative discernibility function we can obtain both definite and default rules for the system. From the rule base, we can find the section of material that is most relevant to failure on the final exam. By selecting projections over the condition attributes and computing the new discernibility matrix and relative discernibility functions, more general rules are generated. Finally, the generated rules can be used to predict the performance of a student in the final exam. In each rule, the decision values are distributed. For each distributed value of the decision attribute, there are probabilities S and
C associated with it, which represent the strength and coverage of the rule. A known problem with the application of all minimal rules is that it is hard to know which rule should be fired when one object fires more than one rule pointing to different decision values. In our approach, by considering that all fired rules are reasonable (in other words, each case is possible in the real world), we solve this problem by merging all rules fired by one object into a single rule associated with probabilities. After selecting all necessary projections, we get all the minimal rules and eventually obtain a series of definite and general rules. We call these rules a rule base. Since we have considered all possible cases, and in each situation the rules are generated from each relative reduct, we do not experience any information loss.
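A minimal sketch (ours; the text does not fix how S and C are combined, so summing strengths is an assumption) of merging the rules fired by one object:

```python
# Merge all rules fired by one object into one decision with probabilities.
from collections import defaultdict

def merge_fired_rules(fired):
    """fired: list of (decision, strength, coverage) from all rules the
    object matches; returns the best decision and per-decision support."""
    support = defaultdict(float)
    for decision, strength, coverage in fired:
        support[decision] += strength     # coverage could weight this too
    best = max(support, key=support.get)
    return best, dict(support)

fired = [("pass", 0.6, 0.4), ("fail", 0.3, 0.2), ("pass", 0.2, 0.3)]
print(merge_fired_rules(fired))    # ('pass', {'pass': 0.8, 'fail': 0.3})
```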
4 Summary
This paper provides a method to assess learner outcomes in online learning and thereby motivate students to improve their overall performance. The systematic application of all minimal rules obtained from the discernibility matrix is a novel approach used recently. By applying rough set methodology to course data, general rules are deduced for evaluating learner performance in courses. These rules can be used to inform instructors about how well students understand the course content and to inform students of important sections of the course materials. With this method, students can also predict their final grades based on their performance to date in a course. The method proposed in this paper provides additional feedback and interaction for distance learning environments, helping to make distance learning a more effective approach.
References
1. Moore, M.G., Thompson, M.M., with Quigley, A.B., Clark, G.C., Goff, G.G.: The Effects of Distance Learning: A Summary of the Literature. Research Monograph No. 2 (1990)
2. Sherry, L.: Issues in Distance Learning. International Journal of Educational Telecommunications 1 (4) (1996) 337–365
3. http://www.uidaho.edu/eo/distglan.html. January, 2002
4. Yao, Y.Y., Wong, S.K., Lin, T.Y.: A Review of Rough Set Models. In: T.Y. Lin, N. Cercone (eds.), Rough Sets and Data Mining: Analysis for Imprecise Data. Kluwer Academic Publishers, Boston (1997) 47–75
5. Liang, A.H., Maguire, B., Johnson, J.: Rough Set Based WebCT Learning. In: Proceedings of the 1st International Conference on Web-Age Information Management, Shanghai, P.R. China, June 21–23, 2000. Springer-Verlag LNCS 1846
6. Mollestad, T., Skowron, A.: A Rough Set Framework for Data Mining of Propositional Default Rules. In: The 9th International Symposium on Methodologies for Intelligent Systems, ISMIS’96, Zakopane, Poland, June 9–13, 1996
7. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11 (1982) 341–356
A Synthesis of Concurrent Systems: A Rough Set Approach
Zbigniew Suraj and Krzysztof Pancerz
Chair of Computer Science Foundations, University of Information Technology and Management, Sucharskiego Str. 2, 35-225 Rzeszów, Poland
{zsuraj,kpancerz}@wenus.wsiz.rzeszow.pl
Abstract. The synthesis problem has been discussed in the literature for various types of formalisms. Our approach is based on rough set theory and Petri nets. In this paper, information systems are used for representing knowledge about the modeled concurrent system, and coloured Petri nets are chosen as the model for concurrency. The paper provides an algorithm for constructing a model of a given concurrent system in the form of a net. The net construction consists of two stages. In the first stage, all dependencies between local states of processes in the system, represented by means of the minimal rules, are extracted. In the second stage, a coloured Petri net corresponding to these dependencies is built. Our approach uses a new method for generating the minimal rules in order to solve the synthesis problem considered here. It can be used to deal with various problems arising in the design of automated systems. The method proposed in the paper has been implemented in the ROSECON system, running on IBM PC computers under the Windows operating system; this system is continually being developed. In the paper we assume that the reader is familiar with the basic notions and notation of rough set theory as well as coloured Petri nets.
Keywords: Information systems, minimal rules, knowledge discovery, concurrent systems, coloured Petri nets.
1 The Synthesis Problem
Let A = {a1, ..., am} be a non-empty, finite set of local processes (attributes). With every local process a ∈ A is associated a finite set Va of its internal states (values of attributes). The behaviour of a modeled system can be presented in the form of a data table. Each row in the table includes a record of the local states of the processes from A, and each record is labeled by an element from the set U of global states (objects) of the system. The columns in the table are labeled by the names of the processes. The pair (U, A) is an information system [5], denoted by S. The problem is to construct, for a given information system S, its concurrent model in the
Institute of Mathematics, Rzeszów University, 35-310 Rzeszów, Rejtana Str. 16A, Poland
form of a coloured Petri net (CP-net) [1] CPN_S with the following property: the reachability set of markings [M0> of CPN_S defines an extension S′ of S created by adding to S all new global states corresponding to markings from [M0>, where M0 denotes an initial marking of the net. Moreover, S′ is the largest extension of S with that property [10]. The initial marking of CPN_S corresponds to any global state of S.
2 The Solution
Let S = (U, A) be an information system and let OPT(S) be the set of all minimal (w.r.t. the number of descriptors on the left hand side) rules in S [9].
ALGORITHM for constructing a concurrent model CPN_S of S:
Input: An information system S.
Output: CPN_S, the concurrent model of S in the form of a CP-net.
Step 1. Extract the minimal rules from the information system S using the method described in [9].
Step 2. Construct the net representing all attributes in the given information system S. Each place p_a of CPN_S corresponds to an attribute a ∈ A of S. The colour sets of the places in the net are labeled by the names of the attributes of S, and for each place the colour set consists of colours labeled by the names of the values of the given attribute. There is only one transition t in the constructed net; it represents the global state changes. The initial marking M0 of the net corresponds to an object u ∈ U of S chosen in an arbitrary way.
Step 3. The net obtained in Step 2 is extended by adding a guard expression to the transition t. The guard expression is determined using the minimal rules in S.
In order to realize Step 3 of the above algorithm we can perform the following procedure.
PROCEDURE for computing a guard expression:
Input: The set OPT(S) of all minimal rules in S.
Output: A guard expression corresponding to OPT(S).
Step 1. Rewrite each rule from OPT(S) into the disjunctive normal form of a Boolean formula.
Step 2. Construct the conjunction of the formulas obtained in Step 1.
Step 3. Use the laws of Boolean algebra to simplify the formula obtained in Step 2 into its minimal disjunctive normal form. The resulting formula is the guard expression corresponding to OPT(S).
Example 1. Consider an information system S = (U, A) with U = {u1, u2, u3, u4}, A = {a, b} and the values of the attributes as in Table 1.
Table 1. An information system
U/A  a  b
u1   0  1
u2   1  0
u3   0  2
u4   2  0
The set OPT(S) of minimal rules in S is as follows: IF a1 THEN b0, IF a2 THEN b0, IF b1 THEN a0, IF b2 THEN a0. After executing the above procedure we obtain the following Boolean expression: (a0 AND b0) OR (a0 AND b1) OR (a0 AND b2) OR (a1 AND b0) OR (a2 AND b0). The guard expression corresponding to OPT(S) is as follows: (ya = a0 AND yb = b0) OR (ya = a0 AND yb = b1) OR (ya = a0 AND yb = b2) OR (ya = a1 AND yb = b0) OR (ya = a2 AND yb = b0). The concurrent model of S in the form of a CP-net constructed using the above algorithm is shown in Fig. 1. The form of the guard expression associated with the transition t (see Fig. 1) differs slightly from the one presented above; it follows the syntax of the CPN ML language available in the Design/CPN system [11].
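The guard can be obtained mechanically; the sketch below (ours, not the ROSECON code) enumerates the value combinations satisfying every minimal rule, reproducing the five disjuncts above:

```python
# Deriving the guard of transition t from the minimal rules; a sketch.
from itertools import product

V = {"a": ["a0", "a1", "a2"], "b": ["b0", "b1", "b2"]}
# each rule: (premise attribute, premise value, conclusion attribute, value)
rules = [("a", "a1", "b", "b0"), ("a", "a2", "b", "b0"),
         ("b", "b1", "a", "a0"), ("b", "b2", "a", "a0")]

def guard_terms(V, rules):
    attrs = sorted(V)
    for combo in product(*(V[a] for a in attrs)):
        state = dict(zip(attrs, combo))
        # each rule is an implication: premise holds => conclusion holds
        if all(state[p] != pv or state[c] == cv for p, pv, c, cv in rules):
            yield state

for s in guard_terms(V, rules):
    print(" AND ".join(f"y{a} = {v}" for a, v in sorted(s.items())))
# prints the five admissible terms, e.g. "ya = a0 AND yb = b0", ...
```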
[Fig. 1. CP-net as a concurrent model of S: places pa and pb (with initial marking 1‘a0 and 1‘b1) connected to the single transition t. The declarations are: color a = with a0 | a1 | a2; color b = with b0 | b1 | b2; var xa, ya : a; var xb, yb : b. The guard of t is [(ya=a0 andalso yb=b0) orelse (ya=a0 andalso yb=b1) orelse (ya=a0 andalso yb=b2) orelse (ya=a1 andalso yb=b0) orelse (ya=a2 andalso yb=b0)].]
It is easy to verify that the reachability set of markings [M0> of CPN_S from Fig. 1 defines an extension S′ of S created by adding to S all new global states corresponding to markings from [M0>, where M0 denotes an initial marking of the net. Moreover, S′ is the largest extension of S with that property.
3 Concluding Remarks
In this paper a methodology for constructing a concurrent model from a given information system has been demonstrated. In the forthcoming paper [4] we study an application of the presented methodology to the synthesis of concurrent systems specified by dynamic information systems [8]. Applications of the concurrent model obtained from a given information system in the form of a classical Petri net [6] have been discussed in [7], [8], [10]. A method for constructing a concurrent model from an information system in the form of a net with inhibitor expressions has also been discussed in [3]. The model in the form of a CP-net presented in this paper is coherent, readable and simple to analyse.
Acknowledgment. We are grateful to the anonymous referee for helpful comments.
References
1. Jensen, K.: Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use. Vol. 1. Springer, Berlin 1992.
2. Pancerz, K., Suraj, Z.: ROSECON – a system for automatic discovering of concurrent models from data tables. In: Proc. of the IX Environmental Conference on Mathematics and Informatics, Korytnica, June 12–16, 2002, Poland (in Polish), p. 34.
3. Pancerz, K., Suraj, Z.: From Data to Nets with Inhibitor Expressions: A Rough Set Approach. In: Z. Suraj (Ed.), Proc. of the Sixth International Conference on Soft Computing and Distributed Processing, Rzeszów, June 24–25, 2002, Poland, pp. 102–106.
4. Pancerz, K., Suraj, Z.: Modelling Concurrent Systems Specified by Dynamic Information Systems: A Rough Set Approach. In: Proc. of the International Workshop on Rough Sets in Knowledge Discovery and Soft Computing, Warsaw, April 12–13, 2003 (to appear).
5. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht 1991.
6. Reisig, W.: Petri Nets. An Introduction. Springer, Berlin 1985.
7. Skowron, A., Suraj, Z.: Rough Sets and Concurrency. Bulletin of the Polish Academy of Sciences 41-3 (1993), 237–254.
8. Suraj, Z.: The Synthesis Problem of Concurrent Systems Specified by Dynamic Information Systems. In: L. Polkowski and A. Skowron (Eds.), Rough Sets in Knowledge Discovery, 2, Physica-Verlag, Berlin 1998, pp. 418–448.
9. Skowron, A., Suraj, Z.: Parallel Algorithm for Real-Time Decision Making: A Rough Set Approach. Journal of Intelligent Information Systems 7, Kluwer, Dordrecht 1996, 5–28.
10. Suraj, Z.: Rough Set Methods for the Synthesis and Analysis of Concurrent Processes. In: L. Polkowski, S. Tsumoto, T.Y. Lin (Eds.), Rough Set Methods and Applications, Springer, Berlin 2000, pp. 379–488.
11. http://www.daimi.au.dk/designCPN/
Towards a Line-Crawling Robot Obstacle Classification System: A Rough Set Approach

James F. Peters 1, Sheela Ramanna 1, and Marcin S. Szczuka 2

1 Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada, {jfpeters, ramanna}@ee.umanitoba.ca
2 Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland, [email protected]
Abstract. The basic contribution of this paper is the presentation of two methods that can be used to design a practical robot obstacle classification system based on data mining methods from rough set theory. These methods incorporate recent advances in rough set theory related to coping with the uncertainty in making obstacle classification decisions during the operation of a mobile robot. Obstacle classification is based on the evaluation of data acquired by proximity sensors connected to a line-crawling robot useful in inspecting power transmission lines. A fairly large proximity sensor data set has been used as a means of benchmarking the proposed classification methods, and also to facilitate comparison with other published studies of the same data set. Using a 10-fold cross-validated paired t-test, this paper compares the rough set classification learning method with the Waikato Environment for Knowledge Analysis (WEKA) classification learning method.
1 Introduction
This paper presents sensor data change classification learning based on proximity sensor measurements using data mining methods from rough set theory [1,6,7,8,9]. In the context of obstacle classification, the term data mining refers to knowledge-discovery methods used to find relationships among proximity sensor data sets and to extract rules useful in identifying obstacles encountered by a mobile robot. Such rules can be used by a robot in navigation planning and in mapping its environment. The focus of this paper is an introduction to an approach to solving this problem based on recent findings in rough set theory and the availability of a number of complete rough set toolsets. It has been shown that rough sets work well in coping with the uncertainty in various classification systems [7], and in the design of rough-set-based classification systems [7]. This paper also compares rough set classification learning with Waikato Environment for Knowledge Analysis (WEKA) classification learning [10,11]. The contribution of this paper is the presentation of two models for obstacle classification based on rough sets.
2 LCR Navigation Problem
Basic features of the line-crawling robot (LCR) navigation problem are described in this section. A principal task of the LCR control system is to guide the movements of the robot so that it maintains a safe distance from overhead conductors and from any objects, such as insulators, attached to conductor wires or to the towers used to hold conductors above the ground. To move along a conductor, the LCR must continuously measure the distances between itself and other objects around it, and must detect and maneuver to avoid collisions with obstacles. Let a1, a2, a3, a4 denote proximity sensors (e.g., ultrasonic sensors as in Fig. 1).
Fig. 1. Sample proximity sensors of the LCR
We want to obtain a classifier so that the LCR can make a decision about which type of movement it can safely make based on readings from its sensors. Table 1 gives some sample navigation decisions based on aggregate information from the fusion of various sensors. Let d denote a decision class. Sensor measurements are separated into collections (a form of sensor fusion) used to construct convex sets. Each convex set contains sensor measurements (often with some noise) either close to a preset threshold or significantly greater than {less than} the threshold. In effect, this form of convex set represents what is known as an upper approximation in rough set theory. In addition, each convex set is associated with an LCR navigation decision that initiates a set of many movements of the robot parts to carry out an LCR maneuver.
3 Comparison of Classification Learning Algorithms
A comparison of the Rough Set Exploration System (RSES 2.0, cf. [9]) and WEKA classification learning algorithms in terms of the error rate is given in
this section. Variations of error and a comparison of pairs of differences in error rates for RSES and WEKA using the 10-fold cross-validated paired t-test are given in this section. This section gives a brief discussion of the variations in error rates using 10-fold cross-validation with RSES and WEKA. The variation in the error rates across the 10 folds for the sensor data in the discretized case is shown in Fig. 2a. The plots show that the error rate is consistently lower for RSES 2.0 in the discretized case, whereas the error rate is lower for WEKA in the non-discretized case (see Fig. 2b). With the k-fold cross-validated paired t-test we want to test the hypothesis that the mean difference between the two classification learning algorithms used by RSES and WEKA is zero. Let µd denote the mean difference in the error rates during a 10-fold classification of sensor data. Let H0 denote the hypothesis to be tested (i.e., H0 : µd = 0). This is our null hypothesis. The paired difference t-test is used to test this hypothesis against its alternative (HA : µd ≠ 0). We start with pairs (ε1_1, ε2_1), ..., (ε1_n, ε2_n), where ε1_i, ε2_i are the ith error rates resulting from the application of the RSES and WEKA classification learning algorithms, respectively, and i = 1, ..., n. Let d_i = ε1_i − ε2_i. Underlying the null hypothesis H0 is the assumption that the d_i values are normally and independently distributed with mean µd and variance σd². Let d̄, Sd² denote the mean difference and variance in the error rates of a random sample of size n from a normal distribution N(µd, σd²), where µd and σd² are both unknown.
Fig. 2. ε-variations: a) Discretized case; b) Non-discretized case.
The t statistic used to test the null hypothesis is as follows:
$$t = \frac{\bar{d} - \mu_d}{S_d/\sqrt{n}} = \frac{\bar{d} - 0}{S_d/\sqrt{n}} = \frac{\bar{d}\sqrt{n}}{S_d}$$
where t has a Student's t-distribution with n − 1 degrees of freedom [3]. The shape of the t distribution depends on the sample size n − 1 (number of degrees of freedom). In our case, n − 1 = 9 relative to 10 sample error rates. The significance
level α of the test of the null hypothesis H0 is the probability of rejecting H0 when H0 is true. Let t_{n−1,α/2} denote the t-value to the right of which lies α/2 of the area under the curve of the t-distribution with n − 1 degrees of freedom. Next, formulate the following decision rule: Decision Rule: Reject H0 (µd = 0) at significance level α iff |t| > t_{n−1,α/2}. Values for t_{n−1,α/2} can be obtained from a t-distribution table [2]. In what follows, α = 0.10 and n − 1 = 9. Consider, for example, the paired t-test applied to the error rates obtained from the use of RSES and WEKA in classifying the sensor data (discretized case). With 9 degrees of freedom, we find that Pr(|t| < 1.833) = 0.95, where t_{n−1,α/2} = t_{9,0.05} = 1.833. It was found that the null hypothesis H0 can be rejected, since |t| = |−19.56| > 1.833 (non-discretized case) and |t| = |18.06| > 1.833 (discretized case) at the 10% significance level. In both the discretized and non-discretized cases, |t| > t_{9,0.05}. Hence, the null hypothesis is rejected (in effect, µd ≠ 0) at the 10% significance level. The average error rates for WEKA and RSES differ quite significantly in both cases. It is noteworthy that RSES does better than WEKA in the discretized case in classifying the proximity sensor data. This is significant since sensor data is real-valued.
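The test can be sketched in a few lines of Python. The per-fold error rates below are illustrative placeholders, not the values from the study; only the computation of t and the decision rule follow the text.

```python
import math

# Illustrative per-fold error rates for the two learners (not study data).
rses_err = [0.10, 0.12, 0.09, 0.11, 0.10, 0.08, 0.12, 0.11, 0.09, 0.10]
weka_err = [0.16, 0.17, 0.15, 0.18, 0.15, 0.14, 0.19, 0.16, 0.15, 0.17]

d = [e1 - e2 for e1, e2 in zip(rses_err, weka_err)]   # d_i = eps1_i - eps2_i
n = len(d)
d_bar = sum(d) / n                                     # mean difference
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))  # sample std dev
t = d_bar * math.sqrt(n) / s_d                         # t with n-1 dof

# Reject H0: mu_d = 0 at level alpha iff |t| > t_{n-1, alpha/2};
# for alpha = 0.10 and 9 degrees of freedom, t_{9,0.05} = 1.833.
print(f"t = {t:.3f}, reject H0: {abs(t) > 1.833}")
```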
4 Conclusion
Two rough set methods for deriving obstacle rules useful in mobile robot navigation have been presented. A rough set classification learning algorithm provided by RSES has been compared with the WEKA classification learning algorithm using the 10-fold cross-validated paired t-test. In both the non-discretized and discretized cases, the results of the paired t-test reveal that the WEKA and RSES 2.0 classification algorithms are different. In classifying sensor data sets in the non-discretized case, the WEKA wrapper method outperforms the exhaustive method in RSES in classifying the obstacles. By contrast, RSES error rates are significantly lower than the WEKA error rates across the ten folds in the discretized case. The non-discretization method is more suitable for making obstacle decisions in a uniform, unchanging environment; however, rules derived using non-discretized data lack generality. The discretization method results in a set of obstacle classification rules that cover more cases than the rules produced using the non-discretization method. Based on the results of the study reported in this paper, the rough set methodology offers a promising basis for designing a navigation planner and environment mapper for a mobile robot that lives in a changing environment. Acknowledgements. The research of James Peters has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) research grant 185986 and grants from Manitoba Hydro. The research by Sheela Ramanna has been supported by NSERC research grant 194376 and a grant from UITM. Marcin Szczuka has been supported by KBN grant 8T11C02519
and the Wallenberg foundation. The authors also wish to acknowledge the help and suggestions from Jeff Babb and Andrzej Skowron.
References
1. Bazan, J., Szczuka, M., Wróblewski, J.: A new version of the rough set exploration system. In: Proceedings of RSCTC'02, LNAI 2475. Springer-Verlag, Berlin, 2002, 397–404.
2. Beyer, W.H.: Handbook of Tables for Probability and Statistics. CRC Press, Ohio, 1968.
3. Hogg, R.V., Tanis, E.A.: Probability and Statistical Inference. Macmillan Publishing Co., New York, 1997.
4. Hussein, A., Dietterich, T.G.: Efficient algorithms for identifying relevant features. Proc. of the 9th Canadian Conf. on AI, Vancouver, B.C., 1992, 38–45.
5. Mitchell, T.M.: Machine Learning. McGraw-Hill, NY, 1997.
6. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Boston, 1991.
7. Peters, J.F., Skowron, A., Suraj, Z., Pedrycz, W., Pizzi, N., Ramanna, S.: Classification of meteorological volumetric radar data using rough set methods. Pattern Recognition Letters 24(6), 2002, 911–920.
8. Rosetta 1999. http://www.idi.ntnu.no/~aleks/rosetta/
9. RSES 2002. http://logic.mimuw.edu.pl/~rses/
10. WEKA 2002. http://www.cs.waikato.ac.nz/ml/weka
11. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
Order Based Genetic Algorithms for the Search of Approximate Entropy Reducts

Dominik Ślęzak 1,2 and Jakub Wróblewski 2

1 Department of Computer Science, University of Regina, Regina, Canada
2 Polish-Japanese Institute of Information Technology, Warsaw, Poland
Abstract. We use entropy to extend the rough set based notion of a reduct. We show that the order based genetic algorithms, applied to the search of classical decision reducts, can be used in exactly the same way in case of extracting optimal approximate entropy reducts from data.
1 Introduction
In the theory of rough sets [4] a universe of objects is the only source of knowledge usable to construct reasoning models. In classification problems the goal is to approximate the values of a decision attribute under the information provided by conditional attributes. New objects are classified using "if...then..." decision rules learnt from the known cases. Due to the Minimum Description Length Principle (MDLP) [7], adapted to rough sets e.g. in [2], we search for the simplest decision-rule-based models, using various heuristics. As one of such heuristics, the order based genetic algorithm for the extraction of minimal decision reducts was proposed in [11]. We modify it to search for approximate reducts – the attribute subsets inducing rules which approximate decision classes accurately enough. We label each subset B of available attributes A with its entropy H(B) [3], inversely related to the strength of the model induced by B; with the conditional entropy H(d/B) – the model's inaccuracy in predicting the decision [2,8]; and with H(B/d) – its sensitivity [6]. Due to the Approximate Entropy Reduction Principle (AERP) [10], we minimize H(B) (or H(B/d)) while keeping H(d/B) at a reasonable level.
2 Rough Sets and Probabilities

In [4] data is represented as an information system A = (U, A), where U is the universe of objects and each attribute a ∈ A provides a function a : U → V_a into the set of values of a. For any B ⊆ A and u ∈ U, we define the B-information vector B(u) = ⟨b_1(u), ..., b_{|B|}(u)⟩, where b_j(u) is the value of b_j ∈ B. The set of all such vectors equals V_B^U = {B(u) : u ∈ U}. Each B ⊆ A induces a U-partition with B-indiscernibility classes ‖(B, w)‖_A = {u ∈ U : B(u) = w}, for w ∈ V_B^U.
Supported by the Polish National Committee for Scientific Research (KBN) grant No. 8T11C02519, as well as by the Research Centre of PJIIT.
Pairs (B, w), B ⊆ A, w ∈ V_B^U, are patterns, interpreted as conjunctions of descriptors (a, v), a ∈ B, v ∈ V_a. Classes ‖(B, w)‖_A are their supports. The data-driven probability P(w) = |‖(B, w)‖_A| / |U| reflects the strength of (B, w). The task of analysis is often concerned with defining a distinguished decision by the rest of the attributes. In this case, we represent data as a decision system A = (U, A ∪ {d}), where d ∉ A, V_d = {1, ..., |V_d|}. For each k ∈ V_d, we define the k-th decision class X_k = {u ∈ U : d(u) = k}. The data-driven probability P(k/w) = |‖(B, w)‖_A ∩ X_k| / |‖(B, w)‖_A| of k ∈ V_d conditioned by w ∈ V_B^U corresponds to the precision of the decision rule B = w ⇒ d = k. In rough set applications, one often operates with the object-related decision rules B = B(u) ⇒ d = d(u), for u ∈ U. Then, precision (strength) is expressed as P(d(u)/B(u)) (P(B(u))). The precision–strength balance refers to the MDLP principle [7]. The strength can be replaced with the complexity of the rule's description or, e.g., the rule's sensitivity P(B(u)/d(u)) = |‖(B, B(u))‖_A ∩ X_{d(u)}| / |X_{d(u)}|, opposing precision P(d(u)/B(u)) in the Relative Operating Characteristic (ROC) approach [6].
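As a minimal sketch (Python, toy data; all names are illustrative), the strength P(w) and precision P(k/w) of a pattern can be estimated directly from support counts as described above:

```python
from collections import Counter

# Toy decision system: rows are (B-information vector, decision value).
rows = [((0, 1), 1), ((0, 1), 1), ((0, 1), 0), ((1, 0), 0), ((1, 0), 0)]

n = len(rows)
support_w = Counter(w for w, _ in rows)            # |  (B,w)  |, pattern supports
support_wk = Counter(rows)                          # |  (B,w) intersected with X_k |

w, k = (0, 1), 1
strength = support_w[w] / n                         # P(w)
precision = support_wk[(w, k)] / support_w[w]       # P(k/w)
print(strength, precision)                          # 0.6 0.666...
```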
3 Information Entropy
Entropy evaluates the information which we get from the fact that a given random event occurred [2,3]. Given the event's probability p > 0, it is defined as h(p) = − log p. For instance, given w ∈ V_B^U, one can state the entropy of the pattern (B, w) as equal to − log P(w). In its generalized form, entropy evaluates random distributions p⃗ = ⟨p_1, ..., p_r⟩ with the expected degree of information $H(\vec{p}) = -\sum_{k:\,p_k>0} p_k \log p_k$. For instance, we can label B ⊆ A with its entropy
$$H(B) = -\sum_{w \in V_B^U} P(w) \log P(w) \qquad (1)$$
interpreted as the average degree of information about objects, obtainable from knowledge about their values on B. The conditional entropy H(d/B) of d given B evaluates the information obtainable from d, given already provided B [2,3]. Similarly, we interpret H(B/d). By definition, we have the following:
$$H(d/B) = H(B \cup \{d\}) - H(B) \qquad H(B/d) = H(B \cup \{d\}) - H(\{d\}) \qquad (2)$$
We can use H(d/B) to label each B ⊆ A with the amount of uncertainty concerning d under information about B. We have the following inequalities:
$$0 \le H(d/A) \le H(d/B) \le H(\{d\}) \qquad (3)$$
where: H(d/A) = 0 holds iff A defines d, i.e. iff P(d(u)/B(u)) = 1 for any u ∈ U; H(d/A) = H(d/B) holds iff B makes d conditionally independent from A \ B, i.e. iff P(d(u)/B(u)) = P(d(u)/A(u)) for any u ∈ U; and H(d/B) = H({d}) holds iff d is independent from B. We also have the following equalities [10]:
$$H(B) = -\log G(B) \qquad H(d/B) = -\log G(d/B) \qquad H(B/d) = -\log G(B/d) \qquad (4)$$
where the quantities $G(B) = \sqrt[|U|]{\prod_{u\in U} P(B(u))}$, $G(d/B) = \sqrt[|U|]{\prod_{u\in U} P(d(u)/B(u))}$ and $G(B/d) = \sqrt[|U|]{\prod_{u\in U} P(B(u)/d(u))}$ are the average strength, precision and sensitivity of the object-related decision rules induced by B ⊆ A.
4 Approximate Entropy Reducts
Due to the MDLP principle, we should tend to the model's simplification, unless it causes a loss of its accuracy. This idea corresponds to the rough set notion of a decision reduct: an irreducible B ⊆ A defining d in A = (U, A ∪ {d}). If A is inconsistent, i.e. even the whole A does not define d, a question about the reduction criterion arises. In [8,10] we considered µ-decision reducts: irreducible subsets B ⊆ A such that P(d(u)/B(u)) = P(d(u)/A(u)) for any u ∈ U. A subset B ⊆ A is a µ-decision reduct iff it is a Markov boundary of d for the product distribution P over A ∪ {d}, i.e. it is an irreducible subset of attributes (random variables) which provides the same probabilistic information about d as A [5]. Equivalently, B is a µ-decision reduct iff H(d/B) = H(d/A) and H(d/B) > H(d/B \ {a}) for any a ∈ B. If A = (U, A ∪ {d}) is consistent, i.e. A defines d, then such reducts-boundaries coincide with classical decision reducts: B is a decision reduct iff H(d/B) = 0 and H(d/B \ {a}) > 0 for any a ∈ B. The reduction of attributes causes a potential growth of the conditional entropy, i.e., a potential average decrease of the precision of the object-related decision rules. Let us consider the constraint G(d/B) ≥ (1 − ε)G(d/A), pointing at subsets B ⊆ A which induce rules that are on average ε-almost as precise as those induced by the whole A. By taking the logarithm of both sides, we get the following criterion:
$$H(d/B) + \log(1 - \varepsilon) \le H(d/A) \qquad (5)$$
We say that B is an ε-approximate µ-decision reduct iff it is an irreducible set satisfying (5). Originally, reducts were evaluated by the number of attributes involved. In [10] the following generalization was proposed: Given ε ∈ [0, 1), the Minimal ε-Approximate Decision Reduct Problem (MεDRP) is the task of finding a minimal (by cardinality) B satisfying (5). One can also evaluate reducts by the number of distinct rules or by the measures of strength and sensitivity. The Minimal Rule (MRεDRP), H-Strength Optimal (StεDRP) and H-Sensitivity Optimal (SeεDRP) ε-Approximate Decision Reduct Problems are the tasks of finding a minimal (by cardinality) ε-approximate µ-decision reduct C ⊆ A such that
$$f(C) = \min_{B \subseteq A:\ B \text{ satisfies } (5)} f(B) \qquad (6)$$
for f : P(A) → R defined by $f(B) = |V_B^U|$, H(B), and H(B/d), respectively.
5 Order Based Genetic Algorithms

The MεDRP, MRεDRP, StεDRP, and SeεDRP problems are NP-hard for any ε ∈ [0, 1) [10]. Therefore, one cannot expect fast and reliable tools for solving them in a deterministic way. We propose to extend the order based genetic algorithm (o-GA) for searching for minimal decision reducts [11] for the purpose of finding (sub)optimal ε-approximate µ-decision reducts. As a hybrid algorithm [1], our o-GA consists of two parts:
1. a genetic part, where each chromosome encodes a permutation of attributes;
2. a heuristic part, where permutations τ are put into the following algorithm:
ε-REDORD algorithm:
1. Let A = (U, A ∪ {d}) and τ : {1, ..., |A|} → {1, ..., |A|} be given;
2. Let B_τ = A; for i = 1 to |A| repeat steps 3 and 4;
3. Let B_τ ← B_τ \ {a_{τ(i)}};
4. If B_τ does not satisfy condition (5), undo step 3.
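A minimal Python sketch of ε-REDORD under the entropy formulation of criterion (5); the table, attribute names and the entropy helper are illustrative, with base-2 logarithms assumed throughout:

```python
from collections import Counter
from math import log2

def H(table, cols):
    """Entropy of the partition induced by the attribute set `cols`."""
    cnt = Counter(tuple(r[c] for c in cols) for r in table)
    n = len(table)
    return -sum(k / n * log2(k / n) for k in cnt.values())

def eps_redord(table, attrs, d, tau, eps):
    """eps-REDORD: drop attributes in the order given by permutation `tau`,
    undoing a removal whenever criterion (5) would be violated."""
    cond = lambda B: H(table, B + [d]) - H(table, B)   # H(d/B)
    bound = cond(list(attrs))                          # H(d/A)
    B = list(attrs)
    for i in tau:                                      # steps 2-4
        trial = [x for x in B if x != attrs[i]]        # step 3
        if cond(trial) + log2(1 - eps) <= bound:       # criterion (5)
            B = trial                                  # removal kept
    return B                                           # otherwise step 4 undoes it

table = [{'a': 0, 'b': 1, 'd': 1}, {'a': 1, 'b': 0, 'd': 0},
         {'a': 0, 'b': 2, 'd': 1}, {'a': 2, 'b': 0, 'd': 0}]
print(eps_redord(table, ['a', 'b'], 'd', [0, 1], eps=0.0))  # ['b']
```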
Compared to [11], we replace the condition of defining d with criterion (5).
Proposition 1. ε-REDORD always yields an ε-approximate µ-decision reduct. For any ε-approximate µ-decision reduct B there exists a τ such that B = B_τ.
Each genetic algorithm simulates the evolution of individuals [1,11]. Its behavior depends on the specification of the fitness function, which evaluates individuals. In the proposed o-GA, we define the fitness of a given permutation-individual τ by the quality of the B_τ resulting from ε-REDORD. It can be done, e.g., as follows:
$$fitness(\tau) = 2^{-f(B_\tau)} \qquad (7)$$
for f : P(A) → R defined at the end of Section 4. The following result, together with Proposition 1, assures that the o-GA with fitness (7) can be applied to search for solutions of the ε-approximate µ-decision reduct optimization problems:
Proposition 2. B is a solution of MεDRP iff B = B_τ for a τ maximizing (7) with f(B_τ) = |B_τ|. If B is a solution of MRεDRP, StεDRP, or SeεDRP, then B = B_τ for a τ maximizing (7) with $f(B_\tau) = |V_{B_\tau}^U|$, H(B_τ), H(B_τ/d), respectively.
In some cases, the obtained reduct B_τ which maximizes fitness(τ) may not be the one with minimal cardinality. This is, however, very improbable, because smaller attribute subsets are obtained for larger families of permutations. This is an important advantage of the proposed hybrid algorithm.
References
1. Davis, L. (ed.): Handbook of Genetic Algorithms. Van Nostrand Reinhold (1991).
2. Duentsch, I., Gediga, G.: Uncertainty measures of rough set prediction. Artificial Intelligence 106 (1998) pp. 77–107.
3. Kapur, J., Kesavan, H.: Entropy Optimization Principles with Applications. Academic Press (1992).
4. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning About Data (1991).
5. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann (1988).
6. Provost, F., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparing induction algorithms. In: Proc. of IMLC'98 (1998).
7. Rissanen, J.: Minimum-description-length principle. In: S. Kotz, N.L. Johnson (eds.), Encyclopedia of Statistical Sciences. Wiley (1985) pp. 523–527.
8. Ślęzak, D.: Approximate reducts in decision tables. In: Proc. of IPMU'96 (1996).
9. Ślęzak, D.: Approximate decision reducts (in Polish). Ph.D. thesis, Institute of Mathematics, Warsaw University (2001).
10. Ślęzak, D.: Approximate Entropy Reducts. Accepted to Fundamenta Informaticae.
11. Wróblewski, J.: Theoretical Foundations of Order-Based Genetic Algorithms. Fundamenta Informaticae 28/3-4 (1996) pp. 423–430.
12. Wróblewski, J.: Adaptive methods of object classification (in Polish). Ph.D. thesis, Institute of Mathematics, Warsaw University (2001).
Variable Precision Bayesian Rough Set Model

Dominik Ślęzak 1,2 and Wojciech Ziarko 1

1 Department of Computer Science, University of Regina, Regina, SK, S4S 0A2, Canada
2 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
Abstract. We present a parametric extension of the Bayesian Rough Set (BRS) model. Its properties are investigated in relation to nonparametric BRS, classical Rough Set (RS) model and the Variable Precision Rough Set (VPRS) model.
1 Introduction

In [4] a non-parametric modification of the Variable Precision Rough Set (VPRS) model was introduced, where the prior probability of the target event was used as a benchmark. We proposed to divide the universe of interest into three regions: the positive region, where the probability of the target event (set) occurrence is higher than the prior probability; the negative region, where the probability of the target set occurrence is lower than the prior probability; and the boundary region, where it is equal to the prior probability. Such a categorization of the universe led us to generalized non-parametric definitions of set approximations in the style of the VPRS model. The resulting theory of approximately defined sets was called the Bayesian Rough Set (BRS) model, a special case of the VPRS model. In this paper, we introduce another modification, the Variable Precision Bayesian Rough Set (VPBRS) model, where the rough set approximation regions correspond to the following scenarios: (1) the acquired information increases enough our perception of the likelihood that the event of interest would happen; (2) the acquired information increases enough the assessment of the probability that the event would not happen; (3) the acquired information has almost no effect at all. The words "enough" and "almost" are defined in terms of constraints parameterized by an appropriately tuned approximation threshold ε ∈ [0, 1).
2 Probabilistic Framework
Let U denote a finite universe of objects referred to as elementary events. We assume the existence of the prior probability function P (X), for any subset (event) X ⊆ U . We assume that all subsets under consideration in this article are likely
Supported by the research grant of the Research Centre of PJIIT, as well as the research grant of the Natural Sciences and Engineering Research Council of Canada.
to occur and that their occurrence is not certain, that is, that 0 < P(X) < 1. We assume the existence of an equivalence relation on U with a finite number of equivalence classes (elementary sets) E ⊆ U, such that P(E) > 0. The elementary sets are normally obtained by grouping together objects having identical values on a selected set of features (attributes). Each E is assumed to be assigned the conditional probability P(X|E). The values of the conditional and prior probabilities are normally estimated from sample data by putting P(X|E) = card(X∩E)/card(E) and P(X) = card(X)/card(U). In Bayesian reasoning [1], however, the prior probability P(X) is not necessarily derivable directly from data. In particular, it may represent the background knowledge of a domain expert.
3 Rough Set Framework

The original Rough Set (RS) model provides the definitions of the approximation regions of events. We express them in probabilistic terms to emphasize the similarities and differences between the various generalizations presented in the foregoing sections. The positive, negative, and boundary regions are defined, respectively, by:
$$POS(X) = \bigcup\{E : P(X|E) = 1\} \quad NEG(X) = \bigcup\{E : P(X|E) = 0\} \quad BND(X) = \bigcup\{E : P(X|E) \in (0, 1)\} \qquad (1)$$
The above regions correspond to the areas of the universe where the occurrence of X is, respectively, certain, unlikely, and possible but not certain. There is a correspondence between the regions of X and ¬X, expressed by the equalities POS(X) = NEG(¬X), POS(¬X) = NEG(X) and BND(¬X) = BND(X). If BND(X) is empty, X is said to be definable [3].
4 VPRS: Variable Precision Rough Set Model

The VPRS model [5] of rough sets is an extension of the RS model aimed at increasing its discriminatory capabilities by using parameter-controlled grades of the conditional probability associated with the elementary sets. The asymmetric VPRS generalization [2] is based on the lower and upper limit certainty threshold parameters l and u, which satisfy the constraints 0 ≤ l < P(X) < u ≤ 1. The u-positive, l-negative and (l, u)-boundary regions are given, respectively, by:
$$POS_u(X) = \bigcup\{E : P(X|E) \ge u\} \quad NEG_l(X) = \bigcup\{E : P(X|E) \le l\} \quad BND_{l,u}(X) = \bigcup\{E : P(X|E) \in (l, u)\} \qquad (2)$$
In the context of data mining, VPRS's ability to flexibly control the definitions of the approximation regions allows for capturing probabilistic relations existing in data. It is obvious that the original RS model is a special case of VPRS, for l = 0 and u = 1. Usually, however, more interesting results are expected for non-trivial settings, tuned to the particular data and satisfying the constraints 0 ≤ l < P(X) < u ≤ 1.
5 BRS: Bayesian Rough Set Model
In some applications, the objective is to achieve some improvement of the certainty of prediction based on the available information, rather than trying to produce rules satisfying preset certainty requirements. Therefore, it appears appropriate not to use any parameters to control model derivation. We present a modification of the VPRS model which allows for the derivation of parameter-free predictive models from data while preserving the essential notions and methods of rough set theory. The BRS positive, negative, and boundary regions are defined, respectively, by:
$$POS^*(X) = \bigcup\{E : P(X|E) > P(X)\} \quad NEG^*(X) = \bigcup\{E : P(X|E) < P(X)\} \quad BND^*(X) = \bigcup\{E : P(X|E) = P(X)\} \qquad (3)$$
POS*(X) and NEG*(X) are, respectively, the areas of certainty improvement and loss with respect to predicting the occurrence of X. Information defining BND*(X) is totally unrelated to X. Just like in the original RS model, we have a correspondence between the approximation regions of X and ¬X, i.e. POS*(X) = NEG*(¬X), POS*(¬X) = NEG*(X), BND*(¬X) = BND*(X). Moreover, the following equalities hold:
$$POS^*(X) = \bigcup\{E : P(E|X) > P(E|\neg X)\} \quad NEG^*(X) = \bigcup\{E : P(E|X) < P(E|\neg X)\} \quad BND^*(X) = \bigcup\{E : P(E|X) = P(E|\neg X)\} \qquad (4)$$
They allow us to think about the BRS regions in terms of basic Bayesian hypothesis-testing tools [1]. They indicate that the elementary sets constituting the positive BRS region of X would occur more frequently as a consequence of the occurrence of X than due to the occurrence of ¬X. For example, in a medical application it would mean that the combination of symptoms represented by the description of the elementary set E is more likely to be caused by the presence of the disease X than by other reasons. A similar interpretation can also be provided for the BRS negative and boundary regions.
6 VPBRS: Variable Precision Bayesian Rough Set Model
In this section, we introduce a parameterized version of the BRS model, where the differences between the probabilities P(X|E) and P(X) are measured with respect to a significance threshold ε ∈ [0, 1). The VPBRS positive, negative, and boundary regions are respectively defined as:
$$POS^\varepsilon(X) = \bigcup\{E : P(X|E) \ge 1 - \varepsilon(1 - P(X))\} \quad NEG^\varepsilon(X) = \bigcup\{E : P(X|E) \le \varepsilon P(X)\} \quad BND^\varepsilon(X) = \bigcup\{E : P(X|E) \in (\varepsilon P(X),\ 1 - \varepsilon(1 - P(X)))\} \qquad (5)$$
These are the areas where the probability of X is, respectively: ε-entirely greater than, ε-entirely lower than, and ε-almost equal to P(X). Clearly, the quantities εP(X) and
1 − ε(1 − P(X)) correspond, respectively, to the lower and upper thresholds l and u in VPRS. We obtain both the duality conditions POS^ε(X) = NEG^ε(¬X), NEG^ε(X) = POS^ε(¬X), BND^ε(X) = BND^ε(¬X), as well as consistency with the constraints 0 ≤ l < P(X) < u ≤ 1 of the VPRS model, expressed here as 0 ≤ εP(X) < P(X) < 1 − ε(1 − P(X)) ≤ 1. The bounds of divergence between P(E|X) and P(E|¬X) can be expressed in terms of the probabilities P(E) and P(X) as follows:
$$POS^\varepsilon(X) = \bigcup\{E : P(E|X) - P(E|\neg X) \ge (1 - \varepsilon)P(E)/P(X)\} \quad NEG^\varepsilon(X) = \bigcup\{E : P(E|\neg X) - P(E|X) \ge (1 - \varepsilon)P(E)/(1 - P(X))\} \qquad (6)$$
As a result, we conclude that E ⊆ BND^ε(X) only if −(1−ε)P(E)/P(¬X) < P(E|X) − P(E|¬X) < (1−ε)P(E)/P(X). This provides the bounds for the degree of closeness between P(E|X) and P(E|¬X). The bounds are used to declare that the information associated with E is almost completely irrelevant relative to X, subject to the choice of ε ∈ [0, 1). That is, for lower values of ε, we are more likely to declare the irrelevancy (independence) of E relative to X. In the special case of ε = 0, any E such that P(E|X) > 0 and P(E|¬X) > 0 is considered irrelevant, as it falls into BND^0(X) = BND(X). On the other hand, for larger values of ε we are less likely to declare the irrelevancy of E with respect to X, with the extreme case corresponding to P(E|X) − P(E|¬X) = 0 of the BRS model, achievable in the limit with ε approaching 1.
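A minimal sketch of the VPBRS regions of equation (5), assuming elementary sets given as frozensets and probabilities estimated from counts (all names and data are illustrative); as noted above, the BRS regions of Section 5 are approached in the limit ε → 1:

```python
def vpbrs_regions(elementary, X, U, eps):
    """VPBRS positive/negative/boundary regions per eq. (5); eps in [0, 1)."""
    pX = len(X) / len(U)                      # prior probability P(X)
    pos, neg, bnd = set(), set(), set()
    for E in elementary:
        p = len(X & E) / len(E)               # P(X|E) estimated from data
        if p >= 1 - eps * (1 - pX):
            pos |= E                          # eps-entirely greater than P(X)
        elif p <= eps * pX:
            neg |= E                          # eps-entirely lower than P(X)
        else:
            bnd |= E                          # eps-almost equal to P(X)
    return pos, neg, bnd

U = frozenset(range(8))
elementary = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5, 6, 7})]
X = frozenset({0, 1, 2, 5})
print(vpbrs_regions(elementary, X, U, eps=0.5))
# ({0, 1}, set(), {2, 3, 4, 5, 6, 7})
```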
7 Final Remarks
We introduced a parametric refinement of the Bayesian rough set model by allowing a single parameter-controlled degree of ε-imprecision in the definition of the boundary area, based on the prior probability of the target set as a reference. The use of the parameter ε makes the Bayesian rough set model more applicable to practical problems where small deviations from the prior probability are likely to occur due to noise or measurement inaccuracy.
References
1. Box, G.E.P., Tiao, G.C.: Bayesian Inference in Statistical Analysis. Wiley (1992).
2. Katzberg, J., Ziarko, W.: Variable precision rough sets with asymmetric bounds. In: Proc. of the International Workshop on Rough Sets and Knowledge Discovery (RSKD'93) (1993) pp. 163–191.
3. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers (1991).
4. Ślęzak, D., Ziarko, W.: Bayesian Rough Set Model. In: Proc. of the International Workshop on Foundation of Data Mining and Discovery (FDM'2002). December 9, Maebashi, Japan (2002) pp. 131–135.
5. Ziarko, W.: Variable Precision Rough Sets Model. Journal of Computer and Systems Sciences, vol. 46, no. 1 (1993) pp. 39–59.
Linear Independence in Contingency Table

Shusaku Tsumoto

Department of Medical Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo City, Shimane 693-8501, Japan, [email protected]
Abstract. A contingency table summarizes the conditional frequencies of two attributes and shows how these two attributes are dependent on each other. Thus, this table is a fundamental tool for pattern discovery with conditional probabilities, such as rule discovery. In this paper, a contingency table is interpreted from the viewpoint of granular computing. The first important observation is that contingency tables compare two attributes with respect to granularity, which means that an n×n table compares two attributes with the same granularity, while an m×n (m ≥ n) table can be viewed as a projection from m partitions to n partitions. The second important observation is that matrix algebra is a key point of the analysis of this table. Especially, the degree of independence, rank, plays a very important role in extracting a probabilistic model from a given contingency table.
1 Introduction

Independence (dependence) is a very important concept in data mining, especially for feature selection. In rough sets [1], if two attribute-value pairs, say [c = 0] and [d = 0], are independent, their supporting sets, denoted by C and D, do not have an overlapping region (C ∩ D = φ), which suggests that an attribute independent of a given target concept may not appear in the classification rule for the concept. Although independence is a very important concept, it has not been fully and formally investigated as a relation between two attributes. In this paper, a contingency table of categorical attributes is focused on and characterized from the viewpoint of granular computing. The first important observation is that contingency tables compare two attributes with respect to granularity or partition. Since the number of values of a given categorical attribute equals the number of blocks of the partition induced by the attribute, a given contingency table compares the characteristics of information granules: an n × n table compares two attributes with the same granularity, while an m × n (m ≥ n) table can be viewed as a projection from m partitions to n partitions. The second important observation is that matrix algebra is a key point of the analysis of this table. If a contingency table can be viewed as a matrix, several operations and ideas can be introduced into the analysis of the contingency table. Among those concepts, the degree of independence, rank, plays a very important role in extracting a probabilistic model from a given contingency table.
2 Contingency Table from Rough Sets

2.1 Preliminaries
In the subsequent sections, the notation introduced in [3] is adopted. Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → V_a for a ∈ A, where V_a is called the domain of a. Then, a decision table is defined as an information system, A = (U, A ∪ {d}). The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ V_a. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For each f ∈ F(B, V), f_A denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows:
1. If f is of the form [a = v], then f_A = {s ∈ U | a(s) = v};
2. (f ∧ g)_A = f_A ∩ g_A; (f ∨ g)_A = f_A ∪ g_A; (¬f)_A = U − f_A.

2.2 Contingency Table
From the viewpoint of information systems, contingency tables summarize the relations between attributes with respect to frequencies. These viewpoints have already been discussed in [6]. However, this study focuses on a more statistical interpretation of this table.
Definition 1. Let R1 and R2 denote multinominal attributes with values A_1, ..., A_n and B_1, ..., B_m, respectively, in an attribute space A. A contingency table is a table of the meanings of the following formulas: |[R1 = A_j]_A|, |[R2 = B_i]_A|, |[R1 = A_j ∧ R2 = B_i]_A|, |[R1 = A_1 ∨ R1 = A_2 ∨ ··· ∨ R1 = A_n]_A|, |[R2 = B_1 ∨ R2 = B_2 ∨ ··· ∨ R2 = B_m]_A| and |U| (i = 1, 2, 3, ..., m and j = 1, 2, 3, ..., n). This table is arranged into the form shown in Table 1, where: $|[R_1 = A_j]_A| = \sum_{i=1}^{m} x_{ij} = x_{\cdot j}$, $|[R_2 = B_i]_A| = \sum_{j=1}^{n} x_{ij} = x_{i\cdot}$, $|[R_1 = A_j \wedge R_2 = B_i]_A| = x_{ij}$, $|U| = N = x_{\cdot\cdot}$ (i = 1, 2, 3, ..., m and j = 1, 2, 3, ..., n).
Table 1. Contingency table (n × m)
      A1   A2   ···  An   Sum
B1    x11  x12  ···  x1n  x1·
B2    x21  x22  ···  x2n  x2·
···   ···  ···  ···  ···  ···
Bm    xm1  xm2  ···  xmn  xm·
Sum   x·1  x·2  ···  x·n  x·· = |U| = N
One of the important observations from granular computing is that a contingency table shows the "counting" relations between the partitions of two attributes. When two attributes have different numbers of partitions, the situation may be a little complicated. But, in this case, thanks to linear algebra, we only have to consider the smaller number of partitions, and the surplus partitions can be projected onto the others. In other words, an n × m matrix or contingency table includes a projection from one attribute to the other.
3 Rank of Contingency Table

3.1 Preliminaries

Definition 2. A corresponding matrix CT_{a,b} is defined as a matrix whose elements are equal to the values of the corresponding contingency table T_{a,b} of two attributes a and b, except for the marginal values.
Definition 3. The rank of a table is defined as the rank of its corresponding matrix. The maximum value of the rank is equal to the size of the (square) matrix, denoted by d.
According to linear algebra, when we have an m × n (m ≥ n) or n × m corresponding matrix, the rank of the corresponding matrix is at most n. Thus:
Theorem 1. If the corresponding matrix of a given contingency table is not square and is of the form m × n (m ≥ n), then its rank is at most n. In particular, the rows r_{n+1}, r_{n+2}, ..., r_m can be represented by:
$$r_k = \sum_{i=1}^{r} k_i r_i \quad (n + 1 \le k \le m),$$
where k_i and r denote the constants and the rank of the corresponding matrix, respectively. This can be interpreted as:
$$p([R_1 = A_k]) = \sum_{i=1}^{r} k_i\, p([R_1 = A_i]).$$
Finally, the relation between rank and independence in a multi-way contingency table can be obtained.
Theorem 2. Let the corresponding matrix of a given contingency table be a square n × n matrix. If the rank of the corresponding matrix is 1, then the two attributes in the given contingency table are statistically independent. If the rank of the corresponding matrix is n, then the two attributes in the given contingency table are dependent. Otherwise, the two attributes are contextually dependent, which means that several conditional probabilities can be represented by linear combinations of other conditional probabilities. Thus,
$$\text{rank} = \begin{cases} n & \text{dependent} \\ 2, \dots, n-1 & \text{contextually dependent} \\ 1 & \text{statistically independent} \end{cases}$$
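Theorem 2 suggests a direct numerical check. The following sketch (Python with numpy; the example matrices are illustrative) classifies a contingency matrix by the rank criterion:

```python
import numpy as np

def dependence_type(ct):
    """Classify a square corresponding matrix by its rank (Theorem 2)."""
    ct = np.asarray(ct, dtype=float)
    r = np.linalg.matrix_rank(ct)
    n = min(ct.shape)
    if r == 1:
        return 'statistically independent'
    if r == n:
        return 'dependent'
    return 'contextually dependent'          # 1 < rank < n

# Rank-1 example: rows are proportional, so the attributes are independent.
print(dependence_type([[2, 4], [1, 2]]))     # statistically independent
print(dependence_type([[3, 0], [0, 5]]))     # dependent (full rank)
```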
4 Conclusion
In this paper, a contingency table is interpreted from the viewpoint of granular computing and statistical independence. From the correspondence between a contingency table and a matrix, the following observations are obtained: in the case of statistical independence, the rank of the corresponding matrix of a given contingency table is equal to 1; that is, all the rows of the contingency table can be described by one row with coefficients given by a marginal distribution. If the rank is maximal, then the two attributes are dependent. Otherwise, some probabilistic structure can be found within the attribute-value pairs of a given attribute. Thus, matrix algebra is a key point of the analysis of a contingency table, and the degree of independence, rank, plays a very important role in extracting a probabilistic model. This paper is a preliminary study on the formal analysis of contingency tables, and the discussions are rather intuitive, not mathematically rigorous. A more formal analysis will appear in future work. Acknowledgement. This work was supported by the Grant-in-Aid for Scientific Research (13131208) on Priority Areas (No. 759) "Implementation of Active Mining in the Era of Information Flood" by the Ministry of Education, Science, Culture, Sports, Science and Technology of Japan.
References
1. Pawlak, Z.: Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.
2. Rao, C.R.: Linear Statistical Inference and Its Applications, 2nd Edition. John Wiley & Sons, New York, 1973.
3. Skowron, A., Grzymala-Busse, J.: From rough set theory to evidence theory. In: Yager, R., Fedrizzi, M., Kacprzyk, J. (eds.), Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. John Wiley & Sons, New York, 1994.
4. Tsumoto, S., Tanaka, H.: Automated Discovery of Medical Expert System Rules from Clinical Databases based on Rough Sets. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining 96, AAAI Press, Palo Alto, CA, pp. 63–69, 1996.
5. Tsumoto, S.: Knowledge discovery in clinical databases and evaluation of discovered knowledge in outpatient clinic. Information Sciences 124, 125–137, 2000.
6. Yao, Y.Y., Zhong, N.: An analysis of quantitative measures associated with rules. In: N. Zhong, L. Zhou (eds.), Methodologies for Knowledge Discovery and Data Mining, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNAI 1574, Springer, Berlin, pp. 479–488, 1999.
The Information Entropy of Rough Relational Databases

Yuefei Sui 1, Youming Xia 2, and Ju Wang 3

1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, [email protected]
2 Department of Computer Science, Yunnan Normal University, Kunming, China, [email protected]
3 Institute of Software, Chinese Academy of Sciences, Beijing, China, [email protected]
Abstract. Beaubouef, Petry and Buckles proposed the generalized rough set database analysis (GRSDA) to discuss rough relational databases. Given any rough relational database (U, A) and an attribute a ∈ A, as in rough set theory, a definition of the lower and upper approximations based on φa is given. The entropy and conditional entropy of similarity relations in a rough relational database are defined. The examples show that the entropy of a similarity relation does not decrease as the similarity relation is refined. It will be proved that given any two similarity relations φ and ψ, defined by a set C of conditional attributes and a decision attribute d, respectively, if d similarly depends on C in a rough relational database then the conditional entropy of φ with respect to ψ is equal to the entropy of φ. Keywords: Rough set; Relational database; Entropy; Dependence.
1 Introduction

Rough set theory, proposed by Z. Pawlak [4], is used successfully in the rough set database analysis of relations, and captures the notion of indiscernibility or ambiguity instead of a fuzzy type of imprecision. Rough set database analysis (RSDA) uses rough set theory to deal with properties, such as functional dependencies and reducts of attributes, in relational databases or decision tables. In the rough relational database model, for every attribute a of a relational database, there is an equivalence relation θ_a on the domain D_a of attribute a. A tuple can have multiple values on some attributes, which is forbidden in an ordinary relational database. By RSDA, we take the multivalues as new values of
The project was partially supported by the National NSF of China and the National 973 Project of China under grant number G1999032701. The first author was partially supported by the National Laboratory of Software Development Environment. The second author was partially supported by Yunnan Provincial NSF grants 2000F0049M and 2001F0006Z.
D_a; equivalently, every attribute a has the domain P(D_a), the power set of D_a, instead of D_a. Beaubouef et al. [1] considered multivalued information systems (rough relational databases), and generalized the ordinary rough set database analysis (RSDA) to the generalized rough set database analysis (GRSDA). The latter does not take the multivalues as new entities in the domains of attributes, as RSDA does. Beaubouef et al. [1] defined the corresponding rough relational operators in rough relational databases as in ordinary relational databases, and Beaubouef et al. [2] defined the rough entropy E(U) of a rough relational database (U, A) and gave an example to show that E(U) decreases as the equivalence relations θ_a are refined. In this paper, based on the GRSDA, we give a definition of the upper and lower approximations in rough relational databases; define the entropy and conditional entropy of similarity relations in a rough relational database; give examples to show that the information entropy of a similarity relation differs from that of an equivalence relation in that the entropy of a similarity relation does not decrease as the similarity relation is refined; and then prove that given any two similarity relations φ and ψ, defined by a set C of attributes and a decision attribute d, respectively, if d similarly depends on C then the conditional entropy of φ with respect to ψ is equal to the entropy of φ in the rough relational database. The results show that there are different kinds of rules in rough relational databases obtained by using similarity relations, and that the information entropy of similarity relations has properties different from those of the entropy of equivalence relations.
2 Approximations and Dependencies in a Rough Relation

A rough relation is a relation such that associated with each attribute a ∈ A is an equivalence relation θ_a on the domain D_a of attribute a. We denote the corresponding partition of θ_a on D_a by P_a: $Z_1^a, ..., Z_{m_a}^a$.
Definition 2.1. A rough relation (U, A) is a subset of $\prod_{a\in A} P(D_a)$ such that for every r ∈ U, every a ∈ A and every 1 ≤ i ≤ m_a, $|r(a) \cap Z_i^a| \le 1$.
Given a rough relation (U, A), for every a ∈ A, θ_a induces a similarity relation φ_a on U: for any r, s ∈ U, r φ_a s if $[r(a)]_{\theta_a} \cap [s(a)]_{\theta_a} \neq \emptyset$, where $[r(a)]_{\theta_a} = \{[b]_{\theta_a} : b \in r(a)\}$. φ_a induces a pseudo partition $P_{\phi_a}$ on U: $Y_1^a, ..., Y_{n_a}^a$, such that each $Y_i^a$ is a φ_a-similarity class, where P is a pseudo partition on U if 1) $\bigcup_{i=1}^{n_a} Y_i^a = U$; and 2) for every 1 ≤ i ≤ n_a, $Y_i^a \not\subseteq \bigcup_{i' \neq i} Y_{i'}^a$. To define the definable subsets as closely as possible to those based on equivalence relations, we say that a subset X of U is a-definable if X is a union of φ_a-similarity classes. Given a set X ⊆ U, the a-upper approximation of X is $\overline{X}^a = \bigcup\{Y : Y \cap X \neq \emptyset,\ Y\ a\text{-definable}\}$, and the a-lower approximation of X is $\underline{X}_a = \bigcup\{Y : Y \subseteq X,\ Y\ a\text{-definable}\}$. The a-lower and a-upper approximations have almost the same properties as in the case when φ_a is an equivalence relation, except that the a-upper or a-lower approximation may not satisfy the following properties:
$$(2.1)\ \underline{X}_a \subseteq \underline{\underline{X}_a}_a; \qquad (2.2)\ \overline{X}^a \subseteq \overline{\overline{X}^a}^a; \qquad (2.3)\ \overline{\overline{X}^a}^a \subseteq \overline{X}^a; \qquad (2.4)\ \underline{\underline{X}_a}_a \subseteq \underline{X}_a.$$
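A minimal sketch (Python, illustrative data) of the a-lower and a-upper approximations defined above, taking the pseudo partition of φ_a as input:

```python
def approximations(classes, X):
    """a-definable sets are unions of similarity classes; the lower (upper)
    approximation is the union of classes contained in (intersecting) X."""
    lower = set().union(*(Y for Y in classes if Y <= X))
    upper = set().union(*(Y for Y in classes if Y & X))
    return lower, upper

# Pseudo partition on U = {1,2,3,4}: Y1 = U - {4}, Y2 = U - {1} (overlapping).
classes = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
print(approximations(classes, {1, 2, 3}))   # ({1, 2, 3}, {1, 2, 3, 4})
```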
Given a rough relation (U, A), let C be the set of conditional attributes and d be the decision attribute. Without loss of generality, we assume that A = C ∪ {d}. We have similarity relations φ_C and φ_d, where φ_C is defined by: for any r, s ∈ U, r φ_C s iff for every a ∈ C, r φ_a s.
Definition 2.2. d similarly depends on C if φ_C is a refinement of φ_d, i.e., for any r, s ∈ U, r φ_C s implies r φ_d s.
Assume that d similarly depends on C in (U, A). Then, associated with every r ∈ U is a rule:
$$\forall x \Big( \bigwedge_{a \in C} [x(a)]_{\theta_a} \cap [r(a)]_{\theta_a} \neq \emptyset \ \Rightarrow\ [x(d)]_{\theta_d} \cap [r(d)]_{\theta_d} \neq \emptyset \Big).$$
B ⊆ A is a similarity reduct of A if φ_B is a refinement of φ_A, and for any B′ ⊊ B, φ_{B′} is not a refinement of φ_A. If d similarly depends on C in (U, A) and B is a similarity reduct of C, then the rule associated with every r ∈ U has the following simplified form:
$$\forall x \Big( \bigwedge_{a \in B} [x(a)]_{\theta_a} \cap [r(a)]_{\theta_a} \neq \emptyset \ \Rightarrow\ [x(d)]_{\theta_d} \cap [r(d)]_{\theta_d} \neq \emptyset \Big).$$
It is not true that φ_C is a refinement of φ_d if and only if every similarity class of φ_C is included in one of the similarity classes of φ_d.
Definition 2.3. d implicatively depends on C if every similarity class of φ_C is included in one of the similarity classes of φ_d.
Similar dependence implies implicative dependence. Assume that d implicatively depends on C. Let X, Y be similarity classes of φ_C and φ_d, respectively, such that X ⊆ Y. Then associated with such a pair (X, Y) is a rule of the form:
$$\forall x \Big( \bigwedge_{a \in C} \bigwedge_{r \in X} [x(a)]_{\theta_a} \cap [r(a)]_{\theta_a} \neq \emptyset \ \Rightarrow\ \bigwedge_{r \in Y} [x(d)]_{\theta_d} \cap [r(d)]_{\theta_d} \neq \emptyset \Big).$$

3 The Entropy of Similarity Relations
Given a rough relation (U, A), we define the entropy of a similarity relation θ on U. Let Y_1, ..., Y_k be the pseudo partition on U induced by θ, and |U| = m. The entropy H(θ) of θ is defined as
$$H(\theta) = \frac{1}{m} \sum_{i=1}^{k} |Y_i| \log_2 |Y_i|.$$
For equivalence relations φ, H(φ) decreases when φ is refined. But for similarity relations this is not true. When θ_0 is an equivalence relation such that |X_i| = 1 for every i, H(θ_0) = 0; this is the least value of H(θ). When θ_1 is an
equivalence relation such that |X_1| = m and |X_i| = 0 for every i ≥ 2, H(θ_1) = log m, which is the largest value of H(θ) if θ is an equivalence relation.
Example 3.1. Let U = {1, ..., m}, with i θ j for every 1 ≤ i, j ≤ m − 1 and i θ j for every 2 ≤ i, j ≤ m. Then the pseudo partition induced by θ is X_1 = U − {m}, X_2 = U − {1}, and $H(\theta) = \frac{2}{m}(m-1)\log(m-1)$. When m > 3, H(θ) > log m. Hence, log m is not the greatest value of H(θ) if θ is a similarity relation.
Corollary 3.2. It is not necessarily true that, given two similarity relations φ and ψ, if φ is a refinement of ψ then H(φ) ≤ H(ψ).
Given two similarity relations φ and ψ on (U, A), we can define the conditional entropy of φ with respect to ψ as follows. Assume that the pseudo partition induced by φ is X_1, ..., X_s, and the one induced by ψ is Y_1, ..., Y_t. Let {Z_1, ..., Z_u} ⊆ {X_i ∩ Y_j : 1 ≤ i ≤ s, 1 ≤ j ≤ t} be such that {Z_1, ..., Z_u} is a pseudo partition on U. The similarity relation corresponding to the pseudo partition {Z_1, ..., Z_u} is the similarity relation φψ: for any x, y ∈ U, x φψ y iff x φ y and x ψ y. Then the entropy H(φ|ψ) is defined by H(φ|ψ) = H(φ, ψ) − H(ψ),
where $H(\phi, \psi) = \frac{1}{m} \sum_{k=1}^{u} |Z_k| \log_2 |Z_k|$. If d similarly depends on C in (U, A), let φ = φ_C and ψ = φ_d; then for every 1 ≤ i ≤ s there is a 1 ≤ j ≤ t such that X_i ⊆ Y_j, namely,
$$H(\phi|\psi) \le \frac{1}{m} \sum_{j=1}^{t} \sum_{i=1}^{s} |X_i \cap Y_j| \log_2 |X_i \cap Y_j| - H(\psi) \le (s-1)H(\psi).$$
Hence, we have the following theorem.
Theorem 3.3. If d similarly depends on C in (U, A), then H(φ|ψ) ≤ (s−1)H(ψ).
Notice that if φ and ψ are two equivalence relations and ψ depends on φ functionally, then H(φ|ψ) = 0. When φ and ψ are similarity relations, we cannot use the conditional entropy H(φ|ψ) to measure the similar or implicative dependency of ψ on φ.
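A minimal sketch (Python, base-2 logarithms as in the definition of H) of the entropy of a similarity relation given its pseudo partition, reproducing the computation of Example 3.1 for m = 5:

```python
from math import log2

def entropy(pseudo_partition, m):
    """H(theta) = (1/m) * sum |Y_i| * log2 |Y_i| over the pseudo partition."""
    return sum(len(Y) * log2(len(Y)) for Y in pseudo_partition) / m

# Example 3.1 with m = 5: X1 = U - {m}, X2 = U - {1} (overlapping classes).
m = 5
U = set(range(1, m + 1))
X1, X2 = U - {m}, U - {1}
print(entropy([X1, X2], m), log2(m))   # 3.2 > log2(5) ~ 2.32, as claimed
```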
References
1. Beaubouef, T., Petry, F.E., Buckles, B.P.: Extension of the relational database and its algebra with rough set techniques. Computational Intelligence 11 (1995), 233–245.
2. Beaubouef, T., Petry, F.E., Arora, G.: Information-theoretic measures of uncertainty for rough sets and rough relational databases. J. Information Sciences 109 (1998), 185–195.
3. Düntsch, I., Gediga, G.: Uncertainty measures of rough set prediction. Artificial Intelligence 106 (1998), 109–137.
4. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, 1991.
5. Polkowski, L., Skowron, A., Zytkow, J.: Tolerance based rough sets. In: T.Y. Lin, A.M. Wildberger (eds.), Soft Computing: Rough Sets, Fuzzy Logic, Neural Networks, Uncertainty Management. Simulation Councils, Inc., San Diego, 1995, 55–58.
6. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27 (1996), 245–253.
7. Yao, Y.Y., Wong, S.K.M., Lin, T.Y.: A review of rough set models. In: T.Y. Lin, N. Cercone (eds.), Rough Sets and Data Mining: Analysis for Imprecise Data. Kluwer Academic Pub., Boston, London, Dordrecht, 1997, 47–75.
A T-S Type of Rough Fuzzy Control System and Its Implementation

Jinjie Huang 1, Shiyong Li 1, and Chuntao Man 2

1 Harbin Institute of Technology, Department of Control Science and Engineering, Harbin, China, Postcode 150001, [email protected]
2 Harbin University of Science and Technology, Department of Automation, Harbin, China, Postcode 150080
Abstract. A new type of rough fuzzy controller and its design method are presented, showing how rough logic can be combined with fuzzy inference. In this approach, rough set theory is used to derive a minimal set of rules from input-output data; by complementing the output-control information corresponding to the rough-reduced rules, a T-S type of rough fuzzy control system is constructed, which solves the problem that the number of rules in a fuzzy controller increases exponentially with the number of variables involved.
Key words: Rough set; rough fuzzy controller; T-S model
1 Introduction

The design and implementation of industrial control systems often rely on quantitative mathematical models. At times, however, we encounter problems for which such models do not exist or are difficult and expensive to obtain. In such cases it is often necessary to observe human experts or experienced operators of the plants or processes and to discover the rules governing their actions for automatic control. Rough set theory provides a methodology for generating rules from empirical data and has been applied successfully to industrial control [2–9,11]. In the simplest case, it is used to record control actions taken by a human operator in the form of decision tables and to derive rules through rough set theory. Such a processor of decision tables with rule selection mechanisms is referred to as a "rough controller" [2]. Many examples [3–5] have been reported, but the results of rough control are coarse. One improved approach is to construct the so-called rough fuzzy controller (RFC), which combines rough logic with fuzzy inference. It is known that the rough set approach can find the minimal sets of rules mapping input variables to output variables, whereas the Takagi-Sugeno fuzzy model can accurately approximate a nonlinear system with a combination of several linear systems. Therefore, if the minimal set of rules acquired by the rough set approach can be used to carry out T-S type fuzzy inference, then not only the number of inference rules but also the number of linguistic variables involved can be reduced greatly. Thus, the advantages of both rough control and the T-S model can be combined, and a new T-S type of rough fuzzy controller is suggested. However, in such a controller, the predecessor of the
326
J. Huang, S. Li, and C. Man
rule may not offer all input variables that contain the output control information, so the output control function, which corresponds to the rule successor must be obtained by information complementing. The T-S type of rough fuzzy controller is also studied by the example of inverted pendulum system and a highly precise control can be achieved. For limit of pages, the simulation results will be presented in another paper.
2 Design of a T-S Type of Rough Fuzzy Control System 2.1 The Minimal Set of Decision Rules Since the rough set theory has a powerful ability to data analyzing, a relatively simple method – grid partition in dimensions, may be used in rough fuzzy inference systems. At the same time, an adaptive grid partition method can be constructed by introducing the fuzzy C-means clustering algorithms. If the intervals of each dimension are symbolized by 1, 2, 3 … respectively, the input-output observing data can be converted into a symbolic decision table. But such an original decision table may contain much redundant and/or conflicting information. The main task of rough set methodology is to simplify a decision table with the degree of dependency between condition and decision attributes unchanged. This includes finding reducts of condition attributes and value-reducts of condition attributes so as to make the number of attributes and the number of values of attributes involved in the decision table as small as possible, hence the problem that the number of rules increases exponentially with the number of variables involved can be solved, and the minimal set of decision rules of high adaptability can be obtained. So far, many efficient heuristic algorithms for reduction of a decision table have been developed. The algorithm for reduction of attributes based on the discernibility matrix and Boolean calculation and an inductive algorithm for value-reduction of attributes [8] are adopted in this paper. 2.2 The T-S Type of Rough Fuzzy Inference with Complementing Information The minimal set of decision rules derived from the observing data with rough set methodology presents some “rough” mapping relations between the input and output space, but the values of input and output variables in the above rules are still symbolic constants associated with intervals and cannot offer the fine numeric relation between the input and output space. Therefore, in order to obtain fine control with the minimal set of rules, it is necessary to do the following two aspects of work further. 2.2.1 Fuzzification of the Symbolic Constants of the Input and Output Variables This is a process that quantifies the symbolic constants associated with intervals of every dimension by proper fuzzy membership functions over the universe of discourse. For the fuzzy c-means clustering algorithm has been used in the partition of input and output spaces and the membership values of an observing data point relative to every clusters have been obtained, so we may just select a proper type of membership function and fit all these membership values. In this design, the psigmf function
A T-S Type of Rough Fuzzy Control System and Its Implementation
327
is chosen and the Gauss-Newton method with Levenberg-Marquardt type adjustment is utilized to estimate the parameters of membership function for each fuzzy cluster. 2.2.2 Information Complementing of T-S Models The number of input variables, k, contained in the predecessor of the decision rule after reduction by rough set methodology may be less than the total number of input variables m of the system. That is to say, for a MISO system, there exists a reduced decision rule such that
R i : if x j1 is A ij1 and x j 2 is Aij 2 L and x jk is A ijk
then y i is B i
(1)
where k<m. According to the definition of T-S fuzzy model [7], the output function corresponding to this rule may be represented as
y i = a0i + a1i x j1 + L + a ki x jk
(2)
Now the question is whether Eq. (2) can fully and validly express the output control i i information corresponding to the rule R . In fact, the only basis of deriving rule R by rough set methodology is to make its predecessor capable of distinguishing its successor from successors of other rules. The predecessor of the rule may not offer all input variables that contain the output control information. In other words, some variables related to the output control information in input space may not appear in the predecessors of the rules; they have been eliminated in the reduction process. Therefore, in order to establish the complete numeric relations between successors and predecessors of the rough rules, the variables omitted in predecessors have to be complemented. i Assuming a MISO system, for a rough reduced rule R like Eq. (1),
y ci = a ki +1 x jk +1 + a ki + 2 x jk + 2 + L + a mi x jm
(3)
i
is called the complementing output information of R ; and
y ri = a0i + a1i x j1 + L + aki x jk
(4)
i
is called the rough output information of R . Then, the complete output control infori mation of a rough reduced rule R can be expressed as
y i = y ri + y ci = a 0i + a1i x j1 + L + a ki x jk + a ki +1 x jk +1 + L + a mi x jm
(5)
In order to weaken the adverse effect of noise and error data, a robust regression technique[10], which uses an iteratively re-weighted least squares algorithm with the weights at each iteration calculated by applying the bisquare function to the residuals from the previous iteration, is employed to estimate the coefficients of the output Eq. (5). Thus, the T-S type of rough fuzzy controller is obtained and can be used to carry out fuzzy inference for new input data.
328
J. Huang, S. Li, and C. Man
3 Conclusion The characteristics of the T-S type of rough fuzzy control system lie in: (1) The number of rules and the time-consuming computation of predecessormatching of rules are greatly reduced and hence the “dimensional disaster” problem is effectively avoided by using rough set theory to find out the minimal set of rules directly from the observing data of input-output space for complex processes which are too difficult to be mathematically modeled easily. (2) By complementing the output control information corresponding to the reduced rules, the T-S type of rough fuzzy control system is constructed that uses several local linear controllers to approximate the nonlinear control laws so as to achieve the highly precise control. Acknowledgements. The authors would like to thank the anonymous reviewers for their valuable comments. It is a regret that only due to the limit of pages, some of the suggestions were not fulfilled in this paper.
References [1] Pawlak Z. Rough set: Theoretical Aspects of Reasoning about Data, Boston, London, Dordrecht: Kluwer Publishers, 1991 [2] Mrozek A., Plonka L., Kedziera J., The methodology of rough controller synthesis, In: Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, 1996., 8–11 Sept. New Orleans, LA, USA 1135–1139 [3] Mrozek A., Rough sets in computer implementation of rule-based control of industrial process, In: R. Slowinski (Ed.), Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Boston, 1992, 19– 31 [4] Lamber-Torres G., Application of rough sets in power system control center data mining, in: Power Engineering Society Winter Meeting, 2002. IEEE, 27-31 January 2002, Volume: 1, 627–631 [5] Planka L. and Mrozek A., Rule-based stabilization of the inverted pendulum, Computational Intelligence, 1995, 11(2), 348–356 [6] Czogala E. et al, Idea of a rough fuzzy controller and its application to the stabilization of a pendulum-car system, Fuzzy Sets and Systems, 1995,72(1): 61–73 [7] Takagi T. and Sugeno M., Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Syst., Man, Cybern., vol. SMC-15, pp. 116–132, Jan./Feb. 1985. [8] Wang Guo-Yin, Rough Set Theory and Knowledge Discovery, XiAn Jiaotong University Press, 2001.5, 117–152 [9] Donald W. Marquardt, An algorithm for least squares estimation of nonlinear parameters, Journal of the Society of Industrial and Applied Mathematics, 1963, 11:431–441 [10] Street, J.O., R.J. Carroll, and D. Ruppert, A note on computing robust regression estimates via iteratively reweighted least squares, The American Statistician, 1988,Vol.42, 152–154 [11] Peters J. .F., A. Skowron, Z. Suraj, An application of rough set methods to automatic concurrent control design. Fundamenta Informaticae, vol.43, Nos. 1–4, 2000, 269–290.
Rough Mereology in Knowledge Representation Cungen Cao1 , Yuefei Sui1 , and Zaiyue Zhang2 1
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {Cgcao, Yfsui}@ict.ac.cn. 2 Department of Computer Science, Technology College, Yangzhou University, Yangzhou, Jiangsu, China [email protected].
Abstract. The rough mereology proposed by Polkowski and Skowron is used in complex systems, multi-agent systems, knowledge engineering and knowledge representation. The function µ makes the rough mereology more like the fuzzy mereology. A new rough mereology is proposed, in which the rough inclusion is defined completely based on the upper and lower approximations of rough sets. The basic properties of the rough mereology, and applications in the knowledge representation and knowledge engineering are discussed. Keywords: Rough Set Theory, Mereology, Parts, Wholes.
1
Introduction
The rough set theory was proposed by Pawlak [3] in the earlier eighties of the last century. The rough set theory is assumed to be complemental to the fuzzy systems in the following sense: the former is used to describe the indiscernibility of certain objects, as one kind of the uncertainty in artificial intelligence, and the latter is used to describe the imprecision of the objects, as another kind of the uncertainty in artificial intelligence. Therefore, the structures based on the rough set theory should be different from those based on the fuzzy theory. Mereology is a theory of parts in the formal ontology, discussed in philosophy, and used in knowledge representation, knowledge reasoning and spatial reasoning [1]. The classical mereology [6] assumes that given an object x, its parts form a Boolean algebra under the partial order of the part-of relation, with the sum and minus. The rough mereology was proposed by Polkowski and Skowron [7],[8], in which the part-of relation ≤ is replaced by the rough part-of relation , and the rough part-of relation is something looking like a fuzzy relation. They used the rough mereology in complex systems and multi-agent systems to decompose the tasks [7], [8].
The project was partially supported by the National NSF of China and the National 973 Project of China under the grant number G1999032701. The second author was partially supported by the National Laboratory of Software Development Environment.
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 329–333, 2003. c Springer-Verlag Berlin Heidelberg 2003
330
C. Cao, Y. Sui, and Z. Zhang
The main problem of combining the rough set theory with mereology is how to define the basic relations, such as the rough inclusion, the rough sum and rough minus. The main ingredient in the rough mereology proposed by Polkowski and Skowron is the function µ satisfying the certain conditions which make µ like the degree function of membership in the fuzzy set theory. To make the rough mereology different from the fuzzy mereology, we define the rough inclusion based on the upper and lower approximations, and such a definition has several choices. We apply the rough mereology to discuss the backbone of the concepts in the knowledge engineering, and discuss the further theoretical properties of the rough mereology in the knowledge representation.
2
The Rough Inclusion
Classical or General Extensional Mereology introduces a part-of binary relation ≤ as an ordering relation on certain domain satisfying the following principles: • Extensionality: Two individuals are identical if and only if they have the same parts; • Principle of Sum: There always exists the individual composed by any two individuals of the theory, i.e., the mereological sum; and • Supplementation: Given two individuals x and y, if x ≤ y then a different individual z exists which is the missing part from y. Fix a universe U. The part-of relation ≤ in mereology is a partial order, i.e., ≤ is reflexive, anti-symmetric and transitive. The main ingredient in the rough mereology proposed by Polkowski and Skowron is the function µ satisfying the following conditions: for any individuals x, y, z ∈ U, (2.1) µ(x, y) ∈ [0, 1]; (2.2) µ(x, x) = 1; (2.3) µ(x, y) = 1 → [µ(z, x) ≤ µ(z, y)], and (2.4) ∃x ∀x(µ(x, x ) = 1). The functions µ satisfying (2.1)-(2.4) are called rough inclusions. µ(x, y) can be |x ∩ y| defined by µ(x, y) = in case x = ∅. |x| Let us consider how to define a fuzzy mereology. First of all, we should define a fuzzy inclusion, and a function which is the degree of the inclusions and should satisfy the same conditions as (2.1)-(2.4). Therefore, the rough mereology proposed by Polkowski and Skowron should be called the fuzzy mereology, because it pays attention only to the fuzzyness of the inclusions, without paying attention to the indiscernibility of individuals. Assume that there is an equivalence relation φ on U. There are many choices of defining the rough inclusion. Let U, L and E be the operators on 2U defined as follows: for any X ⊆ U, U(X) = X;
L(X) = X;
E(X) = X,
where X = {x ∈ U : [x]φ ∩ X = ∅}, X = {x ∈ U : [x]φ ⊆ X}, and [x]φ is the φ-equivalence class containing x. Let Q, Q ∈ {U, L, E}. We define the rough inclusion ⊆QQ with respect to Q, Q : for for any X, Y ⊆ U,
Rough Mereology in Knowledge Representation
331
X ⊆QQ Y ⇔ Q(X) ⊆ Q (Y ). Hence, there are nine rough inclusions, and only one of them is not transitive. Proposition 2.1. ⊆LU is the only one which is not transitive. X is roughly included in Y (denoted by X φ Y or simply by X Y ) if X ⊆ Y and X ⊆ Y . X is φ-equivalent to Y, denoted by X ≡φ Y or simply by X ≡ Y , if X = Y and X = Y . Proposition 2.2. is a partial order on the power set of U. We can define functions ρ, ν to measure the degree of rough inclusions: given any two subsets X and Y such that X is roughly included in Y, |B(X) ∩ B(Y )| , |B(X) ∪ B(Y )| 1 |X ∩ Y | |X ∩ Y | ν(X, Y ) = + , 2 |X ∪ Y | |X ∪ Y | ρ(X, Y ) =
in case |B(X) ∪ B(Y )| = 0, or |X ∩ Y | = 0, where B(X) = X − X. ρ and ν describe the degree of the inclusion between two subsets. Proposition 2.3. ρ and ν satisfy (2.1)-(2.4). Theorem 2.4. Given two equivalence relations φ, ψ on U, if φ is a refinement of ψ then for any X, Y ⊆ U, (2.5) X φ Y → X ψ Y and (2.6) X ≡φ Y → X ≡ψ Y .
3
The Rough Mereology Based on the Rough Inclusion
We use the rough inclusions as the rough part-of relation in rough mereology, instead of using the fuzzy function, to denote the degree to which one part is a part of a whole. Assume that U is a universe of objects (individuals) such that every individual in U is a subset of U . Let φ be an equivalence relation on U. The choice of φ depends on applications. Given two individuals x and y, we say that x is a rough part of y (denoted by x y) if x yφ . With the relation , we define the identity relation ≡ . Definition 3.1. Given two individuals x, y ∈ U, x is φ-identical to y, denoted by x ≡φ y (simply by x ≡ y), if x y and y x. Theorem 3.2. For any individuals x, y ∈ U, x ≡ y iff every rough part of x is a rough part of y, and visa verse. I.e., x ≡ y iff for any z, z x ⇔ z y. We define the sum and meet of two individuals x and y as follows: x y = x ∪ y,
x y = x ∩ y.
It is easy to verify that xy and xy are the least upper bound and the greatest lower bound of x and y under , respectively.
332
C. Cao, Y. Sui, and Z. Zhang
The rough mereology can be used in knowledge representation. Consider the rough IS-A relation. First of all, the indiscernibility occurs in our everyday life. In philosophy, there is a debate about whether given any two objects there is a property distinguishing them. In common sense, everyone distinguishes only small portion of things and objects. The rough backbone: Let H be a taxonomy of concepts which are denoted by C, D... Every concept C ∈ H has the extent, a set of instances of C, denoted by E(C) ⊆ U, and the intent, a set of attributes or properties, denoted by I(C), such that p(x) is true for any instance x ∈ E(C) and any property p ∈ I(C). Given two concepts C and D, C subsumes D (denoted by D C) if E(D) ⊆ E(C). Given two concepts C and D, C roughly subsumes D (denoted by D C) if E(D) E(C). We use the Hφ to denote the rough backbone resulted by the above definition, where ≡φ is taken as the equality. Proposition 3.3. For any C, D, D ∈ H, (3.1) C C; (3.2) C D&D C → C ≡φ D; (3.3) D D&D C → D C; and (3.4) D C → D C; A simple application: Let H be a backbone which contains almost every concepts. If only concepts in biology are needed then we can set φ to be an identity relation on the biological individuals and a total relation on the other individuals, and Hφ is a taxonomy of the biological concepts, in which other concepts are contracted into one node on the taxonomy tree. Here, φ on U could be the indiscernibility of attribute values of the individuals. We assume that there is a set A of attributes and every attribute a has a domain Da such that every individual x has a value x(a) ∈ Da . For every attribute a, assume that there is an equivalence relation φa on Da . We define φ as follows: for any x, y ∈ U, xφy ⇔ ∀a ∈ A(x(a)φa y(a)). φ is an equivalence relation.
References 1. A. Artale, E. Franconi, N. Guarino and L. Pazzi, Part-whole relations in objectcentered systems: an overview, Data & Knowledge Engineering(1996). 2. S. Lesniewski, Foundations of the general theory of sets, in: Surma, Srzednicki, Barnett, Rickey (eds.), Stanislaw Lesniewski, Collected Works, Kluwer, Dordrecht, 1992. 3. Pawlak, Z., Rough sets – theoretical aspects of reasoning about data, Kluwer Academic Publishers, 1991. 4. L.Polkowski and A.Skowron, Rough mereology, in: Lecture Notes on Artificial Intelligence, vol. 869, Springer-Verlag, Berlin, 1994, 85–94. 5. L. Polkowski and A. Skowron, Rough mereology: a new paradigm for approximate reasoning, International J. Approximation Reasoning15(1996), 333–365.
Rough Mereology in Knowledge Representation
333
6. P. Simons, Parts: A Study in Ontology, Clarendon Press, Oxford, 1987. 7. A.Skowron and L.Polkowski, Rough mereological foundations for design, analysis, synthesis, and control in distributed systems, Information Sciences 104(1998), 129– 156. 8. B.Smith, Boundaries: an essay in mereotopology, in L.Hahn(edt.): The Philosophy of Roderick Chisholm (Library of Living Philosophers), LaSalle: Open Court, 1997, 534–561.
Rough Set Methods for Constructing Support Vector Machines Yuancheng Li1 and Tingjian Fang2 1
Department of Automation, University of Science and Technology of China. JinZhai Main Road, 230026, HeFei, P.R.China [email protected] 2 Institute of Intelligent Machines, Academia Sinica. 230031, HeFei, P.R.China [email protected]
Abstract. Analyzed the generalities and specialties of Rough Sets Theory (RST) and Support Vector Machines (SVM) in knowledge representation and process of regression, a minimum decision network combining RST with SVM in intelligence processing is investigated, and a kind of SVM information process system on RST is proposed for forecasting. Using RST on the advantage of dealing with great data and eliminating redundant information, the system reduced the training data of SVM, and overcame the disadvantage of great data and slow training speed. The experimental results proved that the presented approach could achieve greater forecasting accuracy and generalization ability than the BP neural network and standard SVM.
1 Introduction Along with the rapid development of nonlinear theory and artificial intelligent technologies, some soft-computing methods have been as powerful tools for analysis and forecast financial market, for example, Rough Set Theory (RST) and Support Vector Machines (SVM). Rough Set theory (RST) generally deals with discovering properties and interconnections in our knowledge usually represented as so called information system (or decision table). Methods based on that theory give us the ability to find some rules in our data and optimize them. Analyzed the generalities and specialties of RST and SVM in knowledge representation and process of regression, a minimum decision network combining RST with SVM in intelligence processing is investigated. Using RST on the advantage of dealing with great data and eliminating redundant information, a kind of SVM information process system on RST is proposed for forecasting, the system reduced the training data of SVM, and overcame the disadvantage of great data and slow training speed. Finally the paper contains some experimental results. Those exemplary experiments show the situations when the proposed methods are promising and can achieve greater forecasting accuracy and generalization ability than the BP neural network and standard SVM.
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 334–338, 2003. © Springer-Verlag Berlin Heidelberg 2003
Rough Set Methods for Constructing Support Vector Machines
335
2 Rough Set Theory Information system is a pair A=(U,A), where U is a nonempty, finite set of objects called the universe and A is a nonempty, finite set of attributes i.e. a:U Æ Va for a Œ A , where Va is usually called a decision value set of a. An information system A=(U, A » {d } ), where d œ A is usually called a decision table. The elements of A we call the conditional attributes and d is called decision. Indiscernibility relation IND(B) is defined for any subset B Õ A ( B Õ A » {d } ) as follows: IND(B)= ( x , y ) ŒU ¥ U : for every a Œ B a ( x ) = a ( y ) (1)
{
}
Reduct is any subset B Õ A ( B Õ A » {d } ) such that IND(B)= IND(A) and IND(B-{a}) π IND(A) for any a Œ B . Hence it is enough to consider only the attributes from the reduct to distinguish objects in U. By RED(A) we denote the set of all reducts of A. A minimal reduct of A is an element of RED(A) minimal in order of cardinality. Core of A is the set CORE(A)= B . (2)
I
B ŒRED( A )
Attributes from CORE(A) cannot be eliminated from system without decreasing quality of object classification. CORE(A) can be empty if there exist two or more disjoint reducts of A. A subset B of the attribute set A is a relative reduct of a (consistent) decision table A=(U, A » {d } ) if B is minimal (with respect to the inclusion) set sufficient to discern objects from different decision classes (objects with different decision values). The set of all relative reducts in A is denoted by RED(A,d). Decision table B=(U’, A » {d } ) is called a subtable of A=(U, A » {d } ) if U ’Õ U . Let A=(U, A » {d } ) be a decision table and F be a family of subtables of A. By DR(A,F) we denote the set RED(A,d) « RED( B, d ) . (3)
I
B ŒF
Any element from DR(A,F) is called F-dynamic reduct of A. The above notions can also be extended to so called (F,e)-dynamic reducts that are even more flexible, and capable to deal with newcoming objects.
3 Theory of SVM in Regression Regression approximation addresses the problem of estimating a function based on a given set of data
G = {( xi , d i )}i =1 xi is the input vector, d i is the desired value, n
which is produced from the unknown function. SVM approximate the function in the following form: y = f ( x ) = wϕ ( x ) + b (4) Where
ϕ (x )
is the high dimensional feature space that is nonlinearly mapped
from the input space
x . The coefficients w and b are estimated by minimizing
336
Y. Li and T. Fang
Rsvm (c ) = c
1 n 1 Lε (d i , yi ) + w ∑ n i =1 2
2
d − y ≤ε
0 Lε (d , y ) = d − y − ε
otherwise
(5)
(6)
To obtain the estimations of w and b, Eq.(5) is transformed to the primal function given by Eq.(7) by introducing the positive slack variables Minimize
(
)
RSVM w, ξ (∗) =
Subjected to
ξi
and ξ i as follows: ∗
(
n 1 2 w + c∑ ξ i + ξ i∗ 2 i =1
wϕ ( xi ) + bi − d i ≤ ε + ξ i∗
d i − wϕ ( xi ) − bi ≤ ε + ξ i
)
ξ i∗ ≥ 0 ξ i ≥ 0
(7)
Finally, by introducing Lagrange multipliers and exploiting the optimality constraints, the decision function given by Eq.4 has the following explicit form:
(
)
n
(
)
f x,α i ,α i∗ = ∑ α i − α i∗ K ( x, xi ) + b i =1
(8)
4 The Proposed Approach In this section, a kind of SVM information process system on RST is proposed for forecasting, the system reduced the training data of SVM, and overcame the disadvantage of great data and slow training speed. In this system RST was used for preprocessing data. In other word, RST was as the first step in construction of the hybrid system. In the following step, SVM was used for forecasting tools. A forecasting system combining SVM with RST was as follows:
5
Experiment
Prediction of the economic indicators of a market is very crucial. In the following, we use the presented method in this paper for the financial forecasting problem. MSE errors was used and calculated as follows:
1 n ∑ ( yt − yˆ t )2 n t =1 Where yˆ t is the following value for y t . MSE =
(9)
Rough Set Methods for Constructing Support Vector Machines
Fig. 1. Scheme of approach
337
Fig. 2. Forecast accuracy of Stock index
ShangHai daily stock prices data are used as financial forecast application. Training data are sampled from June 15,1998 to October 9,1999. Test data are from November 19,2000 to February 15,2001. A Matlab implementation of the SVM based on RST (RSVM) is used. In addition to this, RSVM regression is compared with other techniques such as backpropagation neural network (BP) and standard SVM (SVM). The performance of all kinds of methods for the test data set is shown in the Figure 2. There are several issues that we need to consider in the presented method. First of all, we need to determine some parameters before running the particular algorithm. These parameters are ε , C and kernel parameters. In this paper, we set to ε =0.01, C=10000, σ =5, we get better results in terms of the prediction. From the Figure 2, we can conclude that the forecast accuracy of presented method was more than the other two methods, the results of experiments proved that RSVM is an effective and promising method.
6 Conclusions This paper presents a kind of SVM information process system on RST for forecasting. Using RST on the advantage of dealing with great data and eliminating redundant information, the system reduced the training data of SVM, and overcame the disadvantage of great data and slow training speed. The experimental results proved that the presented approach could achieve greater forecasting accuracy than traditional method.
338
Y. Li and T. Fang
References 1. Bazan J.G., Skowron A., Synak P., Dynamic reducts as a tool for extracting laws from decision tables, Proceedings of ISMIS’94. Lecture Notes in Artificial Intelligence, SpringerVerlag. Berlin (1994) 346–355 2. Pawlak Z., Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dortrecht (1991) 3. Szczuka M.S., Applications of Rough Sets Methods in Neural Network Processing, ICS WUT Research Report (1993) 4. Mukherjee S, Osuna E, Girosi F., Nonlinear prediction of chaotic time series using support vector machines. Proceedings of IEEE NNSP ’97, Amelia Island,FL (1997) 5. Smola, A. and B. Scholkopf, A tutorial on support vector regression, NeuroCOLT Tech. Rep.TR 1998-030,Royal Holloway College, London, U.K (1998)
The Lattice Property of Fuzzy Rough Sets 1
2
Fenglan Xiong , Xiangqian Ding , and Yuhai Liu
1
1
2
Department of Mathematics, Ocean University of China, Qingdao 266003, China Information Engineering Center, Ocean University of China, Qingdao 266003,China Abstract. In this paper we modify the result in [2], discuss the lattice property of fuzzy rough sets and introduce lattice isomorphism from the equivalence class of fuzzy rough sets to intuitionistic fuzzy sets. The result of this paper extends the corresponding conclusion in [2] and will be applied in the area of computer science.
1
Introduction
Throughout this paper, L denotes a complete lattice with 0,1(0 ≠ 1) and with an order-reversing involution “ ’ ”[1]. U denotes a nonempty set. Dogan [2] has defined the order relation, the union and the meet operation between the fuzzy rough sets in X , and the definition of “ − ” operation is:
A : ( AL , AU ) ,
where µ A
L
( x) = ( µ AU ( x))’ ( x ∈ X U ) , µ AU ( x) = ( µ AL ( x))’
( x ∈ X L ) . We think that the “ − ”operation defined in [2] is improper. Obviously, the definition domain of
µ A (x) is not X U U
but
µ A (x) L
is not X L but
X U , the definition domain of
X L . So A is no longer the fuzzy rough set on ( X L , X U ) ,
thus “ − ”isn’t the operation of the fuzzy roughs set on
X . To define the
complementary operation on the fuzzy rough sets, we don’t focus on some given fuzzy rough sets E ( X ) on X , but on its extension sets i.e. E . In this paper, we define the equivalence relation ~ on E , and define order relation, union, meet operation and “ − ” on its equivalence class set E ~ . In view of isomorphism insertion, we may consider E ( X ) as the subset of and meet operations defined on
2
E ~ . The order relation, union
E ~ are the extension of those defined in [2].
Lattice Properties of Fuzzy Rough Set
Definition 1: Let X L , X U where
∈ P(U ) , X L ⊆ X U , A={µ(x)(x∈XL ),v(x)(x∈XU )}
µ (x) : X L → L , v( x) : X U → L ,
and for
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 339–341, 2003. © Springer-Verlag Berlin Heidelberg 2003
any
x ∈ X L , we have
340
F. Xiong, X. Ding, and Y. Liu
µ ( x) ≤ v( x) .
X L , A = {x | x ∈ X L µ ( x) > 0} , X U , A Denote
X = ( X L , X U ) . Let = {x | x ∈ X U v( x ) > 0} .
Then A is called the fuzzy rough set on
R = {( X L , X U ) | X L , X U ∈ P(U ), X L ⊆ X U } Let X = ( X L , X U ) ∈ R ,
E ( X ) be the set of all the fuzzy rough sets on X and denote E =
U E( X ) .
X ∈R
Let A , B ∈ E , A = {µ ( x)( x ∈ X L ), v( x)( x ∈ X U )} and
Definition 2
B = {α ( x)( x ∈ YL ), β ( x)( x ∈ YU )} then we define : A ~ B ⇔ X L, A = YL , B , X U , A = YU , B and they also satisfy the following properties: If X L , A ≠ φ , then
µ ( x) = α ( x) (∀x ∈ X L , A ) and If X U , A ≠ φ , then v( x) = β ( x)(∀x ∈ X U , A ) . ~ is he equivalence relation on E . We denote [ A] as the equivalence class of A , E ~ stands for all sets of the equivalence class on E . Definition 3: A, B are the same in the definition 2, we define: [ A] ≤ [ B ] ⇔ X L , A ⊆ YL , B X U , A ⊆ YU , B , and they satisfy the following properties: Obviously,
X L , A ≠ φ , then µ ( x) ≤ α ( x) (∀x ∈ X L , A )
1) If
2) If X U , A
≠ φ , then v( x) ≤ β ( x) (∀x ∈ X U , A ) .
] ∈ E ~ (∀i ∈ I ) , Ai = {µ i ( x) ( x ∈ X Li ) , v i (x) ( x ∈ X Ui )} , M L = ∪ X Li , M U = ∪ X Ui . We define ∪[ A i ] = [ A] , where
Definition
Let [ A
4:
i
i∈I
i∈I
i∈I
A = {∨ µ~ i ( x) ( x ∈ M L ) , ∨ v~ i ( x) ( x ∈ M U )} . We also define ∩[ A i ] = [ B] i∈I
i∈I
i∈I
~ i ( x) ( x ∈ M ) , v~ i ( x) ( x ∈ M )} . µ~ i ( x) is defined as where B = {∧ µ L U ∧ i∈I
follows: If
i∈I
M L = X Li , then µ~ i ( x) = µ i ( x) (∀x ∈ M L ) ;If M L − X Li ≠ φ ,
µ i ( x), x ∈ X Li i i . v% ( x ) is defined as follows: If M U = X U , ( x) = i 0, x ∈ M L − X L vi (x), x ∈ XUi i i i i then v% ( x ) = v ( x )(∀x ∈ M U ) ; If M U − X U ≠ φ , then v% ( x) = . i 0, x ∈ MU − XU i Theorem 1 A (i ∈ I ) and A are the same in the definition 4, then ~ then µ
i
U[ A ] = ∨[ A ] and I[ A ] = ∧[ A ] . i
i∈I
i
i∈I
Let σ
i
i∈I
i
i∈I
: E ( X ) → E ~ satisfying σ ( A) = [ A] for each A ∈ E ( X ) . Then σ is injective mapping from E ( X ) to E ~ reserving union and non-empty meet Theorem 2
The Lattice Property of Fuzzy Rough Sets
341
5 Let A be the same as definition 1, we define f ( A) = {µ f ( A) ( x) ( x ∈ U ) , v f ( A) ( x) (x∈U)},where µ f ( A) is defined as follows:
Definition If
X L = U , then
µ f ( A) = µ ( x)
µ ( x), x ∈ X L µ f ( A) ( x) = . v f ( A) ( x ) is 0, x ∈ U − X L
(∀x ∈ U ) ; If U − X L ≠ φ , then defined as follows: If
X U = U , then
v( x), x ∈ X U v f ( A) = v( x) (∀x ∈ U ) ; If U − X U ≠ φ , then v f ( A) ( x) = . 0, x ∈ U − X U Definition 6 Let A ∈ E (U ) A = {µ ( x) ( x ∈ U ) , v( x ) ( x ∈ U )} . We define A = {v ’( x) ( x ∈ U ) , µ ’( x) ( x ∈ U )} ∈ E (U ) .
~ , define “ − ”: E ~ → E ~ Where [ A] = f ( A) Theorem 3 “ − ” is order-reversing involution on E ~ . Definition 7
Let [ A] ∈ E
( E ~ , ≤) is complete lattice with order-reversing involution “ − ”. All the intuitionistic fuzzy sets [2,3] on U are denoted by K (U ) . Theorem 5: F is bijective mapping from E ~ to K (U ) reserving meet, union and “ − ”, where F : E ~ → K (U ) , ∀[ A] ∈ E ~ , F ([ A]) = {µ f ( A) ( x ) ( x ∈ U ) , Theorem 4
v’f ( A) ( x) ( x ∈ U )} ∈ K (U ) .
References 1. Guo-jun Wang, Ying-Yu He, Intuionistic fuzzy sets and L-fuzzy sets, Fuzzy Sets and Systems 110(2000) 271–274. 2. Dogan Coker, Fuzzy rough sets are intuitionstic L-fuzzy sets, Fuzzy Sets and Systems 96(1998) 381–383. 3. K. Atanassov, Intuitionstic fuzzy sets, Fuzzy Sets and Systems 20(1986) 87-96. 4. K. Atanassov, new operations defined over the intuitionstic fuzzy sets, Fuzzy Sets and Systems 61(1994) 137–140.
Querying Data from RRDB Based on Rough Sets Theory* 1,3
2
1
1
Qiusheng An , Guoyin Wang , Junyi Shen , and Jiucheng Xu 1
College of Electron and Communication,Xi’an Jiaotong University, 710049 Xi’an,China 2 Institute of Computer Science and Technology, ChongQing University of Posts and Telecommunications, 400065 Chongqing, China 3 Shanxi Teachers University, 041004 Linfen,China [email protected]
Abstract. Based on RRDM(Rough Relational Database Model),the querying theory of RRDB(Rough Relational Database) is analyzed from decomposition principle, projection principle and the definability of RRDB. We divide rough data querying into three types: crisp querying, rough complete querying and rough combinatorial querying. In addition, we discuss the rough data querying from the three aspects and do some computational simulation to obtain the results that are good agreement with the conclusion in this paper.
1
Introduction
RRDM was introduced by Beaubouef Theresa and Frederick E.Petry. Sui and Wang discussed several different definitions of lower approximation and upper approximation of rough relation, and discussed the entropy rules. Up to now, there are only some theoretical research on RRDB and fewer applications. The original intention of RRDM was to improve the querying performance on database, so we study rough data querying systematically in this paper [1] [2].
2 Basic Concepts The RRDB has several features in common with the ordinary relational database. A tuple ti has the form (di1, di2, … , dim), where dij is a value of a particular domain set Dj. In ordinary relational database,dij∈Dj. In the rough relational database, dij ⊆ Dj, and dij does not have to be a singleton, dij ≠ φ , let P(Di) denotes the powerset of Di- φ [1]. * This paper is partially supported by NSF 60173058 and 69803014 of P.R.China.
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 342–345, 2003. © Springer-Verlag Berlin Heidelberg 2003
Querying Data from RRDB Based on Rough Sets Theory
343
Definition 1. An interpretation α=(a1, a2, … , am) of a rough tuple ti=(di1,di2,…,dim) is any value assignment such that aj ∈ dij for all 1 ≤ j ≤ m,aj is called a sub-interpretation of dij. Definition 2. A rough relation is a subset of the set cross product P(D1) × P(D2) × …P(Dm).
3 Basic Theory of Querying for RRDB 3.1 Decomposition Principle Definition 3. To every multi-value for a RRDB, extracting one of its subinterpretation to replace itself, the rest of single-value doesn’t change, the result is called a single-value decomposition of the RRDB. Assume that a multi-value is r(ai), its single-value decomposition is s(ai), we have s(ai) ∈ r(ai). Definition 4. To every multi-value for a RRDB, extracting one of its subset to replace itself, and the rest of multi-value doesn’t change, the result is called a multi-value decomposition of the RRDB. Assume that a multi-value is r(ai),its multi-value decomposition is s(ai),we have s(ai) ⊆ r(ai).We use Γ stands for the decomposition operator. 3.2 Projection Principle To a rough relation, the rough projection of X onto B, Π B(X) be a relation Y with schema Y(B), where Π B(X)={t(B)|t ∈ X}. 3.3 The Definability of RRDB Definition 5. Let S=(U, A, Di) be a RRDB, and to an attribute a ∈ A, the common relation onto U is θ a, where r θ as ⇔ r(a) ∩ s(a) ≠ φ , given an attribute subset B ⊆ A, the common relation θ B is ∩a∈Bθa onto B [3]. Definition 6. Let θ a be the common relation onto RRDB, and to an arbitrary u ∈ U, its common class is [u]θa ={v ∈ U|v(a) ∩ u(a) ≠ φ }. Definition 7. To a RRDB, let X be the result set of rough data querying. When X can be expressed by these tuples of R, we call X is accurately definable in R, R X= R X={ri|ri ∈ R ∧ ri ∈ X,1 ≤ i ≤ |U|}, when X can be expressed by these tuples of R, and can’t be expressed accurately, we call X is rough definable in R, we have R X={ri| ∃ i(ri ∈ R) ∧ |ri(a)| ≥ 1 ∧ ri(ai) ∩ C ≠ φ ,1 ≤ i ≤ |U|,C ∈ X}, R X={ri|ri ∈ R ∧ |ri(aj)|= 1 ∧ ri ∈ X,1 ≤ i ≤ |U|,1 ≤ j ≤ |A|},where ri denotes any tuples of R, and |ri(aj)| denotes the number of sub-interpretation , C is an attribute value of X.
344
Q. An et al.
4
The Rough Data Querying from RRDB
4.1
Crisp Querying
For RRDB, crisp querying search for these records that fully matches the querying conditions. Firstly, we analyze its equivalence classes (containing the attribute values that having same semantics), and turn the RRDB into a subtable of original RRDB according to equivalence classes. Secondly, analyze the inborn subtable, and use the SQL to query the data, if it is a single-value decomposition. Use rough relation operations to finish the last task, if it is a multi-value decomposition. For example, the query to Table 1 that returns the IDs of those Subregions that containing US. “US” and “USA” belong to the same equivalence class, so above demand can denotes as follows: Π 1,2(σCOUNTRY=“US”or COUNTRY=“USA” (Subregions)), the result by crisp querying is equal to {U123, U124, U125, U126, U147} [1]. Theorem 1. The result of crisp querying is the minimal set that satisfies querying conditions, X= R X= R X={ri|ri ∈ R ∧ ri ∈ X ∧ ri(a)=C,1 ≤ i ≤ |U|},where ri(a) is one of attribute values, and C is the data that user want to query. 4.2
Rough Complete Querying
Rough complete querying finds these records that satisfies all possible matching with querying conditions. For example, the query from Table 1 returns the IDs of those Subregions containing US. The result of the rough complete querying is {U123, U124, U125, U126, U147, U157}. We can find that the result by crisp querying is equal to the lower approximation, that is {U123, U124, U125, U126, U147}, whereas the result of rough complete querying is equal to the upper approximation, that is {U123, U124, U125, U126, U147, U157}, where U157 is the borderline element. Theorem 2. The result of rough complete querying is the maximal set that satisfies querying conditions, and we can denote its result as follows: R X={ri| ∃ i(ri ∈ R) ∧ |ri(a)| ≥ 1 ∧ Γ j(ri(a))=C,1 ≤ i ≤ |U|,1 ≤ j ≤ K}. 4.3 Rough Combinatorial Querying Rough combinatorial querying is a querying that combines several conditions to query the database. For example, the query from Table1 return the IDs of those Subregions containing {SAND,ROAD}.Above demand can be denoted as follows: Π 1,3( σ FEATURE=“SAND” ∧ FEATURE=“ROAD”(Subregions)). After using SQL language to query the records, the result is P={U147, U157, M007, M008, M009} and Q={U147,U157,M007}. Rough relation P and Q is obviously union compatible, so the final result is M=P ∩ Q={U147, U157, M007}.
Querying Data from RRDB Based on Rough Sets Theory
345
Table 1. Subregions
ID U123 U124 U125 U126 U147 U157 M007 M008 M009 CO39 CO40
COUNTRY US US USA US US {US,MEXICO} MEXICO MEXICO MEXICO BELIZE {BELIZE,INT}
FEATURE {MARSH,LAKE} MARSH {MARSH,PASTURE,RIVER} {FOREST,RIVER} {SAND,ROAD,URBAN} {SAND,ROAD} {SAND,ROAD} BEACH SAND JUNGLE {JUNGLE,COAST,SEA}
References 1. Beaubouef, Theresa., Frederick, E.Petry., Bill,P. Buckles.:Extension of the relational database and its algebra with rough set techniques. Computational Intelligence, Vol. 11. 2 (1995) 233–245 2. Sui Yuefei, Wang Ju,Jiang Yuncheng. :Rough Relational Database: the Basic Definitions. Computer Science , vol. 28.5 (2001) 122–124 3. K.Y.Hu, Y.F.Sui,Y. C.Lu et.al.:Multi-value rough set model. Computer Science, vol. 28.5 (2001) 122–124
An Inference Approach Based on Rough Sets
1
Fuyan Liu and Shaoyi Lu Hangzhou Institute of Electronic Engineering, 310037, Hangzhou, China
Abstract. In this paper we present an inference approach, which is based on rough sets theory, for inducing rules from examples. The features of the inference approach lies in that it combines a criterion of dependency degree with decision makers’ priori knowledge in selecting attributes of objects. For each rule can have several reductions, it uses an algorithm to implement a proper reduction and select the most effective attribute subset. As such, a practical and effective reduced knowledge rule set can be obtained.
1 Introduction With the popularity of data application and the maturation of database technology, the amount of data accumulated is increasing at exponential speed. Rough sets theory is a new method for analyzing, inducing, studying and discovering of incomplete data. It has been shown that classical set theory approach is very successful in plenty of applications. But it is also revealed that there exist limitations for some problems such as data mining, machine learning etc. because the assumption of the classical approach is too ideal to be true in the real world. As an extension of the classical set theory, the rough sets theory incorporates the model of knowledge into its formalism, thus it represents sets approximately in terms of the available context knowledge, and it leads to approximate decision due to containing the imperfect context knowledge. The rough sets theory is capable of dealing with uncertain problems, so it creates new possibilities to deal with complex objects, systems or natural phenomena for data mining, machine learning, and for variety of other areas. In this paper we present an inference approach which is based on rough sets theory for inducing rules from examples. Following a brief introduction of rough sets theory, the knowledge representation and relevant concepts of rough sets theory as well as a knowledge inference approach are introduced. Finally, an application of the approach under study and conclusions are given.
2 Knowledge Representation Rough sets theory is concerned in uncertainty of problems. It can be used in data reduction such as deleting irrelevant fields or records, evaluating data significance, similarity or discrepancy analysis of objects, cause and effect relation mining etc. 1
It is supported financially by Zhejiang Provincial Natural Science Foundation of China (698058)
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 346–349, 2003. © Springer-Verlag Berlin Heidelberg 2003
An Inference Approach Based on Rough Sets
347
In rough sets theory, the basic components of a knowledge representation system are a set of objects. The knowledge of these objects is described by attributes and attribute values of designated objects. So that a knowledge representation system S can be expressed as: S={U,A,V,F}, where UØ is a non-empty finite set, A is a finite set of attributes and A=C D, here C D=Ø, C is a set of condition attributes and D is a set of decision attributes; V= Va is a union of domains Va of attribute a A; F: UhA V denotes an information function from UhA to V that F(x,a) Va for each x U and a A, i. e. assigns attribute values to objects belonging to U. Therefore, a knowledge representation system describes the universe as a table with 2-dimensions, where each row denotes an object and each column denotes an attribute. In general, attributes of objects are divided into condition attributes and decision attributes in rough sets. Equivalence class is classified according to identical values of each attribute. Generally, there exist three situations between equivalence class RC of condition attributes C and equivalence class RD of decision attributes D. These are lower approximation: RD contains RC, i.e. RC ⊆ RD; upper approximation: intersection of RC and RD is not empty, i.e. RC 5DØ; independency: intersection of RC and RD is empty, i.e. RC 5D=Ø. Deterministic rules are set up for lower approximation, indeterminate rules (with confidence level) are set up for upper approximation, and there are no rules for irrelevant situation.
3 Knowledge Inference Rough sets theory concerns the classificatory analysis of imprecise, uncertain or incomplete information. It is a very effective methodology for data analysis and discovering rules in the attribute-value based domains. It is also an efficient tool for database mining in relational databases. In this paper, we define that a concept is a set of domain objects with a particular value for decision attributes, so that the goal of concept learning is to find a discriminating description of objects with a particular value for decision attribute in an attribute-value system. Also we define a relative reduction as a minimal sufficient subset of an attribute set, which has the same ability to discern concepts as the full set. Usually there are too many attributes for all to be used by a learning system. Some may be redundant or irrelevant. Attribute selection attempts to select a subset of the attributes that is necessary and sufficient to describe a target concept before rule generation. It is also important to accelerate learning and to improve learning quality. We use Quinlan’s information gain to assign significance values to each attribute to evaluate its ability to discriminate objects in U. Typically, higher significance values for a attribute indicates greater influence on decision attributes. If the attribute is irrelevant to the current decision attribute, then eliminate this attribute while finding classification table for decision attributes. For an attribute-value system S={U,C D,V,F}, suppose that Ci denotes a condition instance in RC, Dj denotes a decision instance in RD; RC and RD are equivalence classes of C and D respectively over U. Then we define that the degree of dependency of decision instance on condition instance is measured by Sij: Ci Dj,
348
F. Liu and S. Lu
Sij=card(Ci
Dj)/card (Ci) .
(1)
Therefore if condition instance Cj belongs to or contained in the lower approximation C_(Dj) of decision instance, then Sij=1, otherwise if condition instance í Cj belongs to or contained in U-C (Dj), then Sij=0 A relative reduction CRED of C is a maximal independent subset of C with respect to D. In order to compute a relative reduction each condition attribute is tested by removing it temporarily from C. At each step, formula (1) is used to calculate the degree of dependency based on the remaining attributes in C. Usually, there are several reductions for a knowledge representation system, which hold classification ability identical to that with original condition attributes. Therefore, effective attributes are reasonably selected to get a minimum decision rules to describe the universe correctly or approximately. Under general circumstance, decision makers possess a priori knowledge of weight values for each condition attributes. The weight values can be used to measure the dependent significance of the attributes. But under different decision environment, the same attribute can have different influence on decision output, i.e. weight values are sensitive to environment. The dependency of attribute in rough sets theory indicates the effect of an attribute on decision rules under current data environment. For example, if the dependency of an attribute equals 0 then it indicates that positive region of classification for decision rules is not affected if removing this attribute. However, it cannot reflect the priori knowledge that decision makers possessed. Therefore, combining the two methods as a criterion algorithm for selecting effective attributes is a reasonable solution scheme. Steps for selecting effective attributes are as follows. (1) For a given condition attributes and decision attributes in a universe, a 2dimensional data set or a decision rule table is constructed. (2) Determine a data classification criterion, represent each attribute value in normalized form, and eliminate redundant attributes. (3) Calculate dependency of each attribute in formula (1) under current data environment, if there are the attributes, which have dependency equal to 0 and minimum priori weight value then eliminate them. (4) Calculate possible reductions and the core of each decision instance, and select an attribute reduction table with regard to effective decision rules according to certain criterion described below, then the simplest rules will be obtained. Because in real systems, each rule can have several reductions, whose combination may be a very big rule set and is very complicated to use. Therefore, in order to implement a proper reduction we select the most effective attribute subset, i.e. select one reduction, which includes the biggest weight from the reductions of each rule to express a decision rule of the universe. This can be done by a practical and effective method as follows. Suppose that an attribute set of a reduced decision table is {a1, a2, ..., an}, their priori weights are w(a1), w(a2), ..., w(an) respectively; and rule i has m possible reductions. We define weights of each reduction as: Wi k=( p(aj)*w(aj)) »QN ,m . (2) j=1
An Inference Approach Based on Rough Sets
349
In which if aj is a designated value in a reduction, then p(aj)=1, otherwise p(aj)=0. We select that reduction for rule i, which is corresponding to one with a maximum of Wi k. As such, take combination of reductions with the biggest weight, and then a practical and effective reduced decision rule set can be obtained.
4
Application of the Approach
In this paper a prototype of part transportation system is used as an application environment. Knowledge rules, which are derived, are used to control operation modes of transportation robots in the system and to drive the system to arrive at an optimal operation status under which it runs with a maximum output part flow. The system consists of an input buffer, two belt conveyers, one loop conveyer and two robots. The input part flow comes into the system at a random speed ([1] in detail). As an application of the inference approach presented in the above, at first we describe the knowledge representation system of the system under study as S={U, C D, V, F}; Here condition attributes C include numbers of parts in buffer and individual conveyers, and their particular locations on different conveyers; decision attributes D include two control variables. After a 2-dimensional data set is constructed through building a basic simulation model of the prototype system and running it as well as collecting useful data, then an attribute-value system is formed. The attribute reduction technique is applied to eliminate irrelevant attributes, so that the number of condition attributes has been reduced from 54 to 4. When effective attributes are selected according to the above-mentioned steps, we chose that reduction for rule i, which is one with a maximum of Wi k from formula (2) and take combination of reductions with the biggest weights. As such, a practical and effective reduced decision rule set can be obtained. In summary, this paper presents an inference approach, which is based on rough sets theory, for inducing knowledge rules and applying to optimization control of a prototype simulation system. The feature of the inference approach lies in that it combines formula (1), which is used as a criterion of dependency degree, with decision makers’ priori knowledge in selecting attributes. For each rule can have several reductions, it uses formula (2) to implement a proper reduction and select the most effective attribute subset, which can accelerate learning and later result in simpler rules. It is proved that the inference approach is useful in data mining and knowledge discovery.
References 1. Lu Shaoyi and Liu Fuyan.: A State Table Approach Performing an Optimal Control. Computer Engineering and Applications, 8 (1997).47–49
Classification Using the Variable Precision Rough Set Yongqiang Zhao, Hongcai Zhang, and Quan Pan Automatic Control Department Northwest Ploytechnical University Xi’an 710072 China [email protected]
Abstract. In this paper, we present a new version of discernibility matrix, so called variable precision discernibility matrix, which can tolerate the noise of information, in addition, a reduction algorithm is also presented based on the partial order relation of the conditional attribute and used to do image classification. Compared with traditional rough set and BP network, the results will be much better.
1 Introduction Classification or recognition is, in a sense, a problem of forecasting the decision for new case by comparing its condition features with that of the already known instance. Neural network is one of the most popular methods in pattern recognition field, but it is not good enough for the case of multi-class classification [7]. Rough set theory (RST) offers an interesting and novel approach to generate the classification rules. Recently, Cyran [4] and Bell [5] proposed a classification method, they employed rough set based inductive reasoning for discovering optimal features set and getting decision rules, but the method is based on the rough set on crisp set space, and supposed that data in decision tables are complete and crisp. However the real data are always corrupted by noises, so it is not feasible to deal with the noisy data with the traditional definition of rough set. Variable precision rough set(VPRS) proposed by Ziarko [6] is defined on the probabilistic space, and will give us a new way to deal with the noisy data. In this paper, we will use the VPRS to deal with the multi-class noisy images and the simulation results are satisfying.
2 Basic Concept In RST, an information system is defined as I={U,A}, where U is a finite, nonempty set called the universe and A=C D is an attribute set where C and D are condition attribute and decision attribute respectively. An important concept in RST is the discernibility matrix M [1]: M(C)={mi,j}, φ d ( u i ) = d (u j ) mi, j = { c ∈ C | c ( u ) = c ( u )} d (u i ) ≠ d (u j ) i j
G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 350–353, 2003. © Springer-Verlag Berlin Heidelberg 2003
(1)
Classification Using the Variable Precision Rough Set
351
The element of discernibility matrix M is the collection of attributes which can distinguish ui and uj. It is easy to compute the core and reduction of an information system by the discernibility matrix[1]. But it is difficult for commonly used discernibility matrix to deal with the noisy data. Thus a variable precision rough set (VPRS) model was suggested by Ziarko [6], and it is a powerful tool to deal with the noisy data, this will be shown in section 3.
3 The VPRS Reduction According to Ziarko in [6], a probability on rough set is defined as follows: P ( D | [u i ] IND ) = card ( D I [u i ] IND ) /( card ([u i ] IND ))
(2)
Where [ui]IND is an equivalence class of ui, card([ui]IND) is the number of the equivalence class of ui and card(D [ui]IND) is the number of the equivalence class of ui whose decision is D. According to the description of the P(D|[ui]IND)and the meaning of the discernibility matrix, we have the definition of the probabilistic discernibility matrix (PDM). Proposition 1 (PDM based on the definition of the discernibility matrix of Skowron): For an information system I={U,A=C D}, a discernibility matrix M(C) on probabilistic space is defined as follows : M(C)={mi,j}, φ (P(D|[ui ]IND) ≥ β ∧P(D|[uj ]IND) ≥ β)∨(P(D|[ui ]IND) < β ∧P(D|[uj ]IND) < β) (3) mi, j = {c∈C:| c(xi ) −c(xj )|> NTh} (P(D|[ui ]IND) ≥ β ∧P(D|[uj ]IND) < β) ∨(P(D|[ui ]IND) < β ∧P(D|[uj ]IND) ≥ β)
Where NTh is the noise threshold decided by the noise level of the decision table. The equation (3) define an attribute set that can distinguish ui and uj with the probability£when the noise threshold is NTh. The use of probability in the definition corresponds to the use of ‘majority inclusion’ relations for classification rather than ‘equivalence relation’ and this will tolerate the noise of data and improve the classifying ability for the noisy data. Because it is a NP-hard problem [1] to get the optimal reduction of the decision table. Therefore, the most important thing of the next step is to find a reduction algorithm with low computation cost. According to Pawlark in [2], a reduction must satisfy an independent condition, that is, a reduction ¶ R ⊆ C, where R should be independent if ∀ r R,POSR(D) POS{R-r}(D). Here POS means the positive region of the attribute R according to the decision D. In [2], a reduction satisfying the independent condition is called a Pawlark reduction. If an algorithm can ensure that the solution will be a Pawlark reduction for any information system, then we will say this algorithm is complete for Pawlark reduction. On this basis, a new version of reduction algorithm can be obtained as follows. In [2], a reduction algorithm based on the ordered attributes was proposed and proved to be complete for Pawlark reduction. But this method did not consider the core attribute of the decision table. In fact, the reduction algorithm based on core
attributes will be very fast and simple. The following algorithm is the improved version. Suppose that for an information system I = {U, A}, n = card(U), and M is the PDM of the information system. Before computing the discernibility matrix of the information system, we define a partial order relation ≻ on all condition attributes, PO: a_1 ≻ a_2 ≻ … ≻ a_n, and the elements of each m_{i,j} are listed according to ≻. At the beginning, let Re = ∅. The reduction algorithm is as follows: Algorithm: (1) Let Re = Re ∪ Core, m_{i,j} = m_{i,j} − Core, C = C − Core; (2) c = FirstOrd(C); choose an element σ from [c], σ = {c} ∪ B; (3) Re = Re ∪ {c}, m_{i,j} = m_{i,j} − {c} and m_{i,j} = m_{i,j} − B; (4) go to step (2) until every m_{i,j} = ∅. In the algorithm, [c] is the set of entries m_{i,j} that contain the attribute c, and B is the attribute set obtained from σ by removing c. We now prove that the algorithm is complete for Pawlak reduction. Proof. Since the reduct Re must satisfy M′ = {σ : σ ∩ Re = ∅, σ ∈ M} = ∅, Re is a Pawlak reduct according to [2]; we then show that Re is independent. Assume the reduct attribute set is Re = {c_1, c_2, …, c_m}; we need to prove that every c_i ∈ Re is indispensable. By the definition of indispensability, attribute c_i must satisfy the relation M′ = {σ : σ ∩ (Re − {c_i}) = ∅, σ ∈ M} ≠ ∅. From the procedure of step (2), it is clear that every attribute in the reduct Re satisfies this relation. So the reduction algorithm is complete for Pawlak reduction.
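To make the steps concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of the core-based loop over a set-valued discernibility matrix; the entry representation, the first-in-order selection and the toy matrix are assumptions of this sketch.

```python
# Hedged sketch of the core-based reduction (steps 1-4 above).
# Matrix entries are frozensets of condition attributes; names are illustrative.

def core_based_reduction(entries, core, order):
    """entries: non-empty discernibility-matrix cells (sets of attributes);
    core: attributes occurring as singleton cells; order: the assumed total
    order a1 > a2 > ... on the condition attributes."""
    reduct = set(core)                                  # step (1)
    cells = [e - reduct for e in entries]
    cells = [e for e in cells if e]                     # cells hit by the core are done
    remaining = [a for a in order if a not in reduct]
    while cells:                                        # steps (2)-(4)
        c = next(a for a in remaining if any(a in e for e in cells))
        reduct.add(c)
        remaining.remove(c)
        cells = [e for e in cells if c not in e]        # drop cells covered by c
    return reduct

# Tiny illustrative run (hypothetical matrix):
cells = [frozenset({"c2"}), frozenset({"c5", "c6"}), frozenset({"c6", "c7"})]
print(core_based_reduction(cells, core={"c2"}, order=["c5", "c6", "c7"]))
```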
4 An Illustrative Application
In this section, the method proposed here is used to classify noisy images; we have a total of ninety images of nine kinds of airplanes. The first step of image classification is to extract invariant characteristics of these images. For simplicity, the Hu invariant moments [7] are used. Taking Hu's seven invariant moments as condition attributes and the kind of airplane as the decision attribute, we obtain a decision table containing ninety objects and nine classes. For multi-class classification it is not suitable to use a traditional neural network: in our experiment the recognition rate was no more than 60%, the training time was very long, and the initial weights were very difficult to set. In order to solve the problem with rough sets, we construct two decision tables. Table 1 consists of nine objects, one object per class. Table 2 consists of ninety objects, ten objects per class. The core of Table 1 is Core = {c2, c7}. With the condition attributes ordered c1 ≻ c2 ≻ c3 ≻ c4 ≻ c5 ≻ c6 ≻ c7, the previous reduction algorithm yields a reduct of Table 1, Re = {c2, c5, c6, c7}. But the core of Table 2 is Core = {c1, c2, c3, c4, c5, c6, c7}, and the reduct is Re = Core. It seems absurd, since Table 1 is only a part of Table 2; in fact the problem is that there is too little training data, just as occurs in the learning of neural networks. So, if we want to get complete classification rules, we have to use more training examples. Fig. 1 shows the relationship of the classification result to the amount of training data.
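As an aside on the feature-extraction step, here is a minimal sketch of computing Hu's seven invariant moments for one image using OpenCV (a toolchain of our choosing, not one named by the authors); the file name is hypothetical.

```python
import cv2
import numpy as np

img = cv2.imread("plane_01.png", cv2.IMREAD_GRAYSCALE)        # hypothetical file
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
hu = cv2.HuMoments(cv2.moments(binary)).flatten()             # seven condition attributes
# Log scaling is a common stabiliser before discretising into a decision table:
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
```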
Fig. 1. The relationship between the number of training data (X-coordinate) and recognition rate (Y-coordinate)
From Fig. 1 we can see that as the number of training data increases, the classification result becomes better, and the best classification rate is more than 95%. For the traditional rough set, by contrast, the best classification rate is no more than 85%.
5 Summary and Conclusion This paper focused on the classification of noisy data with VPRS. We analyzed the shortcomings of the Skowron discernibility matrix in dealing with noisy data, and a variable precision discernibility matrix was constructed to deal with such data. To get a complete reduct, we revised the reduction algorithm in [2] and proved the completeness of the revised algorithm. The experimental results are much better.
References
1. Skowron, A., Rauszer, C.: The Discernibility Matrix and Function in Information Systems. In: Slowinski, R. (ed.): Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory (1991) 331–362
2. Wang, Jue, Wang, Ju: Reduction Algorithms Based on Discernibility Matrix: The Ordered Attributes Method. Journal of Computer Science and Technology, Vol. 16 (2001) 489–504
3. Slezak, D., Wroblewski, J.: Classification Algorithms Based on Linear Combinations of Features. Proc. of PKDD'99, Springer-Verlag (1999) 548–553
4. Bell, D.A., Guan, J.W.: Computational Methods for Rough Classification and Discovery. Journal of the American Society for Information Science, Vol. 5 (1998) 403–414
5. Kim, D., Bang, S.-Y.: A Handwritten Numeral Character Classification Using Tolerant Rough Set. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22 (2000) 923–937
6. Ziarko, W.: Variable Precision Rough Set Model. Journal of Computer and System Sciences, Vol. 46 (1993) 39–59
7. Hu, M.: Visual Pattern Recognition by Moment Invariants. IRE Trans. on Information Theory, Vol. 8 (1962) 179–187
An Illustration of the Effect of Continuous Valued Discretisation in Data Analysis Using VPRSβ Malcolm J. Beynon Cardiff Business School, Cardiff University, Colum Drive, Cardiff, CF10 3EU, Wales, UK [email protected]
Abstract. This paper explores the effect of the continuous valued discretisation (CVD) of condition attributes on the results of an analysis using the variable precision rough sets model (VPRSβ). With the utilisation of ordered criteria for the identification of an associated β-reduct, a 'leave n out' approach is undertaken in the subsequent analysis. A small example problem is considered, for which three alternative methods of CVD are used to intervalise the continuous condition attributes. For each CVD approach, 1500 decision tables are constructed and a VPRSβ analysis undertaken.
1 Introduction In this paper, the effect of continuous valued discretisation (CVD) on the results of a series of variable precision rough sets model (VPRSβ) [4] analyses is investigated. Moreover, an example problem with continuous attribute values is considered and three methods of CVD utilised. Through a 'leave n out' approach, 1500 runs are made, each generating a random group of in-sample objects (a decision table), with the subsequently constructed rules applied to the out-of-sample objects. Descriptive measures are explored to elucidate any effects of the different CVD methods used. Central to VPRSβ is the decision table, of condition (C) and decision (D) attributes in categorical form, with equivalence classes E(C) and E(D) constructed. From these classes, with Z ⊆ U (the set of all objects) and P ⊆ C, the POSβ(Z), NEGβ(Z) and BNRβ(Z) approximation regions are defined as:

    POSβ(Z) = ∪{X_i ∈ E(P) : Pr(Z|X_i) ≥ β},
    NEGβ(Z) = ∪{X_i ∈ E(P) : Pr(Z|X_i) ≤ 1 − β},
    BNRβ(Z) = ∪{X_i ∈ E(P) : 1 − β < Pr(Z|X_i) < β}.

The β-quality of classification γβ(P, D) represents the proportion of objects which are classified to single decision classes D_j, and is given by

    γβ(P, D) = card(∪_{D_j ∈ E(D)} POSβ(D_j)) / card(U).
Subsequently the concept of a β-reduct can be given, whereby a β-reduct is a minimal subset P ⊆ C of the condition attributes for which γβ(P, D) = γβ(C, D) [4]. To select an acceptable β-reduct, the following ordered criteria are used [2]: i) the highest quality of classification possible, ii) the highest β value from those satisfying i), iii) the least number of attributes in the β-reduct from those satisfying ii), and iv) the largest interval domain of β from those satisfying iii). If more than one β-reduct is identified using these ordered criteria, a β-reduct is then randomly chosen from them. The set of rules constructed and the method of prediction follow those used in [2].
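A minimal sketch (ours, not the author's code) of computing γβ(P, D) from a categorical decision table, assuming the symmetric bounds above with β > 0.5 so that each equivalence class can lie in at most one positive region:

```python
from collections import defaultdict

def beta_quality(objects, P, beta):
    """objects: list of (condition_dict, decision) pairs; P: attribute subset.
    Returns gamma_beta(P, D) under the definitions above (assumes beta > 0.5)."""
    classes = defaultdict(list)
    for cond, dec in objects:
        classes[tuple(cond[a] for a in P)].append(dec)
    in_pos = 0
    for decs in classes.values():
        majority = max(decs.count(d) for d in set(decs)) / len(decs)
        if majority >= beta:          # Pr(D_j | X_i) >= beta for some D_j
            in_pos += len(decs)
    return in_pos / sum(len(v) for v in classes.values())
```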
2 Description of Data
In this paper a data set presented in [1] is used to exposit the analysis considered here. It consists of condition attributes for each state of the USA: % White (students), % Black, % Hispanic, % Asian, % American/Alaskan Indian, teacher-pupil ratio, % young under the poverty level and % disabled, defined c1, c2, …, c8 respectively. These condition attributes enable the classification of a state to one of three levels of expenditure on education (d1), categorising 9, 24 and 18 of the 51 states. Here, three CVD methods are considered: first, whether a condition attribute value is above or below its mean (MN) or median (MD) value, and the minimum-entropy method (ME) [3], for which two intervals were constructed. In Table 1 the results of these three methods of CVD are presented (each entry gives the cut point followed by the sizes of the two resulting intervals).

Table 1. Intervalisation of condition attributes c1, c2, …, c8

    Att   MN              MD             ME
    c1    71.35: 22, 29   75.3: 25, 26   81.15: 31, 20
    c2    15.30: 32, 19   9.8: 25, 26    10.80: 28, 23
    c3    7.54: 36, 15    3.3: 24, 27    2.150: 19, 32
    c4    3.51: 41, 10    1.5: 24, 27    1.55: 26, 25
    c5    2.31: 42, 9     0.5: 25, 26    0.45: 25, 26
    c6    16.77: 27, 24   16.6: 25, 26   19.05: 44, 7
    c7    17.53: 31, 20   16.6: 25, 26   33.30: 49, 2
    c8    12.50: 27, 24   12.4: 24, 27   13.90: 40, 11
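The two simpler cuts translate directly into code; a minimal sketch (ours; the minimum-entropy method of [3] would instead search candidate boundaries for the split minimising class entropy):

```python
import numpy as np

def cvd_two_interval(column, method="MN"):
    """Two-interval discretisation of one condition attribute:
    cut at the mean (MN) or the median (MD)."""
    column = np.asarray(column, dtype=float)
    cut = column.mean() if method == "MN" else np.median(column)
    return (column > cut).astype(int), cut   # interval labels and the cut point
```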
3 Elucidation of 'Leave n out' VPRSβ Analyses
Following [2], the effectiveness of the ordered criteria for β-reduct selection is given in Table 2. In this paper 1500 runs were undertaken for each CVD, with (for each run) 46 and 5 states in the in- and out-of-samples respectively ('leave 5 out').

Table 2. Number of β-reducts identified using the ordered criteria and the MN, MD and ME CVD methods

    CVD   1             2            3          4         5         6
    MN    1357 (90.5%)  128 (8.5%)   14 (0.9%)  1 (0.1%)
    MD    1292 (86.1%)  172 (11.5%)  25 (1.7%)  5 (0.3%)  4 (0.3%)  2 (0.1%)
    ME    1442 (96.1%)  52 (3.5%)    5 (0.3%)   1 (0.1%)
As Table 2 shows, in the majority of runs a single β-reduct is identified; the most is six β-reducts (with the MD CVD). Table 3 reports a number of measures on the VPRSβ analyses.
Table 3. Descriptive statistics for the VPRSβ analyses based on 1500 runs using MN, MD and ME

    Variable              CVD   Min   Max   Mean     Median   Mode
    Prediction Accuracy   MN    0     1.0   0.8477   1.0      1.0
                          MD    0     1.0   0.8080   1.0      1.0
                          ME    0     1.0   0.6611   0.6000   0.8000
    Number of Rules       MN    2     28    4.0693   2        2
                          MD    2     26    5.8960   2.5      2
                          ME    2     19    7.1400   6        5
To accompany the predictive accuracy of the analysis (probabilistic rules) on the (five) out-of-sample states in each run, Fig. 1 shows the distribution of the predictive accuracy over the 1500 'leave n out' runs, with the similarity between MN and MD apparent.
Fig. 1. Frequency of predictive accuracy (proportion)
Fig. 2. Frequency of number of rules from MN, MD and ME CVD methods
In Fig. 2 the frequency of the number of rules is described for each of the CVD methods. In Fig. 3 the top ten most frequently identified β-reducts are shown (with their frequencies) for the three CVD methods adopted. These ten β-reducts make up 74.4%, 63.4% and 45.4% of those identified over the 1500 runs for the MN, MD and ME CVD methods respectively.
Fig. 3. The top ten most frequently selected β-reducts (frequencies also given)
References
1. Beynon, M., Curry, B., Morgan, P.: Classification and Rule Construction Using Rough Set Theory. Expert Systems (2000) 136–148
2. Beynon, M.: Investigating the Choice of l and u Values in the Extended Variable Precision Rough Sets Model. Rough Sets and Current Trends in Computing (RSCTC 2002), Penn State University, USA (2002) 61–68
3. Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning 8 (1992) 87–102
4. Ziarko, W.: Variable Precision Rough Sets Model. Journal of Computer and System Sciences 46 (1993) 39–59
Application of Fuzzy Control Based on Changeable Universe to Superheated Steam Temperature Control System

Keming Xie(1), Fang Wang(1), Gang Xie(1), and T.Y. Lin(2)

(1) College of Information Engineering, Taiyuan University of Technology, Taiyuan, Shanxi, 030024 P.R. China [email protected]
(2) Dept. of Mathematics & Computer Science, San Jose State University, San Jose, California 95192-0103 USA [email protected]
Abstract. Cascade control systems are generally adopted in the units of power plants, as in the application shown in this paper. The main controller employs an FLC based on the changeable universe of discourse designed in this paper. Based on the inputs of the main FLC, the paper introduces a fuzzy variable α to automatically tune the range of the universe of discourse, simulating the human fuzzy control process from rough to exact control. The improved fuzzy logic controller has some self-adaptive adjusting capability. It is combined with a PI controller in a cascade control system and applied to the superheated temperature control system of a 1900 t/h once-through boiler. Comparison of this scheme with traditional cascade PID control and a conventional FLC demonstrates its efficacy. Keywords. Fuzzy logic control (FLC), changeable universe of discourse, fuzzy variable α, superheated steam temperature system.
1 Introduction Traditional cascade PID control needs an exact mathematical model in the tuning of the control parameters, which makes it difficult to meet control requirements. A fuzzy logic controller (FLC), with good robustness and nonlinear characteristics, need not establish a precise mathematical model of the object. It is suited to the control of nonlinear, time-varying objects with large time delay. However, there are still many unsolved problems in theoretical and practical aspects. Researchers keep exploring methods to improve fuzzy control effects, for example optimizing fuzzy inference systems [1–4] through the self-adaptive learning capability of neural networks or genetic algorithms (GA), or combining FLC with traditional control methods (such as fuzzy PID [5]) to realize exact and effective control. But these methods do not match the precision and speed of human fuzzy control. So this paper introduces a fuzzy variable α to tune, in real time, the range of the universe of the output variable, simulating the human control process from rough to exact, in an attempt to enhance fuzzy control quality.
2 Design of the FLC Based on a Changeable Universe 2.1 Changeable Universe of Discourse Take a fuzzy controller with two inputs and a single output as an example. Suppose the universes of discourse of the input variables are X and Y, and that of the output variable is Z, where X = [−e, e], Y = [−m, m], Z = [−n, n] and m, n, e ∈ N (N being the natural numbers). The format of the fuzzy rules R_ij of a conventional FLC is: if x = A_i and y = B_j, then z = C_ij, where A_i, B_j and C_ij are linguistic terms over X, Y and Z respectively. As we know, the precision of an FLC is related to the number of linguistic terms in the universe; that is, if the distance between them is close enough, the obtained control function will approximate the ideal control function sufficiently well. But this increases the number of rules geometrically and makes the generation of fuzzy rules harder. Decreasing the number of terms, i.e. increasing their spacing, simplifies the establishment of the fuzzy rules and reduces computation, but costs control accuracy. Moreover, in a man-controlled process the output approaches the setpoint gradually through a rough-to-exact control policy. So the authors introduce a fuzzy variable α to automatically expand or compress the range of the universe of the output variable, based on the inputs x and y, while keeping the original fuzzy rules. Fig. 1 is a sketch map of the compression and expansion of the range of the universe. Thus, when the values of the inputs x and y are relatively large, rough control operates: the range of the output variable's universe is expanded and the control quantity increased, which makes the output enter the local range of the setpoint rapidly. When the values of x and y are relatively small, that is, the system is close to the steady state, exact adjustment is needed: the range of the output variable is compressed and the number of fuzzy linguistic terms is relatively increased, enhancing control accuracy. The structure of the improved fuzzy logic controller is shown in Fig. 2. In fact this fuzzy controller based on the changeable universe is one with self-adaptive adjusting capability. Its control process is similar to a human's, and the constitution of the fuzzy rules needs only their tendency instead of much domain knowledge [6]. An FLC is in itself an interpolator. The membership functions of the linguistic terms representing the premise part of the fuzzy inference are the interpolator's basis functions in the interpolative sense. Optimal control functions are gradually obtained and the steady-state error is minimized in the course of range compression of the universe. Suppose the seven linguistic terms are NB, NM, NS, ZE, PS, PM, PB and the membership functions of the linguistic terms are placed symmetrically; then the FLC regards the ZE interval of variable Z, [−(1/3)n, (1/3)n], as 0, which is called the dead zone. When the range of the universe is cut down, that is, n lessened, the dead zone also decreases correspondingly. Quantity control of the fuzzy variable α still selects x and y as input variables, and their universes of discourse and linguistic terms are the same as those of the conventional FLC. Since α is a scaling factor, we can select the range of the output α as [a, b], avoiding the transform from the fuzzy to the real universe for defuzzification, where 0 < a < 1 < b, a and b being the limits of compression and
expansion respectively. If the universe of the conventional FLC's output variable is [−n, n] at the beginning, then the on-line adjusted range of the universe is [−αn, αn], and the range of the universe of every linguistic term is changed by α times. For example, suppose the initial range of ZE is [−(1/3)n, (1/3)n]; then the range becomes [−(1/3)nα, (1/3)nα] after adjustment, and so for the other linguistic terms.
Fig. 1. Range compression and expansion of universe
Fig. 2. Structure of FLC based on changeable universe
2.2 Design of the Conventional FLC Error (E) and error change (EC) are chosen as the inputs of the conventional FLC. This paper uses an FLC of integral type, namely u(t) = u(t−1) + Δu(t). The input and output variables' linguistic terms are [NB, NM, NS, ZE, PS, PM, PB], denoting negative big, negative middle, negative small, zero, positive small, positive middle, positive big. In this paper the fuzzy sets of all these linguistic terms are defined by Gaussian membership functions, which can express the human instinctive reasoning way. The control quantity u(t) has a different correcting variable Δu(t) based on the error and error change. All the fuzzy rules for Δu(t) and the fuzzy variable α (a = 0.2, b = 1.8) are listed in Table 1.

Table 1. Fuzzy rules for the fuzzy variable α and Δu (each entry is α/Δu; rows indexed by E, columns by EC)

    E\EC   NB      NM      NS      ZE      PS      PM      PB
    NB     EB/NB   EB/NB   EB/NM   ES/NM   EB/NM   EB/NS   EB/ZE
    NM     EB/NB   ES/NM   ES/NM   OK/NM   ES/NS   ES/ZE   EB/PS
    NS     ES/NM   OK/NM   OK/NM   CS/NS   OK/ZE   OK/PS   ES/PS
    ZE     OK/NM   OK/NM   CS/NS   CB/ZE   CS/PS   OK/PS   OK/PM
    PS     ES/NM   OK/NS   OK/ZE   CS/PS   OK/PS   OK/PM   ES/PM
    PM     EB/NS   ES/ZE   ES/PS   OK/PS   ES/PM   ES/PM   EB/PB
    PB     EB/ZE   EB/PS   EB/PS   ES/PM   EB/PM   EB/PB   EB/PB
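To illustrate the mechanism, the following is a minimal sketch (ours, with rule tables as numeric stand-ins for Table 1, not the authors' tuning) of one step of the changeable-universe FLC: both α and Δu are inferred from the same Gaussian-fuzzified inputs, and the output universe is rescaled by α before the defuzzified Δu is produced.

```python
import numpy as np

TERMS = np.linspace(-1.0, 1.0, 7)          # NB NM NS ZE PS PM PB on a normalised universe

def memberships(x, sigma=0.25):
    """Gaussian membership of x in the seven linguistic terms."""
    return np.exp(-((x - TERMS) / sigma) ** 2)

def flc_step(e, ec, n, alpha_table, du_table):
    """One control step. e, ec: inputs normalised to [-1, 1]; n: current
    half-range of the output universe; alpha_table, du_table: 7x7 arrays of
    numeric consequents (alpha in [0.2, 1.8], du on the normalised universe)."""
    fire = np.outer(memberships(e), memberships(ec))     # rule firing strengths
    alpha = (fire * alpha_table).sum() / fire.sum()      # inferred scaling factor
    du = (fire * du_table).sum() / fire.sum()            # weighted-average defuzzification
    return alpha, du * alpha * n                         # du on the rescaled universe

# Large error -> alpha > 1 expands the universe (rough control); small error
# -> alpha < 1 compresses it (exact control), shrinking the ZE dead zone.
```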
3 Steam Temperature Control System and Simulation Study The superheated steam temperature is an important index for evaluating the boiler's operating quality. It is related to the service life of the superheater and the steam turbine blades, and the thermal efficiency of the unit is also influenced by it. In a word, it affects the safe and economic operation of the power station, whereas the controlled object has some unfavorable characteristics such as large delay and fluctuating parameters. Great attention has therefore been paid to how to control it efficiently. At present the typical control system pattern for the superheated steam temperature is the cascade control system, as in the simulation structure in this paper. The transfer functions of a 1900 t/h once-through boiler under four typical load points, approximated by the high-order inertia function K/(1 + Ts)^n, are listed in Table 3 [7]. The parameters of the cascade PID control system are given in paper [7] for a fair comparison. From the following MATLAB simulation results, shown in Figs. 3-4, it is concluded that the FLC based on a changeable universe of discourse has better control quality, with rapid speed, small overshoot and good robustness, in comparison with traditional cascade PID and the conventional FLC.
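As a hedged illustration of exercising such a plant model outside MATLAB, the step response of the high-order inertia approximation K/(1 + Ts)^n can be computed with SciPy; the K, T, n values below are placeholders, since Table 3 is not reproduced here.

```python
import numpy as np
from scipy import signal

K, T, n = 1.0, 30.0, 3                      # placeholder plant parameters
den = (np.poly1d([T, 1]) ** n).coeffs       # (T s + 1)^n expanded
plant = signal.TransferFunction([K], den.tolist())
t, y = signal.step(plant)                   # open-loop step response
```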
Fig. 3. Step response under 100% and 50% load, where response 1 represents the FLC based on a changeable universe, and responses 2 and 3 the conventional FLC and the PID algorithm.
Fig. 4. Output response with disturbance under 100% and 50% load, where response 1 represents the FLC based on a changeable universe, and responses 2 and 3 the conventional FLC and the PID algorithm.
4 Conclusion This paper designs an FLC based on a changeable universe of discourse, introducing a fuzzy variable α to tune the range of the universe of the output variable of a conventional FLC. An application to a superheated steam temperature system with fluctuating gain and large time delay is given; simulation shows that the changeable-universe FLC is effective in controlling the superheated steam temperature system and has some self-adaptive adjusting capability. The control system features good robustness and small overshoot.
Acknowledgement. The research is supported by the Natural Science Fund (20001034) and the Visiting Scholar Fund (200027) of Shanxi Province, P. R. China.
References
1. Keming Xie, Jianfeng Nan, T.Y. Lin: A New Fast Fuzzy-Neural Feedback Network and Its Application to Steam Temperature Control System in Power Plant. IFAC'99 (1999) 339–344
2. Keming Xie, T.Y. Lin, Jianfeng Nan: A Fuzzy-Neural Network Based on the TS Model with Dynamic Consequent Parameters and Its Application to Steam Temperature Control System in Power Plants. IFSA'99 (1999) 690–694
3. X.M. Qi, T.C. Chin: Genetic Algorithm-Based Fuzzy Controller for High Order Systems. Fuzzy Sets and Systems, Vol. 91 (1997) 279–284
4. Keming Xie, Changhua Mou, Gang Xie: A MEBML-Based Adaptive Fuzzy Logic Controller. IECON-2000 (2000) 1492–1496
5. Keming Xie, Changhua Mou, Gang Xie: The Superheated Steam Temperature Cascade System with the Fuzzy PID Controller. Proceedings of the 19th Chinese Control Conference (2000) 800–803
6. Hongxing Li: To See the Success of Fuzzy Logic from Mathematical Essence. Fuzzy Systems and Mathematics, Vol. 9, No. 4 (1995) 1–14
7. Yongsheng Fang, Zhigao Xu, Laijiu Chen: Study of Adaptive Fuzzy Control of Boiler Superheated Steam Temperature Based on Dynamic Mechanism Analysis. Vol. 17, No. 1 (1997) 23–28
Application of Fuzzy Support Vector Machines in Short-Term Load Forecasting

Yuancheng Li(1) and Tingjian Fang(2)

(1) Department of Automation, University of Science and Technology of China, JinZhai Main Road, 230026, HeFei, P.R. China [email protected]
(2) Institute of Intelligent Machines, Academia Sinica, 230031, HeFei, P.R. China [email protected]
Abstract. A new method using Fuzzy Support Vector Machines (FSVM) is presented for Short-Term Load Forecasting (STLF). In many regression problems, the effects of the training points differ: some training points are more important than others. In FSVM, we apply a fuzzy membership to each input point so that different input points can make different contributions to the learning of the decision surface. The experimental results indicate that FSVM is effective in improving the accuracy of STLF.
1 Introduction The forecasting of electricity load has always been an important issue in the power industry. Short-Term Load Forecasting (STLF), in particular, has become increasingly important since the rise of competitive energy markets. Research on STLF has attracted wide attention for many years. Several major methods and techniques have been proposed and developed, including time series models, regression models, Box-Jenkins transfer functions, expert system models, neural network models and fuzzy logic. This paper presents a new method, the Fuzzy Support Vector Machine (FSVM), for STLF. The Support Vector Machine (SVM) is a new and promising technique for data classification and regression [1]–[3]. In many applications, SVM has been shown to provide higher performance than traditional learning machines [1]. However, some input points may not be exactly assigned to one of two classes: some are more important and should be fully assigned to one class so that the SVM can separate these points more correctly, while some data points corrupted by noise are less meaningful and the machine had better discard them. SVM lacks this kind of ability. In FSVM, we apply a fuzzy membership to each input point of the SVM so that different input points can make different contributions to the learning of the decision surface, which enhances the SVM by reducing the effect of outliers and noise in the data points. The experimental results indicate that FSVM is effective in improving the accuracy of STLF.
2 FSVM In this section we briefly review the idea and formulation of FSVM for the regression problem. In regression, the effects of the training points differ: some training points are more important than others, so we apply a fuzzy membership 0 < s_i ≤ 1 to each training point x_i. This fuzzy membership s_i can be regarded as the attitude of the corresponding training point toward the mapping function, and the value (1 − s_i) as the attitude of meaninglessness. As a result, the traditional SVM is extended to FSVM. 2.1 The Formulation of FSVM
Given a set S of training points with associated fuzzy memberships

    (x_1, y_1, s_1), …, (x_l, y_l, s_l)        (1)

where x_i ∈ R^n are input vectors, y_i ∈ R, and σ ≤ s_i ≤ 1 for i = 1, …, l with sufficiently small σ > 0, the FSVM regression solves the optimization problem

    min over w, b, ξ, ξ*:  (1/2) wᵀw + C ∑_{i=1}^{l} s_i (ξ_i + ξ_i*)
    subject to  y_i − (wᵀφ(x_i) + b) ≤ ε + ξ_i,     i = 1, …, l
                (wᵀφ(x_i) + b) − y_i ≤ ε + ξ_i*,    i = 1, …, l
                ξ_i, ξ_i* ≥ 0,                      i = 1, …, l        (2)

where x_i is mapped to a higher-dimensional space by the function φ, and ξ_i is the upper training error (ξ_i* the lower), subject to the ε-insensitive tube

    |y − (wᵀφ(x) + b)| ≤ ε.        (3)
The parameters that control the regression quality are the cost of error C, the width of the tube ε, the mapping function φ and the fuzzy memberships s_i.
Usually it is more convenient to solve the dual of (2) by introducing Lagrange multipliers α_i*, α_i, which leads to a solution of the form

    f(x) = ∑_{i=1}^{n} (α_i* − α_i) K(x, x_i) + b        (4)

    0 ≤ α_i*, α_i ≤ s_i C        (5)
We can control the tradeoff of each training point x_i in the system with different values of s_i. A smaller value of s_i makes the corresponding point x_i less important in the training. There is only one free parameter in SVM, while the number of free parameters in FSVM is equal to the number of training points. 2.2 Generating the Fuzzy Membership It is easy to choose appropriate fuzzy memberships in STLF. First, we choose σ > 0 as the lower bound of the fuzzy memberships. Second, we make the fuzzy membership s_i a function of the time t_i:

    s_i = f(t_i)        (6)

We suppose the last point x_l is the most important and choose s_l = f(t_l) = 1, and the first point x_1 the least important and choose s_1 = f(t_1) = σ. If we want the fuzzy membership to be a linear function of time, we can select

    s_i = f(t_i) = a t_i + b = ((1 − σ)/(t_l − t_1)) t_i + (t_l σ − t_1)/(t_l − t_1)        (7)

If we want the fuzzy membership to be a quadratic function of time, we can select

    s_i = f(t_i) = a (t_i − b)² + c = (1 − σ) ((t_i − t_1)/(t_l − t_1))² + σ        (8)
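A hedged sketch of this scheme: the cap s_iC in Eq. (5) amounts to weighting each point's slack penalty, which per-sample weighting in an off-the-shelf ε-SVR approximates. The data loaders, σ and the hyperparameter values below are illustrative assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.svm import SVR

def quadratic_membership(t, sigma=0.2):
    """Eq. (8): s_1 = sigma for the oldest point, s_l = 1 for the newest."""
    t = np.asarray(t, dtype=float)
    return (1 - sigma) * ((t - t[0]) / (t[-1] - t[0])) ** 2 + sigma

t = np.arange(365)                       # one year of daily observations
s = quadratic_membership(t)              # recent points weigh most
X, y = load_features(), load_loads()     # hypothetical data loaders
model = SVR(kernel="rbf", C=100, epsilon=0.001, gamma=1.0 / 30)
model.fit(X, y, sample_weight=s)         # s_i scales each point's error cost
```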
3 Experiment In this section, we introduce an example to see the benefits of FSVM in electric short-term load forecasting. The load forecasting accuracy is reported in terms of the mean absolute percentage error (MAPE) and mean square percentage error (MSE), defined by the following equations:

    MAPE = (1/n) ∑_{i=1}^{n} |Actual(i) − Forecast(i)| / Actual(i) × 100%        (9)

    MSE = (1/n) ∑_{i=1}^{n} ((A(i) − F(i)) / A(i))² × 100%        (10)
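Both measures translate directly into code; a small sketch:

```python
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs(actual - forecast) / actual) * 100      # Eq. (9)

def mse_pct(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(((actual - forecast) / actual) ** 2) * 100     # Eq. (10)
```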
The training and forecasting for the model were performed on the load and temperature data of one year (1 January 1999 to 30 December 1999) obtained from the YanTai Electric Network. In the experiment, the Gaussian kernel parameter δ² = 30, c = 100 and ε = 0.001, resulting in an effective number of parameters; these hyperparameters and kernel parameters were kept fixed out of sample, while the model parameters w and b (or α and b) were re-estimated each time.
In this paper, we forecast the load at 10 AM on working days from 2002/4/10 to 4/28 with the FSVM and SVM algorithms. The results are shown in Table 1 below. Moreover, we forecast the load at 10 AM on non-working days from 2002/8/12 to 9/30. The results are shown in Table 2.
From the above tables, we can see that the FSVM algorithm can provide an accurate load forecast. The forecasted results show that FSVM with a Gaussian kernel has much better potential in the field of time series prediction.
4 Conclusions A novel Fuzzy Support Vector Machines approach is proposed for STLF. As demonstrated in the experiment, FSVM forecasts significantly better than classical SVM. This motivates further research in this direction for future work.
References
1. Lin, C.-F., Wang, S.-D.: Fuzzy Support Vector Machines. IEEE Trans. on Neural Networks, Vol. 13, No. 2 (2002) 464–471
2. Vapnik, V.N.: Statistical Learning Theory. New York: Wiley (1998)
3. Schölkopf, B., Burges, C., Smola, A.: Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press (1999)
4. Smola, A.J.: Regression Estimation with Support Vector Learning Machines. Master's thesis, Technische Universität München (1996)
5. Mukherjee, S., Osuna, E., Girosi, F.: Nonlinear Prediction of Chaotic Time Series Using Support Vector Machines. Proceedings of IEEE NNSP '97, Amelia Island, FL (1997)
6. Smola, A., Schölkopf, B.: A Tutorial on Support Vector Regression. NeuroCOLT Tech. Rep. TR 1998-030, Royal Holloway College, London, U.K. (1998)
A Symbolic Approximate Reasoning Mazen El-Sayed and Daniel Pacholczyk University of Angers, 2 Boulevard Lavoisier, 49045 ANGERS Cedex 01, France {elsayed,pacho}@univ-angers.fr
Abstract. We study knowledge-based systems using symbolic many-valued logic. In previous papers we have proposed a symbolic representation of nuanced statements. Firstly, we have introduced a symbolic concept whose role is similar to the role of the membership function within a fuzzy context. Using this concept, we have defined linguistic modifiers. In this paper, we propose new deduction rules dealing with nuanced statements. More precisely, we present new Generalized Modus Ponens rules within a many-valued context. Keywords: Knowledge Management, Imprecision, Many-valued logic.
1 Introduction
The development of knowledge-based systems is a rapidly expanding field in applied artificial intelligence. The knowledge base is comprised of a database and a rule base. We suppose that the database contains facts representing nuanced statements, like "x is mα A", where mα and A are labels denoting respectively a nuance and a vague or imprecise term of natural language. The rule base contains rules of the form "if x is mα A then y is mβ B". Our work presents a symbolic-based model which permits a qualitative management of vagueness in knowledge-based systems. In dealing with vagueness, there are two issues of importance: (1) how to represent vague data, and (2) how to draw inferences using vague data. When the imprecise information is evaluated in a numerical way, fuzzy logic, introduced by Zadeh [5], is recognized as a good tool for dealing with the aforementioned issues and performing reasoning upon vague knowledge bases. A second formalism, which refers to multiset theory and a symbolic many-valued logic [3,4], is used when the imprecise information is evaluated in a symbolic way. In a previous paper [2], we proposed a symbolic model to represent nuanced statements. This model is based on multiset theory and a many-valued logic proposed by Pacholczyk [4]. In this paper, our basic contribution is to propose deduction rules dealing with nuanced information. For that purpose, we propose deduction rules generalizing the Modus Ponens rule in a many-valued logic context [4]. The first version of this rule was proposed in a fuzzy context by Zadeh [5] and was studied later by various authors [1]. In Section 2, we briefly present the basic concepts of the M-valued predicate logic which forms the backbone of our work. Section 3 briefly introduces the symbolic representation model previously proposed. In Section 4, we propose new Generalized Modus Ponens rules.
2 M-Valued Predicate Logic
Within a multiset context, to a vague term A and a nuance mα are associated respectively a multiset A and a symbolic degree τα . So, the statement “x is mα A” means that x belongs to multiset A with a degree τα . The M-valued predicate logic [4] is the logical counterpart of the multiset theory. In this logic, to each multiset A and a membership degree τα are associated a M-valued predicate A and a truth degree τα −true. In this context, the following equivalence holds: x is mα A ⇔ x ∈α A ⇔ “x is mα A” is true ⇔ “x is A” is τα −true. One supposes that the membership degrees are symbolic degrees which form an ordered set LM = {τα , α ∈ [1, M ]}. This set is provided with the relation of a total order: τα ≤ τβ ⇔ α ≤ β. We define in LM two operators ∧ and ∨ and a decreasing involution ∼ as follows: τα ∨ τβ = τmax(α,β) , τα ∧ τβ = τmin(α,β) and ∼ τα = τM +1−α . On this set, an implication → and a T-norm T are defined respectively as follows: τα → τβ = τmin(β−α+M,M ) and T (τα , τβ ) = τmax(β+α−M,1) . Example 1. For example, by choosing M=9, we can introduce: L9 ={not at all, little, enough, fairly, moderately, quite, almost, nearly, completely}.
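As an unofficial illustration of this degree algebra (ours, not the authors'), the operators translate directly into code with degrees represented by their indices 1…M:

```python
M = 9                                            # as in Example 1

def t_and(a, b):  return min(a, b)               # tau_a AND tau_b
def t_or(a, b):   return max(a, b)               # tau_a OR tau_b
def t_not(a):     return M + 1 - a               # involution ~tau_a
def t_impl(a, b): return min(b - a + M, M)       # implication tau_a -> tau_b
def t_norm(a, b): return max(b + a - M, 1)       # T-norm T(tau_a, tau_b)

L9 = ["not at all", "little", "enough", "fairly", "moderately",
      "quite", "almost", "nearly", "completely"]
print(L9[t_impl(7, 5) - 1])                      # 'almost': degree of tau_7 -> tau_5
```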
3 Representation of Nuanced Statements
Let us suppose that our knowledge base is characterized by a finite number of concepts Ci. A set of terms Pik is associated with each concept Ci, whose respective domain is denoted Xi. As an example, terms such as "small", "moderate" and "tall" are associated with the particular concept "size of men". We have assumed that some nuances of natural language must be interpreted as linguistic modifiers. In the following, we designate by mα a linguistic modifier. In previous papers [2], we have proposed a symbolic-based model to represent nuanced statements. In the following, we present a short review of this model. We first proposed a new method to symbolically represent vague terms. In this method, we assume that the domain of a vague term, denoted by X, is not necessarily a numerical scale. This domain is simulated by a "rule" (Figure 1) representing an arbitrary set of objects. Our basic idea has been to associate with each multiset Pi a symbolic concept which plays a role equivalent to that of the membership function in fuzzy set theory. For that, we have introduced a new concept, called a "rule", which has a geometry similar to an L-R membership function and whose role is to illustrate the graduality of membership in the multisets.
Fig. 1. The universe X
Fig. 2. A “rule” associated with Pi
By using the "rule" concept we have defined some linguistic modifiers. We use two types of linguistic modifiers. – Precision modifiers: they increase or decrease the precision of the basic term. We distinguish two types of precision modifiers: contraction modifiers and dilation modifiers. We use M6 = {mk | k ∈ [1..6]} = {exactly, really, ∅, more or less, approximately, vaguely}, which is totally ordered by j ≤ k ⇔ mj ≤ mk. – Translation modifiers: they operate both a translation and a precision variation on the basic term. We use T9 = {tk | k ∈ [1..9]} = {extremely little, very very little, very little, rather little, ∅, rather, very, very very, extremely}, totally ordered by k ≤ l ⇔ tk ≤ tl. The multisets tk Pi cover the domain X.
Fig. 3. Precision modifiers
Fig. 4. Translation modifiers
In this paper, we continue to develop our model for managing nuanced statements, focusing on the exploitation of nuanced statements.
4 Exploitation of Nuanced Statements
In this section, we are interested in proposing some generalizations of the Modus Ponens rule within a many-valued context [4]. We note that the classical Modus Ponens rule has the following form: if we know that {if "x is A" then "y is B" is true, and "x is A" is true}, we conclude that "y is B" is true. In a many-valued context, a generalisation of the Modus Ponens rule has one of two forms: (1) if we know that {if "x is A" then "y is B" is τβ-true and "x is A′" is τ-true} and that {A′ is more or less near to A}, what can we conclude for "y is B"; in other words, to what degree is "y is B" true? (2) if we know that {if "x is A" then "y is B" is τβ-true and "x is A′" is τ-true} and that {A′ is more or less near to A}, can we find a B′ such that {B′ is more or less near to B}, and to what degree is "y is B′" true? In this section, we propose new versions of the GMP rule in which we use new relations of nearness.
4.1 First GMP Rule
In Pacholczyk's versions of the GMP rule, the concept of nearness binding multisets A and A′ is modelled by a similarity relation, defined as follows: Definition 1. Let A and B be two multisets. A is said to be τα-similar to B, denoted A ≈α B, iff: ∀x | x ∈γ A and x ∈β B ⇒ min{τγ → τβ, τβ → τγ} ≥ τα.
This relation is (1) reflexive: A ≈M A, (2) symmetrical: A ≈α B ⇔ B ≈α A, and (3) weakly transitive: {A ≈α B, B ≈β C} ⇒ A ≈γ C with τγ ≥ T(τα, τβ), where T is a T-norm. By using the similarity relation to model the nearness binding between multisets, the inference rule can be interpreted as: {the more the rule and the fact are true} and {the more A′ and A are similar}, the more the conclusion is true. In particular, when A′ is more precise than A (A′ ⊂ A) but they are only very weakly similar, either no conclusion can be deduced or the conclusion deduced isn't as precise as one could expect. This is due to the fact that the similarity relation alone isn't able to model the nearness between A′ and A in a satisfactory way. For that reason, we add to the similarity relation a new relation, called the nearness relation, whose role is to define the nearness of A′ to A when A′ ⊂ A. In other words, it indicates the degree to which A′ is included in A. Definition 2. Let A ⊂ B. A is said to be τα-near to B, denoted A <α B, if and only if {∀x ∈ F(B), x ∈β A and x ∈γ B ⇒ τα → τβ ≤ τγ}. The nearness relation satisfies the following properties: (1) reflexivity: A <M A, and (2) weak transitivity: A <α B and B <β C ⇒ A <γ C with τγ ≤ min(τα, τβ). In the relation A <α B, the smaller the value of α, the more A is included in B. Finally, by using the similarity and nearness relations, we propose a first Generalized Modus Ponens rule.
Proposition 1. Let A and A′ be predicates associated with the concept Ci, and B a predicate associated with the concept Ce. Given the following assumptions: 1. it is τβ-true that if "x is A" then "y is B";
2. "x is A′" is τ-true with A′ ≈α A.
Then we deduce: "y is B" is τδ-true with τδ = T(τβ, T(τα, τ)). If A′ is such that A′ <α A, we deduce: "y is B" is τδ-true with τδ = T(τβ, τα → τ). Example 2. Let "really tall" ≈8 "tall" and "really tall" <8 "tall". If we have: - if "x is tall" then "its weight is important" is true, - "Pascal is really tall" is quite-true, then we can deduce: "Pascal's weight is really important" is almost-true.
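As an unofficial numeric check (ours, reusing the degree functions sketched in Section 2, and reading "true" as τ9 and "quite" as τ6 in L9), the nearness branch of Proposition 1 reproduces the degree of Example 2:

```python
beta, alpha, tau = 9, 8, 6                  # rule 'true', "really tall" <8 "tall", fact 'quite'-true
delta = t_norm(beta, t_impl(alpha, tau))    # tau_delta = T(tau_beta, tau_alpha -> tau)
print(L9[delta - 1])                        # 'almost', matching Example 2
```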
4.2 GMP Rules Using Precision Modifiers
In the previous paragraph we calculated the degree to which the conclusion of the rule is true. In the following, we present two new versions of the GMP rule in which the predicate of the conclusion obtained by the deduction process is not B but a new predicate B′ which is more or less near to B. More precisely, the new predicate is derived from B by using precision modifiers (B′ = mB). The first version assumes that the predicates A and A′ are more or less similar; in other words, A′ may be less precise or more precise than A. The second one assumes that A′ is more precise than A. Proposition 2. Given the following assumptions:
1. it is τβ -true that if “x is A” then “y is B”
2. "x is A′" is τ-true with A′ ≈α A. Let τθ = T(τβ, T(τα, τ)). If τθ > τ1 then there exists a τn(δ)-dilation modifier m, with τδ ≤ T(τα, τβ), such that "y is mB" is τ′-true with τ′ = τδ → τθ. Moreover, we have B ⊂ mB and mB ≈δ B.
This proposition proves that if we only know that A′ is more or less similar to A, without any supplementary information concerning its precision compared to A, the predicate of the conclusion obtained by the deduction process (mB) is less precise than B (i.e. B ⊂ mB) and more or less similar to B. Proposition 3. Given the following assumptions:
2. "x is A′" is τ-true with A′ <α A. Let τθ = T(τβ, τα → τ). If τθ > τ1 then there exists a τn(δ)-contraction modifier m, with τδ ≥ τβ → τα, such that "y is mB" is τ′-true with τ′ = T(τδ, τθ). Moreover, we have mB <δ B.
This proposition proves that from a predicate A′ which is more or less near to A we obtain a predicate mB which is more or less near to B. More precisely, if A′ is more precise than A then mB is more precise than B. Example 3. Given the following data: - if "x is tall" then "its weight is important" is true, - "Pascal is really tall" is moderately-true, - "Jo is more or less tall" is moderately-true. Then we can deduce: "Pascal's weight is really important" is moderately-true, and "Jo's weight is more or less important" is moderately-true.
5 Conclusion
In this paper, we have proposed a symbolic-based model dealing with nuanced information. This model is inspired by the representation method of fuzzy logic. In previous papers, we proposed a new representation method for nuanced statements. In this paper, we proposed deduction rules dealing with nuanced statements and presented new Generalized Modus Ponens rules.
References 1. B. Bouchon-Meunier and J. Yao. Linguistic modifiers and imprecise categories. Int. J. of Intelligent Systems, 7:25–36, 1992. 2. M. El-Sayed and D. Pacholczyk. A qualitative reasoning with nuanced information. 8th European Conference on Logics in Artificial Intelligence (JELIA 02), 283–295, Italy, 2002.
3. M. De Glas. Knowledge representation in a fuzzy setting. Technical Report 48, LAFORIA, 1989. 4. D. Pacholczyk. Contribution au traitement logico-symbolique de la connaissance. PhD thesis, University of Paris VI, 1992. 5. L. A. Zadeh. A theory of approximate reasoning. In J. Hayes, D. Michie and L. I. Mikulich (eds), Machine Intelligence, 9:149–194, 1979.
Intuition in Soft Decision Analysis Kankana Chakrabarty School of Mathematics, Statistics and Computer Science The University of New England Armidale 2351, NSW, Australia [email protected]
Abstract. The notions of fuzzy sets and intuitionistic fuzzy sets (IFSs) play significant roles in knowledge representation problems where mathematical complexity and practical inconvenience arise from our inability to differentiate events exactly in real situations and thus to define instrumental notions in precise form. This paper discusses the notions of fuzzy sets and IFSs and describes how these tools can be used for the explanation of uncertainty as well as for modelling intuitionistic behavioural patterns occurring in decision analysis and systems designing problems.
1 Introduction
The wisdom associated with philosophically grounded knowledge complements the power of scientific analysis and representational techniques. In order to conceive, analyse, and design intelligent systems, soft computing methodologies are applied. These methodologies support knowledge representation under uncertainty and imprecision, and tolerate imperfection in knowledge, partial truth, and vagueness in order to produce intelligent explanations and more human-like results. The notions of fuzzy sets [4] and intuitionistic fuzzy sets [1] are found to be extremely useful in dealing with classes with unsharp boundaries involving non-crisp and non-deterministic circumstances. The present paper discusses these tools and indicates how they can be used for modelling intuitionistic behaviour patterns in decision analysis and systems designing problems.
2 Fuzzy Sets and Intuitionistic Fuzzy Sets
For decades, philosophers have realized that the notion of exactness is somewhat artificial and forced. Humans became intelligent in the context of collaborative creation around them. Intelligent agents could come to associate themselves in the modification of their world in order to produce a mutually useful and productive simulated psychological reality which can accommodate certainty, uncertainty, and hesitation. The contemporary concern about knowledge representation under uncertainty has initiated some useful extensions of
elementary set theory, such as the concepts of 'Fuzzy Set Theory' by Zadeh [4], 'Rough Set Theory' by Pawlak [3], and 'IFS Theory' by Atanassov [1]. Human cognition and interaction with the outer world involve objects that are not sets in the classical sense, but classes with unsharp boundaries in which the transition from membership to non-membership is gradual. Fuzzy sets act as metaphors for ordinary thought, and they also take an active part in data analysis and the construction of theories. Fuzzy sets deal with techniques which lend a form of mathematical precision to human thought processes that are in many ways imprecise by the standard codes of classical mathematics, and the field reflects itself as a multi-dimensional field of inquiry, contributing to a wide spectrum of areas ranging from para-mathematics to human perception and judgement. In a situation where it is difficult to determine the exact boundaries of a class, each element of the class is evaluated by a measure which expresses its place and role in the class. This measure is called the grade of membership in the given class, and the class in which each element is characterized by its membership grade is called a fuzzy set. Let X be a classical set of objects called the universe, and let x be any arbitrary element of X. Membership in a classical subset A of X is often viewed through a characteristic function µA defined as µA : X → V, where V = {0, 1} is called the valuation set. If the valuation set V is allowed to be the real interval [0, 1], then A is called a fuzzy set, and µA(x) is called the grade of membership of x in A. The closer the value of µA(x) to 1, the more x belongs to A. Here the membership value µA(x) can be interpreted as the 'degree of compatibility' of the predicate associated with A and the object x, or the degree of possibility that a phenomenon x is restricted to A. Atanassov [1] generalized the notion of fuzzy sets by introducing the concept of IFSs. For a fuzzy set F of a universe U, the degree of belongingness of x (∈ U) is a real number µF(x) in [0, 1], and consequently the degree of non-belongingness is 1 − µF(x). In the theory of fuzzy sets, it is assumed that the indeterministic part of each element is nil as far as the notion of belongingness is concerned. But interestingly, in many real-life applications it has been observed that the indeterministic part is not always nil, and the evaluation of each element x is to be done via two pieces of information, the 'degree of belongingness' µF(x) and the 'degree of non-belongingness' νF(x), while the remaining part 1 − {µF(x) + νF(x)} can be considered as the psychological hesitation part. Let a set E be fixed. An Intuitionistic Fuzzy Set (IFS) [1] A in E is an object having the form A* = {< x, µA(x), νA(x) >: x ∈ E}, where the functions µA : E → [0, 1] and νA : E → [0, 1] define respectively the degree of membership and the degree of non-membership of the element x ∈ E to the set A, and for every x ∈ E, 0 ≤ µA(x) + νA(x) ≤ 1. If πA(x) = 1 − {µA(x) + νA(x)}, then πA(x) is the degree of indeterminacy or hesitation on the belongingness of the element x to the IFS A.
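A minimal sketch (ours, purely illustrative) of an intuitionistic fuzzy element carrying these three degrees:

```python
from dataclasses import dataclass

@dataclass
class IFSElement:
    mu: float          # degree of membership
    nu: float          # degree of non-membership
    def __post_init__(self):
        assert 0.0 <= self.mu + self.nu <= 1.0, "IFS constraint violated"
    @property
    def pi(self):
        return 1.0 - (self.mu + self.nu)    # hesitation degree

x = IFSElement(mu=0.6, nu=0.3)
print(x.pi)                                 # ~0.1, the indeterministic part
```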
3 Effect of Intuition in Soft Decision Analysis
In soft computing techniques, the variables are often found to be extremely difficult to quantify, and the dependencies are found to be too ill defined
in order to admit precise characterisation in terms of differential equations. As every crisp set (CS) can be described in terms of fuzzy sets (FS), every fuzzy set can again be described in terms of IFSs. Hence IFSs are considered generalizations of FSs. The concept of an IFS provides an alternative technical approach when the amount and nature of the available information is not sufficient for the definition of an imprecise concept by the means of a conventional FS. The very moment a decision is taken by an agent, it completes the contextual effect of its analytical behaviour pattern, often forcing a concrete perception of the situation. There always remains the underlying complex task of quantifying and simulating human behaviour as exactly as possible, because the analytical factors, the judgement factors, and the hesitation factors closely related to the cognition process can vary over time, and hence the time-space frame of reference can play an interesting role in modelling these time-varying n-dimensional factors. An inherent part of the nature of a concept is that it involves a process for distinguishing between exact and inexact applications. Hence the dimensions of a concept might include notions at the personal, conscious, and subconscious levels. The fact that an agent's justifications following a pre-defined set of rules come to an end does not always necessarily imply that the explanation of its reaction at any stage of the process comes to an end at the same point.
Fig. 1. IFS ⊃ FS ⊃ CS
In a systems development project, if after conducting a series of technical, operational, and economical feasibility analysis experiments an intelligent agent comes to the conclusion that the prototyping phase should commence, then, depending upon the observational and analytic patterns, the objective should be to quantify the certainty factor, the uncertainty factor, and the intuitionistic factor associated with this decision process. Let ω = {φt, φo, φe} be the set of conceptual outcomes of the feasibility experiments. In this case, ω acts as a collective model of intelligent concepts. In the corresponding decision p-space, the decision function λ can take up a number of linguistic values, each of which depends upon the analytical behaviour pattern of the members of ω; in turn, these patterns can also be functional in nature. Any fuzzy set φ : ω → [0, 1] can represent the contextual certainty factor associated with the patterns. But the contextual certainty factor alone is not sufficient to model the decision pattern and reach a possible linguistic value of λ. Hence we consider the IFS represented by µφ : ω → [0, 1], νφ : ω → [0, 1], where µφ and νφ represent the
degree of belongingness and the degree of non-belongingness of φ to the concept respectively. This grading is assumed to be completely contextual in nature. In fact, µφ and νφ can be effectively used to measure the contextual certainty and uncertainty factors associated with the feasibility analysis pattern of the expert. Thus the formalization of a methodology for quantifying the associated certainty and uncertainty is possible using this procedure. Also, πφ will represent the hesitation factor associated with the feasibility analysis pattern. Hence the quantification of the psychological hesitation associated with a multi-objective analytical behaviour pattern can thus be explained for a decision function associated with a decision p-space. Finally we have

    λ(φt, φo, φe, ui, cj)_{i∈I, j∈J} = f(µφ, νφ, ui, cj)_{i∈I, j∈J} = p(πφ, ui, cj)_{i∈I, j∈J}

where {ui}_{i∈I} and {cj}_{j∈J} represent the sets of uncontrollable and controllable variables of the decision p-space respectively. Clearly the function p will take up a linguistic value explaining the next state of the process. A comparative study of the hesitation factors associated with the behaviours of a series of intelligent agents under the influence of the same sets of controllable and uncontrollable variables can show the uncertain knowledge representation patterns and the outcome of the decision process module inside the boundaries of the concerned space. Thus, in this paper, we have discussed the notions of fuzzy sets and intuitionistic fuzzy sets with reference to their relevance and applicability to cognitive modelling in decision analysis situations associated with psychological hesitation. The explanation of the next state of the process depends upon the analytical behaviour patterns of the feasibility process and understanding. This can model a system that objectively improves an existing system and can form links between the program, the process, and the stages of the process.
References
1. Atanassov, K.: Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems 20(1) (1986) 87–96
2. Clark, A., Millican, P. (eds.): Connectionism, Concepts, and Folk Psychology. Oxford University Press (1996)
3. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers (1991)
4. Zadeh, L.A.: Fuzzy Sets. Information and Control 8 (1965) 338–353
5. Zetenyi, T. (ed.): Fuzzy Sets in Psychology. North Holland (1998)
6. Zimmermann, H.J., Zadeh, L.A., Gaines, B.R. (eds.): Fuzzy Sets in Decision Analysis. North Holland (1984)
Ammunition Supply Decision-Making System Design Based on Fuzzy Control Deyong Zhao, Xinfeng Wang, and Jianguo Liu Department of Management Engineering, Ordnance Engineering College, Shijiazhuang 050003 [email protected]
Abstract. This paper puts forward the basic ideas and goals of designing an assistant decision-making system for ammunition supply based on fuzzy control, together with the overall structure and functions of the system, by applying the basic theory and methods of fuzzy control. This lays the foundation for developing the assistant decision-making system.
Surveying the current state and development trends of ammunition control systems at home and abroad, and seeking a practical and effective method to improve the level of ammunition supply control, is the chief precondition for modernizing ammunition support and supply. The research on an ammunition supply assistant decision-making system based on fuzzy control (FC-ASADS) is put forward to meet this requirement. Aiming at the automation and digitization of ammunition supply and the requirements of ammunition supply control, we develop an ammunition supply assistant decision-making system that satisfies the organizational coordination and decision requirements of ammunition supply in wartime.
1 Design Goal

Applying fuzzy control, we establish a general headquarters-military region-unit army ammunition supply assistant decision-making system, which provides theoretical support for the automation and digitization of ammunition supply and a general development scheme covering resource deployment, organizational coordination, command and control, and efficiency assessment. The basic goal of the assistant decision-making system is to offer a general development scheme for establishing an FC-ASADS that realizes all the functions mentioned above. The structure of the assistant decision-making system based on fuzzy control basically resembles that of a generic computer-controlled system, except that the controller is a fuzzy controller. Generally, a fuzzy control system is composed of a fuzzy controller, input and output interfaces, an examining (measuring) device, an executive mechanism, and the controlled object. The structure of the fuzzy control system is shown in Fig. 1.
Fig. 1. Structural chart of the fuzzy control system (the reference input and the fed-back output are compared at the input; the signal then passes through the input interface, the fuzzy controller, the output interface, the executive mechanism, and the controlled object to produce the output, which the examining device feeds back).
2 Systemic Structure and Function Design

The research on FC-ASADS mostly develops in four basic fields: a research subsystem on ammunition supply systems at home and abroad, a research and analysis subsystem on the factors of ammunition supply control, the fuzzy controller subsystem, and the system service subsystem.

2.1 Research Subsystem on Ammunition Supply Systems at Home and Abroad

The major assignment of this subsystem is, on the one hand, to study the current state and development trends of ammunition supply control systems in foreign armies and, on the other hand, to study the historical and current basic circumstances in our army. These two lines of research offer background information and supporting theory, and at the same time offer methodological guidance for realizing the automation and digitization of ammunition supply control. The concrete aims are: to become thoroughly acquainted with the current state of other armies, especially the American army, and to analyze their development trends in detail; and to study the development history and current circumstances of ammunition supply control in our army. On this foundation, we can establish the need for automation and digitization of ammunition supply control, the working approach, and the overall construction goal.

2.2 Research Subsystem on Ammunition Supply Control Factors

This subsystem comprehensively analyzes the basic factors that affect ammunition supply control, in order to find complete and exact input variables for the fuzzy controller of ammunition supply control. Because many factors affect ammunition supply control, and each factor's degree of influence differs, and because we aim to design the fuzzy controller simply and exactly, we need to analyze the influencing factors in detail, separate them into different levels, and deal with them according to their degree of influence. For a MIMO ammunition supply system based on fuzzy control, it is very difficult to establish control rules for a fuzzy controller with many variables. Therefore, we adopt the idea of dimension reduction to realize a layered multi-variable fuzzy controller.
2.3 Fuzzy Controller Subsystem

The fuzzy controller subsystem is the core of the research on the ammunition supply control system based on fuzzy control. By function, the fuzzy controller is composed of four parts: the fuzzification interface, the knowledge base, the inference mechanism, and the defuzzification interface, as shown in Fig. 2.

Fig. 2. Fuzzy controller sketch map (input → fuzzification interface → inference mechanism, which consults the knowledge base → defuzzification interface → output).
Fuzzification interface. Only after fuzzification can the input of the fuzzy controller be used to compute the fuzzy control output, so this is the input interface of the fuzzy controller. Its major function is to translate a real (crisp) input variable into a fuzzy vector. The functions of the fuzzification interface are: (1) scaling of the measurement range; (2) fuzzification proper, i.e., translating an exact variable into a fuzzy variable: after the input signal is mapped to a point on the corresponding domain, it is translated into a fuzzy subset on this domain.

Knowledge base. The knowledge base contains the knowledge of the application field and is composed of a database and a rule base. (1) Database. The database provides all essential definitions. The domains corresponding to the input and output variables, and all definitions of the fuzzy subsets defined on these domains and used in defining the rule base, are stored in the database. If a domain is discrete, the membership degree of each fuzzy subset at every discrete point is stored; if a domain is continuous, the membership function of each fuzzy subset is stored. During inference by the fuzzy controller, the database provides the essential data for the inference mechanism; it also provides the essential domain data for fuzzification and defuzzification. Note that the measured data of the input and output variables are not stored in the database. (2) Rule base. The rule base comprises the fuzzy control rules. These rules are based on the control experience accumulated by operators over a long time and on the knowledge of professional experts; they constitute a knowledge model, not a mathematical model, of the controlled object, expressed in linguistic form according to human intuitive reasoning. Whether this model is exact, in other words whether it summarizes the successful experience of operators and the knowledge of experts exactly, determines the control capability of the fuzzy controller.
Inference mechanism. The inference mechanism adopts some fuzzy inference method: from the input at each sampling time and the fuzzy control rules, it infers the control output of the fuzzy controller. Generally, the inference mechanism used in a fuzzy controller is simpler than that used in a typical expert system, because in a fuzzy controller the conclusion of one rule cannot be used as the premise of another rule. According to the fuzzy control rules, the inference mechanism infers the control action from the input at every sampling time; the group of fuzzy conditional sentences as a whole defines a fuzzy relation between input and output.

Defuzzification interface. Opposite to the fuzzification interface, the defuzzification interface translates a fuzzy variable into an exact variable. When the computer executes the fuzzy control calculation, the fuzzy control action obtained by fuzzy inference must be translated into an exact variable that the executive mechanism can accept. The defuzzification interface has two main functions: (1) scaling of the measurement range; (2) defuzzification.

2.4 System Service Subsystem

This subsystem consists of two parts: the network system service and the system service of the executive mechanism in the control system. The network system service realizes the connection and resource sharing of the general headquarters-military region-unit networks. The service of the executive mechanism mainly concerns how to gather the input data of the fuzzy controller and, after the actual supply capacity is computed, how to ensure the actual delivery of ammunition.

Network system service. The overall task is as follows: starting from actual requirements, we study the systemic framework and composition of the fuzzy control system for ammunition supply at the strategic, campaign, and tactical levels, and thus obtain the overall network framework, composed of networks at all levels. It mainly includes selecting and designing the LANs of the general headquarters and military regions, the administration of networks at various levels, the basic services in the network, etc.

Executive mechanism service. On the one hand, the executive mechanism service relies on existing network conditions to gather the input data required by all levels of the fuzzy control system; on the other hand, how to realize actual control after the fuzzy controller works out the actual ammunition supply is also a problem that must be settled in this subsystem. For this problem, our goal is to carry out some preliminary theoretical research and discussion, for example the analysis of MOADS, PLS, and GPS. For data gathering, we plan to rely on the equipment support command automation system already existing in the army.
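As a hedged illustration of the four-part controller of Sect. 2.3 (not part of the original design), the following is a minimal Python sketch of a single-input Mamdani-style fuzzy controller with triangular membership functions, min inference, max aggregation, and centroid defuzzification; the fuzzy sets, rule table, and variable names are illustrative assumptions.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

# Fuzzification interface: map a crisp normalized demand into membership degrees.
def fuzzify(demand):
    return {"low": tri(demand, -0.5, 0.0, 0.5),
            "medium": tri(demand, 0.0, 0.5, 1.0),
            "high": tri(demand, 0.5, 1.0, 1.5)}

# Rule base: IF demand is <label> THEN supply is <label> (assumed rules).
RULES = {"low": "low", "medium": "medium", "high": "high"}

# Output fuzzy sets over the normalized supply domain (the "database").
supply_domain = np.linspace(0.0, 1.0, 201)
SUPPLY_SETS = {"low": tri(supply_domain, -0.5, 0.0, 0.5),
               "medium": tri(supply_domain, 0.0, 0.5, 1.0),
               "high": tri(supply_domain, 0.5, 1.0, 1.5)}

def control(demand):
    """Inference (min firing, max aggregation) + centroid defuzzification."""
    degrees = fuzzify(demand)
    aggregated = np.zeros_like(supply_domain)
    for in_label, out_label in RULES.items():
        clipped = np.minimum(degrees[in_label], SUPPLY_SETS[out_label])
        aggregated = np.maximum(aggregated, clipped)
    return aggregated @ supply_domain / (aggregated.sum() + 1e-12)

print(control(0.7))  # crisp supply level for a demand of 0.7
```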
The Concept of Approximation Based on Fuzzy Dominance Relation in Decision-Making
Yunxiang Liu 1,2, Jigui Sun 1, and Shengsheng Wang 1
1 Institute of Computer Science, Jilin University, Changchun 130012
2 Changchun Institute of Technology, Changchun 130021
Abstract. A method for decision-making systems deriving from the combination of rough sets and fuzzy sets is presented. We discuss a multicriteria classification that differs from usual classification problems; the main change is the substitution of the indiscernibility relation by a dominance relation. We propose the idea of a fuzzy dominance relation for decision-making, give the concept of an L-fuzzy set, and build up a rough approximation of upward and downward cumulated sets by the fuzzy dominance relation.
1 Introduction

In general, decisions are based on some characteristics of objects [1,2]. For example, when buying a car, the decision can be based on such characteristics as price, maximum speed, fuel consumption, color, country of production, etc.; we call these characteristics attributes. Some of them may have ordinal properties expressing preference scales, and we call these criteria [3,4]. Moreover, decisions may be ordinal: the decision maker (DM) could be interested in a classification of cars into three categories, acceptable, probably acceptable, and non-acceptable. This type of classification is called sorting. This paper discusses a multicriteria classification that differs from usual classification problems, since it takes into account preference orders in the description of objects by condition and decision attributes, combining rough sets and fuzzy sets. We propose the idea of a fuzzy dominance relation for decision-making, and the concept of an L-fuzzy set is given.
2 Preliminaries

Definition 2.1. An information system is the 4-tuple IS = (U, A, V, f), where U is a finite set of objects (the universe); $A = \{a_1, a_2, \dots, a_m\}$ is a finite set of attributes, which can be divided into condition attributes (a set $C \ne \emptyset$) and decision attributes (a set $D \ne \emptyset$) with $C \cup D = A$ and $C \cap D = \emptyset$; $V_a$ is the domain of the attribute a, with $V = \bigcup_{a \in A} V_a$; and $f: U \times A \to V$ is a total function, called the information function, such that $f(x,a) \in V_a$ for each $a \in A$, $x \in U$. To each subset of attributes P is associated an indiscernibility relation on U, denoted $R_P$:

$R_P = \{(x,y) \in U \times U : f(x,a) = f(y,a)\ \forall a \in P\}.$

If $(x,y) \in R_P$, the objects x and y are said to be P-indiscernible.
Definition 2.2. Let $U/R = \{Cl_t, t \in M\}$, $M = \{1, \dots, n\}$, be a set of classes of U such that each $x \in U$ belongs to one and only one class $Cl_t \in U/R$, and suppose $(U/R, \le)$ is a partial preorder. The upward and downward cumulated sets are defined, respectively, as [3]

$Cl_t^{\ge} = \bigcup_{s \ge t} Cl_s, \qquad Cl_t^{\le} = \bigcup_{s \le t} Cl_s.$

Observe that $Cl_1^{\ge} = Cl_n^{\le} = U$, $Cl_n^{\ge} = Cl_n$, and $Cl_1^{\le} = Cl_1$.

Example. Consider an example inspired by the evaluation of students in a high school. The director of the school wants to assign students to three classes. According to the decision attribute, the students are divided into three preference-ordered classes: $Cl_1$ = {bad}, $Cl_2$ = {medium}, $Cl_3$ = {good}. We obtain the following upward and downward cumulated sets:
· $Cl_1^{\le} = Cl_1$, i.e. the class of (at most) bad students,
· $Cl_2^{\le} = Cl_1 \cup Cl_2$, i.e. the class of at most medium students,
· $Cl_2^{\ge} = Cl_2 \cup Cl_3$, i.e. the class of at least medium students,
· $Cl_3^{\ge} = Cl_3$, i.e. the class of (at least) good students.
3 Approximation Based on Fuzzy Dominance Relations

Definition 3.1. Let X be a fuzzy set defined on a finite universe U, let $\mu_X$ be the membership function of X, and let R be a fuzzy binary equivalence relation defined on U. The lower approximation of X can be defined as a fuzzy set whose membership function associates with each $x \in U$ the credibility that "for each $y \in U$, y is not in relation R with x and/or y belongs to X", i.e.,

$\mu(x, \underline{A}_R(X)) = T_{y \in U}\, S(N(R(y,x)), \mu_X(y)).$

The upper approximation of X can be defined in turn as a fuzzy set whose membership function associates with each $x \in U$ the credibility that "there is at least one $y \in U$ which is in relation R with x and which belongs to X", i.e.,

$\mu(x, \overline{A}_R(X)) = S_{y \in U}\, T(R(y,x), \mu_X(y)),$

where T denotes a t-norm, S a t-conorm, and N a negation.
Definition 3.2. Let $S_q$ be a fuzzy outranking relation on U with respect to criterion $q \in C$, i.e. $S_q: U \times U \to [0,1]$, where $S_q(x,y)$ represents the credibility of the proposition "x is at least as good as y with respect to criterion q". A fuzzy dominance relation on U (denoted $D_P(x,y)$) can be defined for each $P \subseteq C$ as

$D_P(x,y) = T_{q \in P}(S_q(x,y)).$

Given $(x,y) \in U \times U$, $D_P(x,y)$ represents the credibility of the proposition "x outranks y with respect to each criterion q from P". Suppose that $S_q$ is a fuzzy partial T-preorder, i.e. that it is reflexive ($S_q(x,x) = 1$ for each $x \in U$) and T-transitive; then the fuzzy dominance relation $D_P$ is also a partial T-preorder.

Definition 3.3. Let U be a nonempty set and $\langle L, \le \rangle$ a lattice. An L-fuzzy set $Cl_t$ on U is an object having the form

$Cl_t = \{\langle x, \mu_{Cl_t}(x), \nu_{Cl_t}(x)\rangle : x \in U\},$

where the functions $\mu_{Cl_t}: U \to L$ and $\nu_{Cl_t}: U \to L$ denote the degree of membership (namely $\mu_{Cl_t}(x)$) and the degree of nonmembership (namely $\nu_{Cl_t}(x)$) of each element $x \in U$ to the set $Cl_t$, respectively, and $0 \le \mu_{Cl_t}(x) \le \neg\nu_{Cl_t}(x)$ for each $x \in U$, where $\neg: L \to L$ is an order-reversing operation in $\langle L, \le \rangle$. For simplicity, we shall write $Cl_t = \langle x, \mu_{Cl_t}, \nu_{Cl_t}\rangle$ for $Cl_t = \{\langle x, \mu_{Cl_t}(x), \nu_{Cl_t}(x)\rangle : x \in U\}$.

Definition 3.4. Let U be a nonempty set and $\langle L, \le \rangle$ a complete lattice whose least and greatest elements are denoted by 0 and 1, respectively, with an order-reversing operation $\neg: L \to L$, and consider a fuzzy rough set $(\underline{A}(Cl_t^{\ge}), \overline{A}(Cl_t^{\ge}))$ with membership functions $\mu_{\underline{A}(Cl_t^{\ge})}: U \to L$ and $\mu_{\overline{A}(Cl_t^{\ge})}: U \to L$; for simplicity we write $\mu(x, \underline{A}(Cl_t^{\ge}))$ and $\mu(x, \overline{A}(Cl_t^{\ge}))$. The P-lower and P-upper approximations of $Cl_t^{\ge}$ with respect to $P \subseteq C$ are fuzzy sets in U whose membership functions, denoted $\mu(x, \underline{A}_P(Cl_t^{\ge}))$ and $\mu(x, \overline{A}_P(Cl_t^{\ge}))$, are defined as

$\mu(x, \underline{A}_P(Cl_t^{\ge})) = T_{y \in U}(S(N(D_P(y,x)), \mu_{Cl_t^{\ge}}(y))),$
$\mu(x, \overline{A}_P(Cl_t^{\ge})) = S_{y \in U}(T(D_P(x,y), \mu_{Cl_t^{\ge}}(y))).$

$\mu(x, \underline{A}_P(Cl_t^{\ge}))$ represents the credibility of the proposition "for all $y \in U$, y does not dominate x with respect to the criteria from P and/or y belongs to $Cl_t^{\ge}$", while $\mu(x, \overline{A}_P(Cl_t^{\ge}))$ represents the credibility of the proposition "there is at least one $y \in U$ dominating x with respect to the criteria from P which belongs to $Cl_t^{\ge}$".

Analogously, the P-lower and P-upper approximations of $Cl_t^{\le}$ with respect to $P \subseteq C$, denoted $\mu(x, \underline{A}_P(Cl_t^{\le}))$ and $\mu(x, \overline{A}_P(Cl_t^{\le}))$, can be defined as

$\mu(x, \underline{A}_P(Cl_t^{\le})) = T_{y \in U}(S(N(D_P(y,x)), \mu_{Cl_t^{\le}}(y))),$
$\mu(x, \overline{A}_P(Cl_t^{\le})) = S_{y \in U}(T(D_P(x,y), \mu_{Cl_t^{\le}}(y))),$

with the analogous readings in terms of credibility for $Cl_t^{\le}$.
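A minimal computational sketch of Definitions 3.2 and 3.4 follows, assuming min as the t-norm, max as the t-conorm, and N(a) = 1 − a; the universe, outranking degrees, and the membership values of the upward union are illustrative assumptions.

```python
# Illustrative fuzzy outranking degrees S_q(x, y) on three objects and two
# criteria; the numbers are assumptions for demonstration only.
U = ["x1", "x2", "x3"]
S_q = {"q1": {(x, y): 1.0 if x == y else 0.6 for x in U for y in U},
       "q2": {(x, y): 1.0 if x == y else 0.8 for x in U for y in U}}

T = min                  # t-norm
S = max                  # t-conorm
N = lambda a: 1.0 - a    # negation

def D(P, x, y):
    """Fuzzy dominance: D_P(x, y) = T-norm over q in P of S_q(x, y)."""
    return T(S_q[q][(x, y)] for q in P)

# Assumed membership of the upward union Cl_t^{>=}.
mu_up = {"x1": 1.0, "x2": 0.5, "x3": 0.0}

def lower_approx(P, x):
    """mu(x, lower A_P(Cl_t^{>=})) = T_y S(N(D_P(y, x)), mu(y))."""
    return T(S(N(D(P, y, x)), mu_up[y]) for y in U)

def upper_approx(P, x):
    """mu(x, upper A_P(Cl_t^{>=})) = S_y T(D_P(x, y), mu(y))."""
    return S(T(D(P, x, y), mu_up[y]) for y in U)

for x in U:
    print(x, lower_approx(("q1", "q2"), x), upper_approx(("q1", "q2"), x))
```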
4 An Example
Let us consider an example concerning six students described by means of four attributes in Table 4.1. Using the set of attributes P = {A1, A2, A3} and the students having a global evaluation "good" and "bad", i.e. Y1 = {x1, x3, x6} and Y2 = {x2, x4, x5}, we obtain the following results [4]:

$IR_P(\underline{C}^{good}_{A_1}, Y_1) = IR_P(\overline{C}^{good}_{A_1}, Y_1) = 1$, $IR_P(\underline{C}^{medium}_{A_1}, Y_1) = 0$, $IR_P(\overline{C}^{medium}_{A_1}, Y_1) = 0.7$,
$IR_P(\underline{C}^{bad}_{A_1}, Y_1) = IR_P(\overline{C}^{bad}_{A_1}, Y_1) = 0$, $IR_P(\underline{C}^{bad}_{A_1}, Y_2) = IR_P(\overline{C}^{bad}_{A_1}, Y_2) = 1$,
$IR_P(\underline{C}^{\le bad}_{A_1}, Y_2) = IR_P(\overline{C}^{\le bad}_{A_1}, Y_2) = 1$, $IR_P(\underline{C}^{\le medium}_{A_1}, Y_2) = 0.25$, $IR_P(\overline{C}^{\le medium}_{A_1}, Y_2) = 1$,
$IR_P(\underline{C}^{\ge medium}_{A_1}, Y_1) = 0.4$, $IR_P(\overline{C}^{\ge medium}_{A_1}, Y_1) = 1$.
Table 4.1. Data table with examples of classification

Object   A1       A2       A3       A4
x1       good     good     bad      good
x2       medium   bad      bad      bad
x3       good     medium   good     good
x4       bad      bad      bad      good
x5       bad      bad      medium   bad
x6       bad      bad      good     good
5 Conclusion

In this paper, we discussed a rough set method for ordering problems whose purpose is to approximate a partition of some reference objects described by criteria and by attributes. It builds up a rough approximation of upward and downward cumulated sets by a dominance relation and by a fuzzy dominance relation. This new approach enables the consideration of attributes with preference-ordered domains as well as preference-ordered classes; the main idea is the substitution of the indiscernibility relation by a dominance relation. We presented some new basic concepts and discussed them with an example. These results should be very helpful for intelligent decision-making.
References
1. Pal, S.K., Skowron, A.: Rough Fuzzy Hybridization: A New Trend in Decision-Making. Springer-Verlag Singapore Pte. Ltd. (1999)
2. Wang Guoyin: The Extension of Rough Sets Theory in Uncertainty Information System. Journal of Computer Research and Development 39(10) (2002) 1238–1243
3. Greco, S., Matarazzo, B., Slowinski, R.: Rough Approximation by Dominance Relation. International Journal of Intelligent Systems 17 (2002) 153–171
4. Liu Yunxiang, Sun Jigui: Fuzzy approximation in intelligent decision making. In: Zhongzhi Shi, Qing He (eds.), Proceedings of the International Conference on Intelligent Information Technology, Beijing (2002) 169–172
An Image Enhancement Arithmetic Research Based on Fuzzy Set and Histogram
Liang Ming, Guihai Xie, and Yinlong Wang
Department of Control Systems Engineering, Ordnance Engineering College, Shijiazhuang, China, 050003
[email protected]
Abstract. This paper discusses a histogram-based fuzzy enhancement algorithm for image enhancement, using fuzzy set theory together with the histogram. Combining the statistical character of the histogram with the flexibility of fuzzy sets, the algorithm can freely enhance an interesting image area, in contrast to enhancement algorithms that use the histogram alone or that depend on a hard-to-select threshold value. The paper presents the underlying theory and the detailed procedure of the algorithm, analyzes how to select the involved parameters $x_c$, r and $F_e$, and gives an example of its application.
1 Image Fuzzy Character Plane

According to the concept of a fuzzy subset, an M×N image X with L gray levels can be looked on as a fuzzy dot matrix, formulated as

$X = \bigcup_m \bigcup_n p_{mn}/x_{mn}, \quad m = 1, 2, \dots, M;\ n = 1, 2, \dots, N. \qquad (1)$

In (1), $p_{mn}/x_{mn}$ means that the membership value of the fuzzy dot set at pixel (m, n) is $p_{mn}$ ($0 \le p_{mn} \le 1$). We take the gray level of each pixel as the fuzzy character of the image [1]. If $x_{mn}$ is the gray level of pixel (m, n) and $x_{\max}$ is the largest gray level, the fuzzy value $p_{mn}$ can be computed from $x_{mn}$ by

$p_{mn} = G(x_{mn}) = \left(1 + \frac{x_{\max} - x_{mn}}{F_d}\right)^{-F_e}, \quad m = 1, \dots, M;\ n = 1, \dots, N, \qquad (2)$

where $F_d$ and $F_e$ are called the reciprocal fuzzy factor and the exponential fuzzy factor, respectively; their values affect the degree of fuzziness in the fuzzy character plane P, which is composed of all the $p_{mn}$ ($m = 1, \dots, M;\ n = 1, \dots, N$). The $p_{mn}$ in equation (2) denotes how closely the gray level of pixel (m, n) approaches the largest gray level.
2 Fuzzy Enhancement Algorithm

The fuzzy enhancement algorithm maps the image from the spatial domain into the fuzzy character plane using the fuzzy factors $F_d$ and $F_e$, changes the gray levels of pixels through the $p_{mn}$, and thereby enhances the interesting image area. The main tool used for enhancement is the fuzzy contrast intensification operator (INT). Applying contrast enhancement to a fuzzy set A yields another fuzzy set A':

$A' = INT(A), \qquad (3)$

whose membership function is

$\mu_{A'}(x) = \mu_{INT(A)}(x) = \begin{cases} 2[\mu_A(x)]^2, & 0 \le \mu_A(x) \le 0.5, \\ 1 - 2[1 - \mu_A(x)]^2, & 0.5 < \mu_A(x) \le 1. \end{cases} \qquad (4)$

This operation pushes values of $\mu_A(x)$ above 0.5 further up and values below 0.5 further down, which reduces the fuzziness of the fuzzy set A. For convenience, we write $T_1$ for this operation:

$\mu_{A'}(x) = T_1(\mu_A(x)) = \begin{cases} T_1'(\mu_A(x)) = T_1'(p_{mn}), & 0 \le \mu_A(x) \le 0.5, \\ T_1''(\mu_A(x)) = T_1''(p_{mn}), & 0.5 < \mu_A(x) \le 1, \end{cases} \qquad (5)$

and, iterating,

$p'_{mn} = T_r(p_{mn}) = \begin{cases} T_r'(p_{mn}), & 0 \le p_{mn} \le 0.5, \\ T_r''(p_{mn}), & 0.5 < p_{mn} \le 1, \end{cases} \qquad (6)$

where $T_s(p_{mn}) = T_1(T_{s-1}(p_{mn}))$, $s = 1, 2, \dots, r$. Applying (5) once may not produce an evident enhancement, but applying it r times makes the enhancement much more obvious, so the transform $T_r$ is the kernel of the fuzzy enhancement algorithm.
2.1 The Process of the Algorithm

i) Apply the G transform of equation (2) to the image $(x_{mn})$ to obtain the fuzzy character plane $(p_{mn})$: $p_{mn} = G(x_{mn})$;

ii) Apply the transform $T_r$ to each $p_{mn}$ in the fuzzy character plane to obtain the enhanced fuzzy character plane $(p'_{mn})$: $p'_{mn} = T_r(p_{mn})$, with the correction

$p'_{mn} = \left(1 + \frac{x_{\max}}{F_d}\right)^{-F_e} \quad \text{if } p'_{mn} < \left(1 + \frac{x_{\max}}{F_d}\right)^{-F_e}, \qquad (7)$

so that every enhanced value remains within the range of G;

iii) Calculate the inverse function $G^{-1}$ from equation (2) and apply it to each $p'_{mn}$ in the enhanced fuzzy character plane to obtain the corresponding gray level $x'_{mn} = G^{-1}(p'_{mn})$.
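The following is a minimal numpy sketch of steps i)–iii), assuming the G transform of equation (2), r-fold application of the INT operator of equation (4), and the clipping of equation (7); the function names, the random test image, and the parameter values are illustrative assumptions.

```python
import numpy as np

def G(x, x_max, Fd, Fe):
    """Equation (2): gray level -> fuzzy membership p in (0, 1]."""
    return (1.0 + (x_max - x) / Fd) ** (-Fe)

def G_inv(p, x_max, Fd, Fe):
    """Inverse of equation (2): membership -> gray level."""
    return x_max - Fd * (p ** (-1.0 / Fe) - 1.0)

def INT(p):
    """Equation (4): fuzzy contrast intensification."""
    return np.where(p <= 0.5, 2.0 * p ** 2, 1.0 - 2.0 * (1.0 - p) ** 2)

def fuzzy_enhance(img, Fd, Fe, r):
    x_max = img.max()
    p = G(img.astype(float), x_max, Fd, Fe)   # step i)
    for _ in range(r):                         # step ii): T_r = INT applied r times
        p = INT(p)
    p_min = G(0.0, x_max, Fd, Fe)              # equation (7): keep p in range of G
    p = np.maximum(p, p_min)
    return G_inv(p, x_max, Fd, Fe)             # step iii)

# Illustrative use on a random 8-bit image with assumed parameters.
img = np.random.randint(0, 256, (64, 64))
out = fuzzy_enhance(img, Fd=128.0, Fe=2.0, r=2)
```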
2.2 How to Select the Involved Parameters

i) Select a turning point $x_c$ among the image gray levels such that $p_{mn} > 0.5$ when $x_{mn} > x_c$ and $p_{mn} < 0.5$ when $x_{mn} < x_c$, namely $P_{x_c} = 0.5$. There are many ways to choose $x_c$; here we obtain a suitable value by means of the histogram. Selecting the turning point $x_c$ in a "valley" of the histogram (Fig. 1) makes it easy to segment an interesting object in the image and enhance it.

Fig. 1. Histogram

ii) Substituting $x_c$ and $P_{x_c} = 0.5$ into equation (2) gives

$P_{x_c} = \left(1 + \frac{x_{\max} - x_c}{F_d}\right)^{-F_e} = 0.5. \qquad (8)$

Usually $F_e = 1$, 2, or at most 3 is enough; $F_d$ can then be calculated from equation (8).

iii) Select the number of enhancement iterations r and apply the fuzzy enhancement r times by equation (6) to obtain a better effect. Usually r = 1, 2, or at most 3 is enough.
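As a hedged sketch of this parameter selection (the valley-detection heuristic and smoothing width are assumptions, not from the paper), equation (8) can be solved for $F_d$ in closed form, and candidate values of $x_c$ can be read off the smoothed histogram:

```python
import numpy as np

def Fd_from_crossover(xc, x_max, Fe):
    """Solve equation (8) for F_d so that G(x_c) = 0.5:
    0.5 = (1 + (x_max - xc)/Fd)^(-Fe)  =>  Fd = (x_max - xc)/(2**(1/Fe) - 1)."""
    return (x_max - xc) / (2.0 ** (1.0 / Fe) - 1.0)

def histogram_valleys(img, bins=256, smooth=5):
    """Candidate crossover points x_c: local minima of the smoothed histogram."""
    hist, edges = np.histogram(img, bins=bins)
    kernel = np.ones(smooth) / smooth
    h = np.convolve(hist, kernel, mode="same")
    valleys = [i for i in range(1, bins - 1) if h[i - 1] > h[i] <= h[i + 1]]
    return [0.5 * (edges[i] + edges[i + 1]) for i in valleys]

print(Fd_from_crossover(xc=130, x_max=255, Fe=2))
```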
3 An Example and Discussion

We now apply the histogram-based fuzzy enhancement algorithm to the image in Fig. 2(a), whose histogram is Fig. 1, in order to improve the local visual effect; the gray levels range from 0 to 255. First, the histogram (Fig. 1) has three valleys, in [0, 60], [100, 200] and [230, 250], where we can select three corresponding values of $x_c$ for enhancement. Second, in order to separate an interesting area and enhance its contrast in the image, we can assign $x_c$ according to the gray levels in that area. From the original image (Fig. 2(a)) and its histogram, we infer that $x_c$ can be set at about 130 and 240. To get better experimental results, we can adjust the turning point $x_c$, the number of enhancement iterations r, and the exponential fuzzy factor $F_e$. The results are shown in Fig. 2(b)–(c). For comparison, we process the same image with the histogram equalization algorithm alone; the result is Fig. 2(d).
Fig. 2. (a) Original image; (b) enhancement of the two distant mountains; (c) enhancement of sky and cloud; (d) result using the histogram equalization algorithm alone.
We can draw some conclusions from the experimental results: i) The histogram equalization algorithm enhances the image as a whole, based on statistics; but it is uncontrollable, and it cannot easily bring out tiny gray-level changes. ii) The histogram-based fuzzy enhancement algorithm presented in this paper is much more flexible and convenient. By adjusting the parameters $x_c$, r and $F_e$, we can enhance the detail of the image to a satisfactory degree while improving the contrast of the local interesting area, which gives a better visual effect and is useful for further image processing such as edge extraction, pattern recognition and so on. For example, the cloud and the sky can be distinguished in Fig. 2(c).
References
1. Guo Guirong et al.: Fuzzy Technology in Information Processing. National University of Defense Technology Press (1993)
2. Castleman, K.R.: Digital Image Processing. Prentice-Hall, Inc. (1996)
3. Tizhoosh, H.R., Fochem, M.: Fuzzy Histogram Hyperbolization for Image Enhancement. Proceedings EUFIT 95, Vol. 3, Aachen (1995)
A Study on a Generalized FCM
Jian Yu 1 and Miin-shen Yang 2
1 Dept. of Computer Science & Technology, Northern Jiaotong University, Beijing 100044
2 Dept. of Applied Mathematics, Chung Yuan Christian University, Chungli, Taiwan, 32023
[email protected]
Abstract. In this paper, we unify several alternative FCM algorithms, including the FCM, PFCM, CFCM, PIM, ICS, etc., into one model, called a generalized fuzzy c-means (GFCM). This GFCM model presents a wide variety of FCM algorithms. We show that the pseudo mass center of the data set is a fixed point of the GFCM under mild conditions. By judging the stability of this fixed point of the GFCM, we give a new, theoretically grounded approach to choosing the parameters in the GFCM model.
1 Introduction

Cluster analysis is a tool for grouping a data set into clusters of similar characteristics and has been successfully applied in various areas (see [1], etc.). The fuzzy c-means (FCM) algorithm is one of the most widely used clustering algorithms. However, some problems remain when applying the FCM: it supposes that all points in the data set are equally important; the numbers of points in the clusters are assumed almost equal; almost no point has membership equal to one; and outlying points affect the clustering results. In order to overcome these drawbacks, many generalized FCM algorithms have been proposed in the literature, for example the PFCM [2,3], the conditional FCM (CFCM) [4], and the ICS and PIM algorithms [5,6]. In this paper, we first unify these varieties of FCM into one model and call it a generalized FCM (GFCM). The proposed GFCM model can present even more generalizations of FCM algorithms. The main aim of this paper is to obtain a new way to choose appropriate parameters in the fuzzy clustering algorithm, following Yu et al. [8]. This paper is organized as follows. In Section 2, several generalized types of FCM algorithms, including FCM, PFCM, CFCM, ICS, PIM, and AFCM [7], are unified into one model, called a GFCM. In Section 3, we point out that the pseudo mass center of the data set is a fixed point of the GFCM under mild conditions; then, by judging the stability of fixed points of the GFCM, we obtain theoretically justified criteria to choose the appropriate parameters in the GFCM, including the weighting exponent m. In Section 4, we make conclusions and remarks.
2 The GFCM Algorithm

Let $X = \{x_1, x_2, \dots, x_n\} \subset R^s$ be a data set. For a given c, $2 \le c < n$, let $v = \{v_1, v_2, \dots, v_c\}$ denote the cluster centers, where $v_i \in R^s$. We propose the following GFCM objective function:

$J_m^\rho(u, v) = \sum_{k=1}^{n}\sum_{i=1}^{c}\Big[(u_{ik})^m \rho_i(d(x_k, v_i)) - \frac{\gamma}{c}\sum_{t=1}^{c}\rho_0(d(v_i, v_t))\Big] \qquad (1)$

subject to $\sum_{i=1}^{c} u_{ik} = f_k$ for $f_k \ge 0$, where each $\rho_i(x)$ is a continuous function of $x \in [0, +\infty)$ satisfying $\rho_i'(x) > 0$ for all $x \in [0, +\infty)$.

Several examples of the GFCM based on (1), including the FCM, PFCM, conditional FCM, PIM, ICS and AFCM, are as follows:
a) Set $\forall k, f_k = 1$; $\forall i, \rho_i = x$; $\gamma = 0$. Then (1) leads to the standard FCM algorithm, see [1].
b) Set $\forall k, f_k = 1$; $\forall i, \rho_i = x - w\ln(\alpha_i)$; $\gamma = 0$; and $\forall i, \alpha_i \ge 0$, $\sum_{i=1}^{c}\alpha_i = 1$. Then (1) results in the PFCM algorithm proposed in [2].
c) Set $\forall k, f_k = 1$; $\forall i, \rho_i = x - w\tanh(\alpha_i)$; $\gamma = 0$; and $\forall i, \alpha_i \ge 0$, $\sum_{i=1}^{c}\alpha_i = 1$. Then (1) results in an alternative version of the PFCM, proposed in [3].
d) Set $\forall i, \rho_i = x$; $\gamma = 0$. Then (1) is the objective function of the conditional FCM algorithm, see [4].
e) Set $\forall k, f_k = 1$; $\forall i, \rho_i = x - w$; $\gamma = 0$. Then minimization of (1) naturally yields the PIM algorithm, see [5].
f) Set $f_k = 1$ for all k and $\rho_i(x) = x$ for all i. Then we get the ICS objective function, see [6].
g) Set $f_k = 1$ for all k, $\gamma = 0$ and $\rho_i(x) = 1 - \exp(-\beta x)$ for all i. Then the AFCM objective function is obtained, see [7].

By Lagrange multipliers, we can get the necessary conditions for a minimum of $J_m^\rho(u, v)$ as follows:

$v_i = \dfrac{\sum_{k=1}^{n} u_{ik}^m \rho_i'(d(x_k, v_i))\,x_k - \frac{2\gamma}{c}\sum_{j=1}^{c}\rho_0'(d(v_i, v_j))\,v_j}{\sum_{k=1}^{n} u_{ik}^m \rho_i'(d(x_k, v_i)) - \frac{2\gamma}{c}\sum_{j=1}^{c}\rho_0'(d(v_i, v_j))} \qquad (2)$

$u_{ik} = f_k\,\dfrac{\rho_i(d(x_k, v_i))^{-\frac{1}{m-1}}}{\sum_{j=1}^{c}\rho_j(d(x_k, v_j))^{-\frac{1}{m-1}}} \qquad (3)$

The iteration with update equations (2) and (3) is called the GFCM algorithm, where $d(x_k, v_i) = \|x_k - v_i\|^2$.

Substituting (3) into (1) yields

$J_m^\rho(v) = \sum_{k=1}^{n} f_k^m \Big(\sum_{i=1}^{c}\rho_i(d(x_k, v_i))^{-\frac{1}{m-1}}\Big)^{1-m} - \frac{n\gamma}{c}\sum_{i=1}^{c}\sum_{t=1}^{c}\rho_0(d(v_i, v_t)) \qquad (4)$
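As a hedged illustration of the alternating iteration (2)–(3), the following minimal sketch implements the standard-FCM special case a) ($f_k = 1$, $\rho_i(x) = x$, $\gamma = 0$); the data, initialization, and parameter values are illustrative assumptions.

```python
import numpy as np

def gfcm_fcm_case(X, c, m=2.0, iters=100, seed=0):
    """GFCM iteration (2)-(3) in the standard-FCM special case
    (f_k = 1, rho_i(x) = x, gamma = 0); X has shape (n, s)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, size=c, replace=False)]           # initial centers
    for _ in range(iters):
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + 1e-12  # d(x_k, v_i)
        U = d ** (-1.0 / (m - 1.0))                        # update (3) with f_k = 1
        U /= U.sum(axis=1, keepdims=True)
        W = U.T ** m                                       # (c, n) weights
        V = W @ X / W.sum(axis=1, keepdims=True)           # update (2) with gamma = 0
    return U, V

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
U, V = gfcm_fcm_case(X, c=2)
```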
3 The Properties of the GFCM
It is well known that one of the most important parameters in the FCM is the weighting exponent m. When m is close to one, the FCM approaches the hard c-means algorithm; when m approaches infinity, the only solution of the FCM, and hence its fixed point, is the mass center $\bar{x}$ of the data set. Therefore, the weighting exponent m plays an important role in the FCM algorithm, and similarly, choosing a suitable weighting exponent is very important when implementing the GFCM. Let us review the basic idea for choosing a proper weighting exponent m in the FCM. It is known that the mass center $\bar{x}$ of the data set is always a fixed point of the FCM algorithm. However, when the data set is clustered into c (c > 1) subsets, each subset is usually expected to have a prototype (cluster center) different from the others. Therefore, we hope that $\bar{x}$ is not a stable fixed point of the FCM [8]. A similar analysis should hold for the GFCM.
Let

$x_f = \frac{\sum_{k=1}^{n} f_k^m x_k}{\sum_{k=1}^{n} f_k^m}$, and $\rho_i(x) = k_i(x + b)$,

where each $k_i$ is a positive constant, $k_1 = 1$, and b can be assumed to be a proper real number. Then it can be proved that $x_f$ is always a fixed point of the GFCM algorithm.

Under the above assumptions, if $\forall i, v_i = x_f$, $\rho_0(x) = k_0 x + b_0$ where $k_0 > 0$, and $d(x_k, x_f) + b > 0$ for all k, let

$C_X^{b,f} = \sum_{k=1}^{n} \frac{f_k^m (x_k - x_f)(x_k - x_f)^T}{\sum_{l=1}^{n} f_l^m \cdot (d(x_k, x_f) + b)}.$

Then it can be proved that the Hessian matrix of (4) at $\forall i, v_i = x_f$ is positive semi-definite if and only if the matrix

$\Big(1 - \frac{2n\gamma k_0 (c+1)c^m}{c\,k_i \sum_{k=1}^{n} f_k^m}\Big) I_{s\times s} - \frac{2m}{m-1}\, C_X^{b,f}$

is positive semi-definite, where $b > -\min_{1\le k\le n}\{d(x_k, x_f)\}$. Let $\lambda_{\max}(A)$ denote the maximum eigenvalue of a matrix A. It is easy to show that the Hessian matrix of (4) at $\forall i, v_i = x_f$ is positive semi-definite if and only if

$1 - \frac{2n\gamma k_0 (c+1)c^m}{c\,k_i \sum_{k=1}^{n} f_k^m} - \frac{2m}{m-1}\,\lambda_{\max}(C_X^{b,f}) \ge 0.$

This result leads to a theoretically justified criterion for parameter selection in the GFCM: in other words, the relevant parameters of the GFCM should be chosen so that

$1 - \frac{2n\gamma k_0 (c+1)c^m}{c\,k_i \sum_{k=1}^{n} f_k^m} - \frac{2m}{m-1}\,\lambda_{\max}(C_X^{b,f}) < 0,$

in order to make the GFCM work well.
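As a hedged numerical illustration of this criterion (not from the paper), consider the special case $f_k = 1$, $k_i = 1$, $b = 0$, $\gamma = 0$, where the condition reduces to $1 - \frac{2m}{m-1}\lambda_{\max}(C_X^{0,f}) < 0$; the data are illustrative.

```python
import numpy as np

def criterion_violated(X, m, b=0.0):
    """Check 1 - 2m/(m-1) * lambda_max(C_X^{b,f}) < 0 with f_k = 1, gamma = 0.
    If True, x_f is not a stable fixed point, i.e., the choice of m is safe."""
    n = X.shape[0]
    xf = X.mean(axis=0)                                   # x_f with f_k = 1
    diffs = X - xf
    d = (diffs ** 2).sum(axis=1) + b                       # d(x_k, x_f) + b
    C = sum(np.outer(diffs[k], diffs[k]) / (n * d[k]) for k in range(n))
    lam = np.linalg.eigvalsh(C).max()
    return 1.0 - 2.0 * m / (m - 1.0) * lam < 0.0

X = np.random.randn(100, 2)
print(criterion_violated(X, m=2.0))
```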
4 Conclusions

Because real data sets are quite various, it is impossible for one clustering algorithm to fit all real cases; this is why so many generalized types of FCM have been proposed. In this paper, we create a GFCM model on the basis of several FCM-generalized algorithms such as PFCM, CFCM, ICS, PIM and AFCM. The proposed GFCM can present even more varieties of FCM; in fact, new clustering algorithms can be derived from the GFCM model. The main objective of this paper is to consider the test for solutions and the selection of parameters of the GFCM. It is known that the parameters in clustering algorithms always have large effects on the clustering results, for instance the weighting exponent m for FCM and CFCM, the parameter b for PFCM and PIM, and the parameter γ for ICS. Moreover, the mass center $x_f$ of the data set is always a stable fixed point of most varieties of FCM; however, this is not expected and should be avoided in cluster analysis. In this paper, we have constructed a simple optimality test for a fixed point of the GFCM, which yields the condition under which $x_f$ is avoided as a stable fixed point of the GFCM. We have therefore proposed a theoretically based tool for selecting the parameters of the GFCM. These results are important for many fuzzy clustering algorithms, especially for generalized types of FCM. The theoretical analysis of this paper can easily be extended to most clustering algorithms, and it is also helpful in applications of the varieties of FCM.
References
[1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
[2] M.S. Yang, "On a class of fuzzy classification maximum likelihood procedures," Fuzzy Sets and Systems, vol. 57, pp. 365–375, 1993.
[3] J.S. Lin, K.S. Cheng and C.W. Mao, "Segmentation of multispectral magnetic resonance image using penalized fuzzy competitive learning network," Computers and Biomedical Research, vol. 29, pp. 314–326, 1996.
[4] W. Pedrycz, "Conditional fuzzy c-means," Pattern Recognition Letters, vol. 17, pp. 625–632, 1996.
[5] D. Özdemir and L. Akarun, "Fuzzy algorithms for combined quantization and dithering," IEEE Trans. Image Processing, vol. 10(6), pp. 923–931, 2001.
[6] D. Özdemir and L. Akarun, "A fuzzy algorithm for color quantization of images," Pattern Recognition, vol. 35, pp. 1785–1791, 2002.
[7] K.L. Wu and M.S. Yang, "Alternative c-means clustering algorithms," Pattern Recognition, vol. 35, pp. 2267–2278, 2002.
[8] J. Yu, Q. Cheng and H. Huang, "Analysis of the weighting exponent in the FCM," IEEE Trans. Syst., Man, Cybern. – Part B, 2002. (accepted)
Fuzzy Multiple Synapses Neural Network and Fuzzy Clustering
Kai Li 1,2, Houkuan Huang 1, and Jian Yu 1
1 AI Institute, School of Computer and Information Technology, Northern Jiaotong University, Beijing 100044
2 School of Mathematics and Computer, Hebei University, Baoding 071002
[email protected]
Abstract. A fuzzy multiple synapses neural network is proposed, overcoming the limitation of the traditional Hopfield network, which handles only quadratic optimization problems. First, the fuzzy multiple synapses neuron and network are defined, and a hybrid neural network combining the multiple synapses neural network with a Hopfield network is given. It can solve constrained optimization problems whose objective functions may include not only high-order terms but also logarithmic, sinusoidal and other forms. Second, it is applied to fuzzy clustering and a concrete hybrid neural network is derived. Experiments show that the method is valid. Finally, conclusions and future work are given.
1 Introduction

Bezdek proposed fuzzy c-means clustering (FCM), whose objective function is a modified version of the objective function of the c-means clustering algorithm; the main difference is that the weighting exponent m is introduced into the objective function of the FCM. Many scholars have since modified it and obtained various clustering algorithms [1,2]. On the other hand, the Hopfield network handles only quadratic optimization problems, so in [3] fuzzy clustering takes a maximum entropy approach instead of the weighting exponent m. In [4], a multiple synapses neural network is proposed and applied to fuzzy clustering. In this paper, a generalized network, named the fuzzy multiple synapses neural network, is proposed; the objective function of the optimization problem may include high-order, logarithmic and other terms. The multiple synapses network is applied to fuzzy clustering to overcome the limitation of the Hopfield network. Experiments show that the method is valid. This paper is organized as follows. In Section 2, we briefly describe the fuzzy c-means algorithm. In Section 3, the architecture of multiple synapses neural networks is described in detail. In Section 4, the multiple synapses neural network is applied to fuzzy clustering and a concrete multiple synapses network is derived. Finally, experiments, conclusions and future work are given.
2 Fuzzy Clustering

Let $X = \{x_1, x_2, \dots, x_n\} \subset R^p$ be any finite set in p-dimensional real space, let $V_{cn}$ be the set of all real $c \times n$ matrices, and let $2 \le c \le n$ be an integer. A matrix $U = [u_{ik}] \in V_{cn}$ is called a fuzzy c-partition if it satisfies the following conditions:
1) $u_{ik} \in [0,1]$, $1 \le i \le c$, $1 \le k \le n$;
2) $\sum_{i=1}^{c} u_{ik} = 1$, $1 \le k \le n$;
3) $0 < \sum_{k=1}^{n} u_{ik} < n$, $1 \le i \le c$.

Let $M_{fc}$ denote the set of all matrices satisfying the above conditions. Fuzzy c-means clustering is equivalent to the following optimization problem:

$\min z_m(U; v) = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^m \|x_k - v_i\|_G^2, \quad \text{s.t.} \sum_{i=1}^{c} u_{ik} = 1, \forall k. \qquad (1)$

By Lagrange multipliers, we may derive $v_i$ and $u_{ik}$.
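The text leaves the derived formulas implicit; for reference, the standard FCM solutions obtained by this Lagrangian derivation (well known from the FCM literature) are

$v_i = \frac{\sum_{k=1}^{n} u_{ik}^m x_k}{\sum_{k=1}^{n} u_{ik}^m}, \qquad u_{ik} = \left[\sum_{j=1}^{c}\left(\frac{\|x_k - v_i\|_G}{\|x_k - v_j\|_G}\right)^{2/(m-1)}\right]^{-1}.$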
3 Fuzzy Multiple Synapses Neural Network

Definition 1. A fuzzy multiple synapses neuron stands for the following memory system:

$y = f\big[\oplus_{i=1}^{n} \oplus_{j=1}^{r_i}(w_{ij} \otimes x_i) - \theta\big] \qquad (2)$

where $\langle\otimes, \oplus\rangle$ is a fuzzy neural operator pair with domain A; $x_1, x_2, \dots, x_n$ are the n inputs, $x_i \in A$; $w_{i1}, w_{i2}, \dots, w_{ir_i}$ are the weights corresponding to input $x_i$, with $w_{ij} \in [0,1]$; $\theta \in R$ is the threshold value; and f is the output function with $0 \le f(\cdot) \le 1$.

Definition 2. A generalized fuzzy operator neuron stands for the system $Y = T(C(M(X)))$, where $X = (X_1, X_2, \dots, X_n)$ is the input vector and Y is the output, defined on A and B respectively, and $M = (M_1, M_2, \dots, M_n)$, C and T are the modifying operator, the focusing operator and the transfer operator, respectively.

Definition 3. A generalized multiple synapses fuzzy operator neuron is a generalized fuzzy operator neuron in which $M_i$, $i = 1, 2, \dots, n$, and T are operator vectors, namely $M_i = (M_{i1}, M_{i2}, \dots, M_{ip})$ and $T = (T_1, T_2, \dots, T_q)$.

Definition 4. A fuzzy multiple synapses neural network is a network consisting of many fuzzy multiple synapses neurons, each connected with other neurons according to some topological architecture. For simplicity, the fuzzy operator pair $\langle\otimes, \oplus\rangle$ is treated as the ordinary operator pair $\langle\times, +\rangle$ below.
4 Fuzzy Clustering and Hybrid Neural Network

Here we consider a generalized form based on the maximum entropy approach, namely the following problem:

$\min z_m(U; v) = \sum_{i=1}^{c}\sum_{k=1}^{n}(u_{ik})^m \|x_k - v_i\|_G^2 + \lambda^{-1}\sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}\log_e(\alpha_i^{-1}u_{ik}) \qquad (3)$

$\text{s.t.} \quad \sum_{i=1}^{c} u_{ik} = 1, \quad 1 \le k \le n, \quad \alpha_i > 0. \qquad (4)$
To solve this problem, a new hybrid neural network is proposed, composed of a Hopfield network and a multiple synapses neural network: the Hopfield network is trained to calculate the cluster centers, while the multiple synapses neural network is trained to compute the membership degrees. Set $r_i = s$ in the multiple synapses neural network. The weights are $w_{ij}, z_{ij}, y_{ij}$, $i = 1, 2, \dots, s$, $j = 1, 2, \dots, s$, respectively; there are three activation functions, $f_1, f_2, f_3$; and the external inputs of the neurons are $i_j$, $j = 1, 2, \dots, s$. In the three-synapses neural network, the net inputs are

$net_j = \sum_{i=1}^{s}(w_{ji}u_i + z_{ji}u_i + y_{ji}u_i) + i_j, \quad j = 1, 2, \dots, s. \qquad (5)$

Set $U^{\langle m-1\rangle} = (u_1^{m-1}, u_2^{m-1}, \dots, u_s^{m-1})^T$ for m > 1, and denote the transpose of U by $U^T$. Considering the actual objective function, the net inputs of the neural network become

$net_j = \sum_{i=1}^{s}\big(w_{ji}u_i^{m-1} + z_{ji}u_i + y_{ji}(\log_e(\alpha_i^{-1}u_i) + 1)\big) + i_j, \quad j = 1, 2, \dots, s. \qquad (6)$
The computational energy function for the designed multiple synapses neural network is

$E = -\frac{1}{m}U^T W U^{\langle m-1\rangle} - \frac{1}{2}U^T Z U - U^T Y U^{\langle\log\rangle} - U^T I \qquad (7)$

where $U^{\langle\log\rangle} = (\log_e(\alpha_1^{-1}u_1), \log_e(\alpha_2^{-1}u_2), \dots, \log_e(\alpha_s^{-1}u_s))^T$. Equating the corresponding terms of E with the unconstrained form of $z_m$, the weight matrices W, Z and Y can be solved; they have the following forms:

$w_{ji} = \begin{cases}-m\,d_i, & i = j,\\ 0, & i \ne j,\end{cases} \qquad i, j = 1, 2, \dots, c \times n \qquad (8)$

$z_{ji} = \begin{cases}-2\lambda, & \left(\frac{i}{c}-1\right)\cdot c < j \le \frac{i}{c}\cdot c,\\ 0, & \text{otherwise},\end{cases} \qquad i, j = 1, 2, \dots, c \times n \qquad (9)$

$y_{ji} = \begin{cases}-\lambda^{-1}, & i = j,\\ 0, & i \ne j,\end{cases} \qquad i, j = 1, 2, \dots, c \times n \qquad (10)$
5 Experiment, Conclusion, and Future Work

With the hybrid neural network obtained above, a series of experiments were conducted with different m, $\alpha_i$ and $\lambda$. In the experiments, the iris data set is used to illustrate unsupervised clustering; we use 2-dimensional data by taking the first two components. In the resulting clustering, only 3 samples are misclassified. The experiments show that the method is valid. In this paper, a fuzzy multiple synapses neural network is proposed and applied to fuzzy clustering, overcoming the limitation of the Hopfield network, which only solves quadratic optimization problems. In addition, this network can also solve optimization problems whose objective functions include high-order, logarithmic, sinusoidal and other forms. In future work, we will further investigate the features of the fuzzy multiple synapses neural network, such as convergence, stability, and so on.
References
1. R. Krishnapuram, J.M. Keller: A Possibilistic Approach to Clustering. IEEE Trans. on Fuzzy Syst., Vol. 1, No. 2 (1993) 98–110.
2. H. Ichihashi, K. Honda, N. Tani: Gaussian Mixture PDF Approximation and Fuzzy c-means Clustering with Entropy Regularization. Proc. of the 4th Asian Fuzzy System Symposium, Tsukuba, Japan (2000) 217–221.
3. R.P. Li, M. Mukaidono: A Maximum Entropy Approach to Fuzzy Clustering. In: Proc. 4th IEEE Int. Conf. Fuzzy Syst., Yokohama, Japan (1995) 2227–2232.
4. Chih-Hsiu Wei, Chin-Shyurng Fahn: The Multisynapse Neural Network and Its Application to Fuzzy Clustering. IEEE Trans. on Neural Networks, Vol. 13, No. 3 (2002) 600–618.
On Possibilistic Variance of Fuzzy Numbers
Wei-Guo Zhang 1,2 and Zan-Kan Nie 1
1 Institute for Information and System Sciences, Faculty of Science, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, P. R. China
2 Ningxia University, Yinchuan, Ningxia, 750002, P. R. China
[email protected]
Abstract. The lower and upper possibilistic mean values of fuzzy numbers were introduced by Carlsson and Fullér. This paper introduces the concepts of lower and upper possibilistic variances of fuzzy numbers. These concepts are consistent with the extension principle and with the well-known definition of variance in probability theory. We also define a crisp possibilistic variance which differs from the one given by Carlsson and Fullér. Furthermore, we show that the lower and upper possibilistic variances of a linear combination of fuzzy numbers can be computed in a manner similar to probability theory.
1 Introduction

In 1987 Dubois and Prade [2] defined an interval-valued expectation of fuzzy numbers and showed that this expectation remains additive in the sense of addition of fuzzy numbers. In 2001 Carlsson and Fullér [4] defined the concepts of lower and upper possibilistic mean values, and they also introduced a crisp variance of continuous possibility distributions. This paper introduces the concepts of lower and upper possibilistic variances based on the lower and upper possibilistic mean values of fuzzy numbers introduced by Carlsson and Fullér. We also define a crisp possibilistic variance which differs from that of [4]. These concepts are consistent with the extension principle and with the well-known definition of variance in probability theory. The theory developed in this paper is fully motivated by the principles introduced in [2].
2 Possibilistic Mean of Fuzzy Numbers

A fuzzy number A is a fuzzy set of the real line R with a normal, fuzzy convex and continuous membership function of bounded support. The family of fuzzy numbers will be denoted by F. A γ-level set of a fuzzy number A is denoted by $[A]^\gamma = \{t \in R \mid A(t) \ge \gamma\}$ if $\gamma > 0$ and $[A]^0 = cl\{t \in R \mid A(t) > 0\}$ (the closure of the support of A). Let $A \in F$ be a fuzzy number with $[A]^\gamma = [a_1(\gamma), a_2(\gamma)]$, $\gamma \in [0,1]$. In 2001 Carlsson and Fullér [4] defined the lower and upper possibilistic mean values of A as

$M_*(A) = 2\int_0^1 \gamma\, a_1(\gamma)\,d\gamma = \frac{\int_0^1 Pos[A \le a_1(\gamma)]\, a_1(\gamma)\,d\gamma}{\int_0^1 Pos[A \le a_1(\gamma)]\,d\gamma},$

$M^*(A) = 2\int_0^1 \gamma\, a_2(\gamma)\,d\gamma = \frac{\int_0^1 Pos[A \ge a_2(\gamma)]\, a_2(\gamma)\,d\gamma}{\int_0^1 Pos[A \ge a_2(\gamma)]\,d\gamma},$

where Pos denotes possibility, i.e.,

$Pos[A \le a_1(\gamma)] = \Pi((-\infty, a_1(\gamma)]) = \sup_{u \le a_1(\gamma)} A(u) = \gamma,$
$Pos[A \ge a_2(\gamma)] = \Pi([a_2(\gamma), \infty)) = \sup_{u \ge a_2(\gamma)} A(u) = \gamma.$

Furthermore, the following conclusions were introduced.

Theorem 2.1. Let A and B be two non-interactive fuzzy numbers and let $\lambda \in R$ be a real number. Then

$M_*(A+B) = M_*(A) + M_*(B), \qquad M^*(A+B) = M^*(A) + M^*(B),$

$M_*(\lambda A) = \begin{cases}\lambda M_*(A) & \text{if } \lambda \ge 0,\\ \lambda M^*(A) & \text{if } \lambda < 0,\end{cases} \qquad M^*(\lambda A) = \begin{cases}\lambda M^*(A) & \text{if } \lambda \ge 0,\\ \lambda M_*(A) & \text{if } \lambda < 0,\end{cases}$

where the addition and multiplication by a scalar of fuzzy numbers are defined by the sup-min extension principle [3].
3 Possibilistic Variance of Fuzzy Numbers
We next introduce the concepts of the lower and upper possibilistic variances of a fuzzy number A with $[A]^\gamma = [a_1(\gamma), a_2(\gamma)]$, $\gamma \in [0,1]$. The lower possibilistic variance of A is defined as

$Var_*(A) = \frac{\int_0^1 Pos[A \le a_1(\gamma)]\,(M_*(A) - a_1(\gamma))^2\,d\gamma}{\int_0^1 Pos[A \le a_1(\gamma)]\,d\gamma} = 2\int_0^1 \gamma\,(M_*(A) - a_1(\gamma))^2\,d\gamma.$

Remark 3.1. The lower possibilistic variance of A is the lower-possibility-weighted average of the squared deviations between the left-hand endpoints of its level sets and its lower possibilistic mean.

The upper possibilistic variance of A is defined as

$Var^*(A) = \frac{\int_0^1 Pos[A \ge a_2(\gamma)]\,(M^*(A) - a_2(\gamma))^2\,d\gamma}{\int_0^1 Pos[A \ge a_2(\gamma)]\,d\gamma} = 2\int_0^1 \gamma\,(M^*(A) - a_2(\gamma))^2\,d\gamma.$

Remark 3.2. The upper possibilistic variance of A is the upper-possibility-weighted average of the squared deviations between the right-hand endpoints of its level sets and its upper possibilistic mean.
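As a hedged numeric sketch of these definitions (not from the paper), consider a triangular fuzzy number A = (a, α, β) with γ-cuts $a_1(\gamma) = a - (1-\gamma)\alpha$ and $a_2(\gamma) = a + (1-\gamma)\beta$; evaluating the defining integrals by quadrature reproduces the closed forms $M_*(A) = a - \alpha/3$, $M^*(A) = a + \beta/3$, $Var_*(A) = \alpha^2/18$, $Var^*(A) = \beta^2/18$ (obtained by direct integration, stated here as a check).

```python
import numpy as np

# Triangular fuzzy number A = (a, alpha, beta); values are illustrative.
a, alpha, beta = 2.0, 1.0, 3.0
g = np.linspace(0.0, 1.0, 100001)
a1, a2 = a - (1 - g) * alpha, a + (1 - g) * beta

M_low = np.trapz(2 * g * a1, g)                    # lower possibilistic mean
M_up = np.trapz(2 * g * a2, g)                     # upper possibilistic mean
Var_low = np.trapz(2 * g * (M_low - a1) ** 2, g)   # lower possibilistic variance
Var_up = np.trapz(2 * g * (M_up - a2) ** 2, g)     # upper possibilistic variance

print(M_low, M_up)       # approx. a - alpha/3 and a + beta/3
print(Var_low, Var_up)   # approx. alpha**2/18 and beta**2/18
```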
The lower and upper possibilistic covariances between fuzzy numbers A with $[A]^\gamma = [a_1(\gamma), a_2(\gamma)]$ and B with $[B]^\gamma = [b_1(\gamma), b_2(\gamma)]$, $\gamma \in [0,1]$, are, respectively, defined as

$Cov_*(A,B) = 2\int_0^1 \gamma\,(M_*(A) - a_1(\gamma))(M_*(B) - b_1(\gamma))\,d\gamma,$
$Cov^*(A,B) = 2\int_0^1 \gamma\,(M^*(A) - a_2(\gamma))(M^*(B) - b_2(\gamma))\,d\gamma.$

Theorem 3.1. Let A and B be two non-interactive fuzzy numbers. Then

$Var_*(A+B) = Var_*(A) + Var_*(B) + 2Cov_*(A,B),$
$Var^*(A+B) = Var^*(A) + Var^*(B) + 2Cov^*(A,B),$

where the addition of fuzzy numbers is defined by the sup-min extension principle [3].

Proof. Indeed, from Theorem 2.1 and the equation $[A+B]^\gamma = [a_1(\gamma)+b_1(\gamma),\, a_2(\gamma)+b_2(\gamma)]$, we have

$Var_*(A+B) = 2\int_0^1 \gamma\,[M_*(A+B) - (a_1(\gamma)+b_1(\gamma))]^2\,d\gamma = 2\int_0^1 \gamma\,[(M_*(A)-a_1(\gamma)) + (M_*(B)-b_1(\gamma))]^2\,d\gamma = Var_*(A) + Var_*(B) + 2Cov_*(A,B).$

Similarly to the proof above, we easily obtain $Var^*(A+B) = Var^*(A) + Var^*(B) + 2Cov^*(A,B)$.

Theorem 3.2. Let A be a fuzzy number and let $\lambda \in R$ be a real number. Then

$Var_*(\lambda A) = \begin{cases}\lambda^2 Var_*(A) & \text{if } \lambda \ge 0,\\ \lambda^2 Var^*(A) & \text{if } \lambda < 0,\end{cases} \qquad Var^*(\lambda A) = \begin{cases}\lambda^2 Var^*(A) & \text{if } \lambda \ge 0,\\ \lambda^2 Var_*(A) & \text{if } \lambda < 0,\end{cases}$

where the multiplication by a scalar of a fuzzy number is defined by the sup-min extension principle [3].

Proof. From Theorem 2.1 and

$[\lambda A]^\gamma = \lambda[A]^\gamma = \lambda[a_1(\gamma), a_2(\gamma)] = \begin{cases}[\lambda a_1(\gamma), \lambda a_2(\gamma)] & \text{if } \lambda \ge 0,\\ [\lambda a_2(\gamma), \lambda a_1(\gamma)] & \text{if } \lambda < 0,\end{cases}$

we get, for $\lambda \ge 0$,

$Var_*(\lambda A) = 2\int_0^1 \gamma\,[M_*(\lambda A) - \lambda a_1(\gamma)]^2\,d\gamma = 2\int_0^1 \gamma\,[\lambda(M_*(A) - a_1(\gamma))]^2\,d\gamma = \lambda^2 Var_*(A),$

and, for $\lambda < 0$,

$Var_*(\lambda A) = 2\int_0^1 \gamma\,[M_*(\lambda A) - \lambda a_2(\gamma)]^2\,d\gamma = 2\int_0^1 \gamma\,[\lambda(M^*(A) - a_2(\gamma))]^2\,d\gamma = \lambda^2 Var^*(A).$

Similarly, we can easily show that

$Var^*(\lambda A) = \begin{cases}\lambda^2 Var^*(A) & \text{if } \lambda \ge 0,\\ \lambda^2 Var_*(A) & \text{if } \lambda < 0.\end{cases}$
The crisp possibilistic variance of a fuzzy number A is defined as

$Var(A) = \int_0^1 \gamma\,(M_*(A) - a_1(\gamma))^2\,d\gamma + \int_0^1 \gamma\,(M^*(A) - a_2(\gamma))^2\,d\gamma,$

which can be rewritten as

$Var(A) = \frac{Var_*(A) + Var^*(A)}{2}.$

The possibilistic covariance between fuzzy numbers A and B is defined as

$Cov(A,B) = \int_0^1 \gamma\,[(M_*(A) - a_1(\gamma))(M_*(B) - b_1(\gamma)) + (M^*(A) - a_2(\gamma))(M^*(B) - b_2(\gamma))]\,d\gamma,$

which can be rewritten as

$Cov(A,B) = \frac{Cov_*(A,B) + Cov^*(A,B)}{2}.$

4 Conclusions
We have introduced the concepts of the lower and upper possibilistic variances of fuzzy numbers. These concepts are consistent with the extension principle and with the well-known definition of variance in probability theory. We have also defined a crisp possibilistic variance which differs from the one given by Carlsson and Fullér, and we have obtained some properties of possibilistic variances.
References
1. Dubois, D., Prade, H.: Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York (1980)
2. Dubois, D., Prade, H.: The mean value of a fuzzy number. Fuzzy Sets and Systems 24 (1987) 279–300
3. Zadeh, L.A.: Fuzzy Sets. Inform. and Control 8 (1965) 338–353
4. Carlsson, C., Fullér, R.: On possibilistic mean value and variance of fuzzy numbers. Fuzzy Sets and Systems 122 (2001) 315–326
Deductive Data Mining: Mathematical Foundation of Database Mining
Tsau Young Lin
Department of Computer Science, San Jose State University, San Jose, California 95192
[email protected]
Extended Abstract

Keywords: deductive data mining, granular computing
1 Key Issues

In the Foundation of Data Mining and Discovery Workshop at ICDM02 [3], we proposed that data mining is a procedure that transforms (extracts or discovers from) data into patterns/knowledge; schematically,

DM: Patterns ⇐= Data, or KD: Knowledge ⇐= Data.

In this paper, we take the most conservative view, namely that only mathematical deductions are acceptable transformations. We call this type of data mining deductive data mining. In other words, we treat the data as the only "axioms" and patterns or knowledge as "interesting theorems" to be derived mathematically from the data. Taking such a view does not mean restricting the possible means and tools; it does mean that data miners need to identify clearly what the hidden assumptions are. So one of the important issues becomes:

– What are the raw data (including all the associated information that is utilized by the algorithms)?

In this paper, we focus on association rule mining; for neural network mining, some bases have been touched on in [4], and we will address it in future work.
2 Association Rule Mining

For association rule mining, we have observed that the algorithms process the table (in fully automated tasks) without utilizing the semantics of the relational schema or consulting the human view of the data. In other words,

– the raw data consist of the data and the associated information stored in the systems.

To stress this fact, the raw data are called tables-in-systems [3,1].
Next, we need a few notions:
1. A table-in-system is viewed as a knowledge representation of a universe of entities; each attribute induces a partition on the universe.
2. Two attributes (columns) of the same table are isomorphic if the correspondence between the attribute values (over all tuples) is one-to-one and onto.
3. Two tables-in-systems are isomorphic if there is a map that induces an isomorphism in each attribute (column).
For simplicity, we will assume that tables-in-systems have no isomorphic columns. Here are some results:
1. The two measures, support and confidence, are invariant under isomorphism. In other words, support and confidence are measures of the isomorphism class.
2. The probability theory based on counting items is a theory on the isomorphism class, not on an individual table.
3. Probability-based interestingness of association rules, e.g. entropy, is a notion on the whole isomorphism class, not on individual relations.
4. However, commonly referred-to interestingness is a notion on individual relations; hence it cannot be captured by probability theory based on counting items.
5. A sample is admissible if it produces a set of association rules isomorphic to that of the original table. Random samples are often not admissible.
6. All possible attributes (features) of a table-in-system can be determined [2]. Hence "invisible patterns" can be discovered.
7. Generalized association rules can be captured by solving linear inequalities [3,1].
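As a hedged sketch of notion 2 (the toy table and column names are illustrative assumptions), two columns are isomorphic exactly when the tuple-wise correspondence between their values is a bijection:

```python
from itertools import combinations

# A toy "table-in-system": each attribute is a column (list of values).
table = {"color": ["red", "blue", "red", "green"],
         "code":  [1, 2, 1, 3],
         "price": [10, 20, 30, 40]}

def isomorphic(col_a, col_b):
    """Two attributes are isomorphic iff the tuple-wise correspondence
    between their values is one-to-one and onto."""
    fwd, bwd = {}, {}
    for u, v in zip(col_a, col_b):
        if fwd.setdefault(u, v) != v or bwd.setdefault(v, u) != u:
            return False
    return True

for p, q in combinations(table, 2):
    print(p, q, isomorphic(table[p], table[q]))
# "color" and "code" are isomorphic: they induce the same partition of the
# universe and therefore yield the same support and confidence values.
```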
References
1. T.Y. Lin and H. Shi, "Mathematical Foundation of Association Rules – Mining Associations by Solving Integral Linear Inequalities," in: Data Mining and Knowledge Discovery: Theory, Tools, and Technology, Dasarathy (ed.), Proceedings of SPIE, Vol. 5098, Orlando, FL, April 21–25, 2003, to appear.
2. T.Y. Lin, "Attribute (Feature) Completion – The Theory of Attributes from Data Mining Prospect," in: Proceedings of the International Conference on Data Mining, Maebashi, Japan, Dec 9–12, 2002, pp. 282–289.
3. T.Y. Lin, "Mathematical Foundation of Association Rules – Mining Associations by Solving Integral Linear Inequalities," in: Proceedings of the Workshop on the Foundation of Data Mining and Discovery, part of the International Conference on Data Mining, Maebashi, Japan, Dec 9–12, 2002, pp. 81–88.
4. T.Y. Lin, "The Power and Limit of Neural Networks," Proceedings of the 1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1–4, 1996, Vol. 7, pp. 49–53.
Information Granules for Intelligent Knowledge Structures*
Patrick Doherty 1, Witold Łukaszewicz 1,2, and Andrzej Szałas 1,2
1 Department of Computer Science, University of Linköping, Sweden, [email protected]
2 The College of Economics and Computer Science, Olsztyn, Poland, {witlu,andsz}@ida.liu.se
Abstract. The premise of this paper is that the acquisition, aggregation, merging and use of information requires some new ideas, tools and techniques which can simplify the construction, analysis and use of what we call ephemeral knowledge structures. Ephemeral knowledge structures are used and constructed by granular agents. Each agent contains its own granular information structure and granular information structures of agents can be combined together. The main concept considered in this paper is an information granule. An information granule is a concise conceptual unit that can be integrated into a larger information infrastructure consisting of other information granules and dependencies between them. The novelty of this paper is that it provides a concise and formal definition of a particular view of information granule and its associated operators, as required in advanced knowledge representation applications.
1
Introduction
Information structures currently required in applications such as the WWW or cognitive robotics are becoming increasingly more complex in terms of scale and content. The premise of this paper is that the acquisition, aggregation, merging and use of such information requires some new ideas, tools and techniques which can simplify the construction, analysis and use of what we call ephemeral knowledge structures. Ephemeral knowledge structures are used and constructed by granular agents. Each agent contains its own granular information structure, and the granular information structures of agents can be combined together, and also with legacy structures such as relational databases, to form ephemeral knowledge structures which can be used in solving goal-directed tasks. The main concept considered in this paper is an information granule. An information granule is a concise conceptual unit that can be integrated into a larger information infrastructure consisting of other information granules and dependencies between them. Most importantly, the structure described is recursive in nature, and each conceptual unit (information granule) is (elaboration) tolerant within the context in which it is used at a particular level of abstraction in the total system infrastructure. The following requirements on information granules are intended to make these ideas more precise. The rest of the paper shows how these requirements are met.
Supported in part by the WITAS project grant under the Wallenberg Foundation, Sweden and KBN grant 8 T11C 009 19.
– Information should be structured at a more fine-grained level than is currently practiced. We call such chunks of fine-grained information elementary information granules.
– A library of operators should exist which permits the recursive construction or aggregation of information granules into complex information granules.
– Specialized interfaces should be specifiable which provide the ability to create various levels of abstraction, simplifying the complexity of the information for users and other agents.
– The information itself should be assumed approximate as a rule rather than as an exception, and rough set techniques will be used to provide a basis for a formal semantics for the ensuing structures.
– A granular information structure framework should be cleanly organized in a manner similar to objects, object structures and interfaces used in the object-oriented programming paradigm, since we are striving for a pragmatic mechanism for constructing dynamic knowledge structures in a modular and incremental manner.
The idea of granular computing has been proposed in the literature (see, e.g., [4, 5]). The novelty of this paper is that it provides a concise and formal definition of a particular view of information granule and its associated operators, as required in advanced knowledge representation applications. In addition, the resulting framework can be applied directly to a number of existing legacy structures such as information systems and relational databases. The querying mechanisms associated with granular information structures are provably tractable. In the rest of the paper, we will provide a precise specification of the notion of an information granule and operations on information granules and present a number of examples showing how various legacy and other structures can be recursively constructed using this framework.
2
Notation
Rough relations and their associated parts will be used to represent the semantic content of information granules. In order to construct a language for dealing with rough concepts, we introduce the following relation symbols for any rough relation R:
– R+ – represents the positive facts known about the relation. R+ corresponds to the lower approximation of R. R+ is called the positive region (part) of R.
– R− – represents the negative facts known about the relation. R− corresponds to the complement of the upper approximation of R. R− is called the negative region (part) of R.
– R± – represents the unknown facts about the relation. R± corresponds to the set difference between the upper and lower approximations of R. R± is called the boundary region (part) of R.
– R⊕ – represents the positive facts known about the relation together with the unknown facts. R⊕ corresponds to the upper approximation of R. R⊕ is called the positive-boundary region (part) of R.
– R⊖ – represents the negative facts known about the relation together with the unknown facts. R⊖ corresponds to the complement of the lower approximation of R. R⊖ is called the negative-boundary region (part) of R.
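As a minimal sketch of this notation, assuming finite relations represented simply as Python sets over a finite universe (an illustrative representation, not the paper's implementation), the five regions can be derived from a lower and an upper approximation as follows:

```python
# Derive the five regions of a rough relation R from its approximations;
# the dict keys mirror the symbols introduced above.
def rough_regions(universe, lower, upper):
    universe, lower, upper = set(universe), set(lower), set(upper)
    assert lower <= upper <= universe
    return {
        "R+": lower,              # positive region: lower approximation
        "R-": universe - upper,   # negative region: complement of upper approx.
        "R±": upper - lower,      # boundary region
        "R⊕": upper,              # positive-boundary: upper approximation
        "R⊖": universe - lower,   # negative-boundary: complement of lower approx.
    }

regions = rough_regions({1, 2, 3, 4}, lower={1}, upper={1, 2})
assert regions["R±"] == {2} and regions["R-"] == {3, 4}
```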
3
Elementary Granules
We begin with the notion of an elementary granule. This is closely related to the concept of an object in the object-oriented programming paradigm. Granules can be naturally viewed as information objects specialized for distributed information processing and interchange.
Definition 3.1. Let Sig be a signature, P be a set of parameters consisting of variable and constant symbols, and let N be a set of granule names. By a granule we understand a triple G = ⟨C, I, M⟩, where
– C is a set of constant symbols from Sig.
– I is an interface of G, which is an expression of the form Name : R1(a11, ..., a1m1), ..., Rk(ak1, ..., akmk), where Name ∈ N, R1, ..., Rk ∈ Sig are (rough) relation symbols and a11, ..., a1m1, ..., ak1, ..., akmk, called parameters of G, are members of P.
– M = {Mi♯ : 1 ≤ i ≤ k and ♯ ∈ {+, −}} is a set of methods. A method Mi♯, for 1 ≤ i ≤ k, is a function that allows one to compute the relations Ri♯(ai1, ..., aimi) as follows1: for a vector of length mi (1 ≤ i ≤ k) consisting of 0 ≤ m ≤ mi variables and (mi − m) constant symbols, Mi♯ returns a set of m-ary tuples of constants or the truth value Unknown. If m = 0 then the method returns a truth value True, False or Unknown.2
A query to a granule is any expression of the form Ri♯(ai1, ..., aimi), where 1 ≤ i ≤ k, ♯ ∈ {+, −, ⊕, ⊖, ±} and ai1, ..., aimi are constant symbols or variables. An answer to a query Ri♯(ai1, ..., aimi) is:
– the set computed by method Mi♯ with input vector ⟨ai1, ..., aimi⟩, when ♯ ∈ {+, −}
– the result computed according to the definitions of Ri±, Ri⊕, Ri⊖, when ♯ ∈ {⊕, ⊖, ±}.
The idea of a granule, as defined above, is quite general. We now focus on more specialized granules. Definition 3.2. Let Sig be a signature, P be a set of parameters and N be a set of granule names. By an elementary granule we understand any granule G = ⟨C, I, M⟩, where:
– C is a set of constants from Sig
– the interface I is of the form Name : R(ā), where R ∈ Sig and all parameters in ā belong to P
– M consists of a set of methods that allow one to compute R♯, for ♯ ∈ {+, −}.
Elementary granules can be used to represent more complex structures such as information systems and decision rules commonly used in the rough set paradigm. Example 3.3. Any information system A, with k attributes a1, ..., ak, can be viewed as an elementary granule, in which:
– the set of constant symbols is the set of constants appearing in the information table
– the interface consists of the information system name, together with the relation, denoted by RA(x, xa1, ..., xak), it represents3, where x, xa1, ..., xak are new variables corresponding to columns in the information table representing A
– the methods it contains correspond to standard deductive database queries of the form RA♯(w1, ..., wk), where each wi is a variable or a constant symbol.
Consider an information system A = ⟨U, A⟩ shown in Table 1, where r, b, g, s, m, l, h stand for "red", "blue", "green", "small", "medium", "low" and "high", respectively. It can be represented as an elementary granule, where:
1 If a superscript ♯ is omitted, it is +, by default.
2 If a method Mi♯ is not specified, it is assumed to always return the value Unknown for a query Ri♯(ai1, ..., aimi).
3 Note that any table with k attributes can be viewed as a k-argument relation. This identification is standard in the field of relational databases.
Table 1. Information table considered in Example 3.3.
Object  Size  Color  Speed
1       m     r      l
2       s     b      h
3       m     g      l
– the set of constants is {1, 2, 3, m, s, r, b, g, l, h}
– the interface is A : RA(x, xSize, xColor, xSpeed)
– the methods correspond to standard deductive database queries with the local closed world assumption4 applied to RA+ only. For example:
• query RA(x, m, y, l) returns the answer {⟨1, r⟩, ⟨3, g⟩}
• query RA(2, m, b, h) returns the answer False.
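A minimal sketch of this example follows, assuming (as an illustration only) that the table is stored as a dict from objects to attribute tuples and that variables in a query are marked by None; the helper names are hypothetical.

```python
# Example 3.3 as code: the information table as an elementary granule whose
# method answers queries R_A(w1, ..., wk) under the local closed-world
# assumption on R_A+ (tuples absent from the table count as negative).
TABLE = {1: ("m", "r", "l"), 2: ("s", "b", "h"), 3: ("m", "g", "l")}
VAR = None  # variables in a query are marked by None

def query(*pattern):
    """Return bindings for the variable positions, or True/False if ground."""
    answers = set()
    for obj, attrs in TABLE.items():
        row = (obj,) + attrs
        if all(p is VAR or p == v for p, v in zip(pattern, row)):
            answers.add(tuple(v for p, v in zip(pattern, row) if p is VAR))
    if all(p is not VAR for p in pattern):   # ground query: a truth value
        return bool(answers)
    return answers

assert query(VAR, "m", VAR, "l") == {(1, "r"), (3, "g")}   # R_A(x, m, y, l)
assert query(2, "m", "b", "h") is False                    # R_A(2, m, b, h)
```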
Decision rules are commonly generated from information systems using rough set machine learning techniques. Example 3.4. Any decision rule of the form (a1 = c1) ∧ ... ∧ (an = cn) ⇒ (d = cd) can be viewed as an elementary granule, in which:
– the set of constant symbols is the set {c1, ..., cn, cd}
– the interface consists of the granule name, together with a relation, denoted by R(xa1, ..., xan, xd)
– the methods of the granule return ⟨cd⟩ for the query R+(c1, ..., cn, x), where x is a variable, and Unknown for all other queries.
The elementary granules presented in Examples 3.3 and 3.4 are self-contained in the sense that no information external to the granule is used in computing their methods. It is also possible to consider elementary granules which are not self-contained, as illustrated in the next example. Example 3.5. Any Boolean descriptor5 can be viewed as an elementary granule, in which:
– the set of constant symbols is the set of constants appearing in the descriptor
– the interface consists of a granule name, together with a relation, denoted by RA(a1, ..., ak), where the parameters a1, ..., ak are all attributes occurring in the descriptor. RA refers to an information system A which is to be an input to the granule
– the methods it contains compute answers to queries treating the Boolean descriptor as a standard deductive database query (see, e.g., [1]). The Boolean descriptor itself always returns the empty set of tuples. This is due to the fact that it contains no information about RA, i.e., from its perspective RA is empty.6
To create an elementary granule G, we introduce a special operation MakeEGranule(C, I, M), which constructs the granule ⟨C, I, M⟩.
4 The local closed-world assumption is a generalization of the standard CWA used in classical database theory. In this case, one has more local control of which relations are minimized to implicitly produce negative relational information.
5 Boolean descriptors are any Boolean combinations of elementary descriptors of the form (a = v), where a is an attribute and v is a value.
6 In Section 4 we provide operations which make such granules useful.
4
Information Granule Systems
Information granules are constructed from elementary granules and previously constructed information granules by means of a number of operations. Definition 4.1. Let E be a set of elementary granules and let G be a set of granules such that E ⊆ G. Assume I is a finite set and {Oi : i ∈ I} are ki-argument operations on G, i.e., for any i ∈ I, Oi : G × ... × G → G (ki times). We say that ⟨E, G, {Oi : i ∈ I}⟩ is an information granule system provided that G is the least set containing all elementary granules of E and closed under the operations in {Oi : i ∈ I}. Any element of G is called an information granule.
4.1
Operation Link
As we have seen in the previous section, granules which are not self-contained require additional input information from other granules in order to provide meaningful answers to queries. Accordingly, we require a mechanism for linking granules with their inputs. The operation Link serves this purpose7, where Link : G × G × ... × G → G (n times).
Operation Link(G, G1, ..., Gn) generates a granule such that:
– the set of constant symbols of the resulting granule is the union of the sets of constant symbols of G, G1, ..., Gn
– the interface is the interface of G, except that its name is a fresh name from N
– the methods of the constructed granule compute answers as the methods of G, under the assumption that the results computed by G1, ..., Gn are accessible to the methods in G.
Example 4.2. As observed in Example 3.5, Boolean descriptors are elementary granules. Let G be an elementary granule obtained in the manner described in Example 3.5 and corresponding to the Boolean descriptor (Size = m) ∧ ((Color = r) ∨ (Color = g)). Let GA be an elementary granule, obtained in the manner described in Example 3.3, representing the information table provided in Table 1. Consider the granule GL = Link(G, GA). Granule GL represents all objects in Table 1 satisfying the considered Boolean descriptor, i.e., the information table shown in Table 2.
Table 2. Information table considered in Example 4.2.
Object  Size  Color  Speed
1       m     r      l
3       m     g      l
Observe that, in fact, Link is a family of operations, for all n > 0.
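A minimal sketch of Example 4.2 follows, assuming an illustrative row-list representation of the table; the names are hypothetical. The descriptor granule alone would answer with the empty set; Link makes the table granule's results accessible to it, after which the descriptor selects the satisfying rows.

```python
# Example 4.2 as code: Link(G, G_A) supplies the table granule G_A as input
# to the Boolean-descriptor granule G, yielding Table 2.
TABLE1 = [
    {"Object": 1, "Size": "m", "Color": "r", "Speed": "l"},
    {"Object": 2, "Size": "s", "Color": "b", "Speed": "h"},
    {"Object": 3, "Size": "m", "Color": "g", "Speed": "l"},
]

def descriptor(row):  # (Size = m) ∧ ((Color = r) ∨ (Color = g)) as a predicate
    return row["Size"] == "m" and row["Color"] in ("r", "g")

def link(descr, table):  # the descriptor's method, with the table accessible
    return [row for row in table if descr(row)]

assert [r["Object"] for r in link(descriptor, TABLE1)] == [1, 3]
```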
4.2
Operation Project
Operation Project allows one to project out certain columns of a relation8: Project : G × ω × ... × ω → G (r times), where the first argument of Project is a granule of the form G = ⟨C, I, M⟩ with I = name : R(x1, ..., xk) and 0 < r ≤ k. Operation Project(G, i1, ..., ir), where i1, ..., ir are distinct natural numbers from {1, ..., k}, provides a granule ⟨C1, I1, M1⟩ such that:
– C1 = C
– I1 is of the form name1 : R1(xi1, ..., xir), where name1 is a fresh name
– the methods of the constructed granule compute the relations R1♯ given by:
R1♯(xi1, ..., xir) ≡def ∃z1 ... ∃zp [R♯(x1, ..., xk)], where {z1, ..., zp} = {x1, ..., xk} − {xi1, ..., xir}.
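On the level of tuple sets, the existential quantification above amounts to projecting every tuple onto the kept positions; a minimal sketch (the representation is an assumption):

```python
# Project a finite relation onto 1-based column indices i1, ..., ir; the
# quantified-away variables simply disappear from each tuple.
def project(tuples, *kept):
    return {tuple(t[i - 1] for i in kept) for t in tuples}

R_plus = {("m", "r", "l"), ("s", "b", "h"), ("m", "g", "l")}
assert project(R_plus, 1, 3) == {("m", "l"), ("s", "h")}
```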
4.3
Operation MakeModule
Operation MakeModule allows one to construct a container of granules, responsible for one or more relations. MakeModule is defined as9: MakeModule : G × ... × G → G (n times).
Operation MakeModule(G1, ..., Gn) provides a granule GM such that:
– the set of constant symbols of GM is the union of the sets of constant symbols of G1, ..., Gn
– the interface of GM is of the form Name : R1(x̄1), ..., Rm(x̄m), such that Name is a fresh name from N, and R1(x̄1), ..., Rm(x̄m) are all the relations occurring in the interfaces of G1, ..., Gn
– the methods of GM compute answers, solving potential conflicts by returning the value Unknown whenever one argument granule returns True and another returns False for the query.
Example 4.3. Consider the following three granules:
– ⟨{a}, G1 : R(x), M1+⟩, where the method M1+ states that R+(a) holds
– ⟨{a, b}, G2 : R(x), M2+, M2−⟩, where the methods of G2 state that R−(a) and R+(b) hold
– ⟨{a}, G3 : Q(x), M3+⟩, where the method M3+ states that Q+(a) holds.
Let GM = MakeModule(G1, G2, G3). GM is a granule defined as follows:
– the set of constants of GM is {a, b}
– the interface of GM is name : R(x), Q(x), where name ∈ N is a fresh name
8 This operation, actually a family of operations, is frequently used in relational and deductive databases.
9 Observe that, in fact, MakeModule is a family of operations, for all n > 0.
– the methods of GM provide the answer True, for queries R+ (b) and Q+ (a), and the answer Unknown when asked about R+ (a) and all other queries.
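The three-valued conflict resolution used by MakeModule can be sketched in a few lines; strings stand in for the truth values True/False/Unknown only to keep three values distinct (an illustrative choice, not the paper's machinery):

```python
# MakeModule's default policy: a True and a False answer for the same ground
# query cancel out to Unknown; a lone definite answer survives.
def module_answer(answers):
    answers = {a for a in answers if a != "Unknown"}
    if answers == {"True"}:
        return "True"
    if answers == {"False"}:
        return "False"
    return "Unknown"   # conflict, or no ingredient granule knows the answer

# Example 4.3: G1 affirms R+(a), G2 denies it; G2 alone affirms R+(b).
assert module_answer(["True", "False", "Unknown"]) == "Unknown"  # query R+(a)
assert module_answer(["Unknown", "True", "Unknown"]) == "True"   # query R+(b)
```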
The operations Link and MakeModule can be used to provide more sophisticated voting policies. Suppose the methods of the granules Gi, for 1 ≤ i ≤ n, resolve conflicts between granules Gi1, ..., Giki in a non-standard way. In order to implement such voting policies, it suffices to construct the granule MakeModule(Link(G1, G11, ..., G1k1), ..., Link(Gn, Gn1, ..., Gnkn)).
4.4
Operation Hide
Operation Hide allows one to hide certain relations from an interface.10 Hide : G × ω × ... × ω → G (r times), where the first argument of Hide is a granule of the form G = ⟨C, I, M⟩ with I = name : R1(x̄1), ..., Rk(x̄k) and 0 < r ≤ k. Operation Hide(G, i1, ..., ir), where i1, ..., ir are distinct natural numbers from {1, ..., k}, provides a granule ⟨C1, I1, M1⟩ such that:
– C1 = C
– I1 is of the form name1 : Ri1(x̄i1), ..., Rir(x̄ir), where name1 is a fresh name
– M1 = M.
The sets of constants and methods are the same in G and Hide(G, i1, ..., ir), as computing the relations Ri1(x̄i1), ..., Rir(x̄ir) might require access to all constants and relations, including the hidden ones.
5
Example: Building Classifiers
Consider the problem of classifying given objects on the basis of some available attributes. For instance, one might want to classify a color on the basis of its RGB attributes. Another, far more complicated, example could involve classifying traffic situations on a road based on video sequences. Software modules that provide such classifications are usually called classifiers. One approach to constructing classifiers is based on providing suitable decision rules (which may be learned). For instance, one might generate the following decision rules for a classifier recognizing colors, where y, b, dR stand for "yellow", "brown" and "dark red", respectively:
(R = 255) ∧ (G = 255) ∧ (B = 51) ⇒ (Color = y)
(R = 153) ∧ (G = 51) ∧ (B = 0) ⇒ (Color = b).
The above rules do not lead to any conflicting decisions. However, in general, such conflicts may appear. Such classifiers can easily be represented as information granules. The construction process consists of three steps:
10 In fact, Hide is a family of operations.
1. Construct granules Gj corresponding to each particular decision rule, for j = 1, ..., r, as done in Example 3.4.
2. Construct a granule G solving possible conflicts among decision rules.
3. Construct the term MakeModule(Link(G, G1, ..., Gr)), which defines the required classifier.
The conflict resolution granule can be defined as ⟨C, G : R(d), M⟩, where:
– C is the set of constants occurring in G1, ..., Gr
– R is a one-argument relation returning unary tuples identified with decisions
– M is a method solving conflicts.
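The following is a minimal sketch of the three-step construction for the color rules above, with one deliberately simple conflict policy (answer Unknown on disagreement); the rule encoding and names are illustrative assumptions.

```python
# A classifier as MakeModule(Link(G, G1, ..., Gr)): each rule granule fires
# on an exact condition match, and the conflict granule G resolves ties.
RULES = [                      # ((R, G, B), Color) for the two rules above
    ((255, 255, 51), "y"),
    ((153, 51, 0), "b"),
]

def classify(rgb):
    decisions = {d for cond, d in RULES if cond == rgb}  # rule-granule answers
    if len(decisions) == 1:
        return decisions.pop()
    return "Unknown"           # no rule fired, or conflicting rules fired

assert classify((255, 255, 51)) == "y"
assert classify((0, 0, 0)) == "Unknown"
```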
From a conceptual point of view the notion of a classifier is then relatively simple. However, constructing a classifier is a complex task. The main difficulties lie in providing suitable decision rules and resolving conflicts.
6
Meeting the Requirements
In the introduction, we provided a list of requirements on a granular information framework that we considered important from both pragmatic and conceptual points of view. We have described the techniques and formal tools used to meet these requirements. Information is represented in fine-grained chunks called information granules. A library of operators has been introduced for constructing granular information structures of arbitrary complexity in a recursive manner. Each of the granules has its own interface, and information may be abstracted or hidden. The semantic content of the granules is approximate in nature and can be viewed as parts of rough relations. The framework has the flavor of an object-oriented approach to structuring knowledge. It is important to note that the granules and operators on granules can be used by granular agent systems to construct, aggregate and merge information structures of arbitrary complexity, and even to integrate existing legacy structures into the ephemeral knowledge structures produced. Specific applications of these ideas and the tractable querying techniques used on granular information structures may be found in the papers by Doherty et al. [2, 3].
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley Pub. Co., 1996.
2. P. Doherty, W. Łukaszewicz, A. Skowron, and A. Szałas. Combining rough and crisp knowledge in deductive databases. In [6], 2003.
3. P. Doherty, W. Łukaszewicz, and A. Szałas. CAKE: A computer-aided knowledge engineering technique. In F. van Harmelen, editor, Proc. 15th European Conference on Artificial Intelligence, ECAI'2002, pages 220–224, Amsterdam, 2002. IOS Press.
4. T.Y. Lin. Granular computing on binary relations I, II. In [7], pages 107–140, 1998.
5. H.S. Nguyen, A. Skowron, and J. Stepaniuk. Granular computing: A rough set approach. Computational Intelligence, 17:514–544, 2001.
6. S.K. Pal, L. Polkowski, and A. Skowron, editors. Rough-Neuro Computing: Techniques for Computing with Words. Springer-Verlag, Heidelberg, 2003.
7. L. Polkowski and A. Skowron, editors. Rough Sets in Knowledge Discovery 1: Methodology and Applications, volume 17 of Studies in Fuzziness and Soft Computing. Physica-Verlag, Heidelberg, 1998.
Design and Implement for Diagnosis Systems of Hemorheology on Blood Viscosity Syndrome Based on GrC Qing Liu, Feng Jiang, and Dayong Deng Department of Computer Science, Nanchang University, Nanchang 330029
Abstract. This paper discusses the design and implementation of diagnosis software based on blood flowing dynamic theory for blood viscosity syndrome (BVS). BVS is a clinical syndrome caused by one or several blood viscosity factors. The diagnosis and treatment software is a reasoning system that simulates the experience of clinical experts. In the system, the experience of experts is transformed into mathematical formulas using a rough-fuzzy & fuzzy-rough approach, and the reasoning system is then created from these formulas and granular computing. The diagnosis software has been successfully applied to several thousand clinical cases. The system is dynamic: it can learn from examples by visiting the case base, so addition, subtraction, adjustment and update of treatment measures are implemented dynamically. Compared with similar systems, its formulas are more complete, and its diagnostic efficiency in the clinic is higher than that of doctors.
1
Introduction
The diagnosis software based on blood flowing dynamic theory for BVS is a fuzzy reasoning machine simulating expert experience. According to clinical experience, BVS is a common clinical manifestation [1] that often exhibits irregular changes: the same disease type may receive different BVS diagnoses in different clinical examinations, and different disease types may receive the same BVS diagnosis. Hence, the treatment measures must be adjusted dynamically according to clinical observation, so that the designed diagnosis software can be used in the ever-changing clinic. According to the basic theoretical principles of traditional Chinese medicine on "Dialectical Applying Treatment", "The Same Disease Using Different Treating" and "Different Disease Using the Same Treating", we designed the diagnosis software system of blood flowing dynamic theory for BVS. The system includes a knowledge base consisting of 6 tables. The 6 tables are both nested and independent, and they are connected to each other by mappings [2],[4]. The tables can also be queried and searched by foreign keys. The system is developed and implemented in Delphi 5.0 and InterBase 5.0, running on Win98/2000. The operations of the system are standardized and easy to master.
2
The System Structure and Relation between All Parts in It
The following shows the structure of the system and the flow of its execution:
The knowledge and data base of the system consists of 6 tables: the interval table of reference values for normal people, the table of gathered clinical data, the weight value table of 18 indexes, the level table of subtypes, the table of disease types, and the table of treatment measures & suggestions. All operations in the system are performed on these tables, and the data in the tables is computed dynamically while the system executes. Hence, this system is a dynamic diagnosis software system. The procedure of system execution is as follows: (1) Gathering the clinical data of patients, including name, sex, age, case history and tested values of indexes, and filling them into table 1. The gathered data is used as the basis of diagnosis for the patients, and also offers information for later inquiries into the case history. (2) Granulating the intervals of reference values for normal people. The interval of reference values of each index for normal people is formulated via the rough-fuzzy and fuzzy-rough approach, namely the experience of experts is transformed into the
mathematical formula. The average value AV and standard deviation SD of each interval are computed by the formulas and filled into table 0. The level of each index is computed according to the experience of experts and filled into table 2; thus the data in table 2 is generated dynamically. (3) Computing the levels of subtypes. From table 2 and the experiential formulas of experts, via mathematical handling, we compute the level of each subtype and fill it into table 3; thus the data in table 3 is generated dynamically. (4) Diagnosing patients. We diagnose the type of disease for the patients with table 3 and the experiential formulas of experts via mathematical handling, including blood hyper-viscosity syndrome (BHS), blood lower-viscosity syndrome (BLS), blood hyper-lower-viscosity syndrome (BHLS) and blood lower-hyper-viscosity syndrome (BLHS), etc. The data in table 4 is generated dynamically. (5) Giving treatment measures and suggestions. Based on granular computing, the treatment measures and suggestions are derived from table 5 (omitted) according to the diagnosed type of disease. Since the system maintains a base of case histories, we arrange a cache for visiting the case history base in the procedure of reasoning about treatment measures and suggestions, comparing with related case histories in the base, thus implementing intelligent learning from examples. (6) Printing the recipe. The diagnosis result, treatment measures and suggestions are included in the recipe. Besides, an explanation is attached, letting patients know how to treat themselves in cooperation with the doctor.
3
Principles of the System
The design of the system is based on rough-fuzzy, fuzzy-rough and granular computing methods. First of all, we handle the interval of reference values of each blood viscosity index for normal people using the rough-fuzzy and fuzzy-rough approach, namely computing the average value AV and standard deviation SD of each interval in table 0. Let [a, b] be an interval. An indiscernibility relation R is defined on the interval, namely ∀x1, x2 ∈ [a, b], x1 R x2 iff |x1 − x2| ≤ 0.618; the transitivity of R is defined on [a + j∗0.618, a + (j+1)∗0.618] ⊆ [a, b], j = 0, 1, .... Thus the interval is divided by R until a + n∗0.618 > b, where n is the total number of small intervals. The value 0.618 here is chosen from the "Gold Cut" (golden section) of ancient Chinese mathematics; Professor Luogeng Hua, the famous Chinese mathematician, also used this number in his book "Methods for Plan as a Whole" (in Chinese), where it is called the best choosing point on [0, 1]. It is also a fuzzy concept, and an indiscernibility relation is defined by it. The partition of [a, b] is also called granulating. On the interval [a, b], we compute the average value
AV = (a + Σ_{j=1}^{n−2} (a + j∗0.618) + b)/n    (1)
and the standard deviation
SD = sqrt(Σ_{j=0}^{n−1} ((a + j∗0.618) − AV)² / n)    (2)
respectively, where sqrt is the square root function. We then compute the index value corresponding to the clinical testing data TV, namely
IV = (TV − AV)/SD for TV > a;  IV = (−TV + AV)/SD for TV ≤ a    (3)
where TV is the clinically tested blood viscosity datum from the testing instrument, and a is the lower bound of the corresponding interval. The IV computed by (3) is also filled into table 1.
Table 0 of Reference Value Intervals for Normal People
Index item  interval1 (1-200)  AV1  SD1  interval3 (3-5)  AV3  SD3  interval6     AV6  SD6
Male        [4.42,4.97]        ?    ?    [8.73,10.27]     ?    ?    [0.00,15.00]  ?    ?
Female      [3.32,4.08]        ?    ?    [6.86,8.52]      ?    ?    [0.00,20.00]  ?    ?
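The "?" entries of Table 0 are filled by formulas (1) and (2); the following is a literal sketch of (1)-(3). The text's rule for choosing n is ambiguous for short intervals, so n is taken here as a parameter: with n = 2 the reconstruction reproduces AV1 ≈ 4.70 from the worked example later in the paper, though the published SD1 = 0.28 differs slightly, so the exact SD convention remains uncertain.

```python
import math

def av_sd(a, b, n, step=0.618):
    # formula (1): average of the granule points a, a+0.618, ..., capped by b
    av = (a + sum(a + j * step for j in range(1, n - 1)) + b) / n
    # formula (2): standard deviation over the granule points
    sd = math.sqrt(sum((a + j * step - av) ** 2 for j in range(n)) / n)
    return av, sd

def iv(tv, a, av, sd):
    # formula (3): standardized index value of a clinical test value TV
    return (tv - av) / sd if tv > a else (-tv + av) / sd

av, sd = av_sd(4.42, 4.97, n=2)   # male interval for index 1 in Table 0
print(round(av, 2), round(sd, 2), round(iv(3.5, 4.42, av, sd), 2))
```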
Table 1 of Clinical Data
Index item  index1 (1-200)  TV1  IV1  interval3 (3-5)  TV3  IV3  ...  age  sex     Case history  emergency  EPG
Patient1    [4.42,4.97]     ?    ?    [8.73,10.27]     ?    ?    ...  ?    male
Patient2    [3.32,4.08]     ?    ?    [6.86,8.52]      ?    ?    ...  ?    female
We can compute the level of each index value from AV, SD & IV with the following rules:
(1) 0 ≤ IV < AV+1∗SD → 0; (2) 2 > IV ≥ AV+1∗SD → 1; (3) 3 > IV ≥ AV+2∗SD → 2; (4) 4 > IV ≥ AV+3∗SD → 3; (5) 5 > IV ≥ AV+4∗SD → 4; (6) 6 > IV ≥ AV+5∗SD → 5; (7) IV ≥ AV+6∗SD → 6; (8) IV < -AV+1∗SD → 0; (9) IV ≥ -AV+1∗SD → -1; (10) IV ≥ -AV+2∗SD → -2; (11) IV ≥ -AV+3∗SD → -3; (12) IV ≥ -AV+4∗SD → -4; (13) IV ≥ -AV+5∗SD → -5; (14) IV ≥ -AV+6∗SD → -6.
Level 0 is interpreted as the blood viscosity concentration of a normal person. Hence, the larger the absolute value of the level, the worse the blood viscosity concentration; the level describes the degree of disease. The computed levels are filled into table 2.
Table 2 of Index Value Level
Index item  index1 (1-200)  level1  index6 (ESR)  level6  index7 (hematocrit)  level7  ...
Patient1    TV1 ≤ a1        ?       TV6 > a6      ?       TV7 ≤ a7             ?       ...
Patient2    TV1 ≤ b1        ?       TV6 > b6      ?       TV7 ≤ b7             ?       ...
In table 2, ai and bi are the lower bounds of the reference value intervals for males and females, respectively. We compute the levels of the subtypes according to the experiential formulas of the experts; the computed levels are filled into table 3. The subscript number n of a data item in each table is a code corresponding to the nth index item; the following is similar.
•1 Concentration: ConL = (4∗level7 + 4∗level5 + level3)/9 ∨ (4∗level7 + 4∗level5)/8;
•2 Viscosity: VL = ((level1 + level2 + level3)∗2 + 4∗level4 + level5)/11;
•3 Aggregation: AL = (3∗level11 + level18)/4 ∨ (3∗level11 + (level14 + level15)/2∗1.5)/4;
•4 Coagulation: CoaL = (2∗level13 + (level14 + level15)/2∗1.5∗3 + level18)/6 ∨ (2∗level13 + (level14 + level15)/2∗1.5∗3)/5 ∨ (3∗level13 + level11)/4;
•5 Hematocrit: HL = (TV7 − AV7)/SD7;
•6 Erythrocyte Aggregation: EAL = (level11 + level12)/2;
•7 Red Cell Rigidity: RCRL = (TV10 − AV10)/SD10;
•8 Blood Plasma Viscosity: BPVL = (TV5 − AV5)/SD5;
•9 Blood Platelet Aggregation: BPAL = (TV18 − AV18)/SD18.
Level Table 3 of Subtypes
Subtypes  ConL  VL  AL  CoaL  HL  EAL  RCRL  BPVL  BPGL  ...
Patient1  ?     ?   ?   ?     ?   ?    ?     ?     ?     ...
Patient2  ?     ?   ?   ?     ?   ?    ?     ?     ?     ...
The factors influencing blood viscosity syndrome are computed in the system. Disease types are diagnosed from the signs (positive or negative) of the values of the nine subtypes; namely, Normal, BHS, BLS, BHLS or BLHS is determined by the levels of the nine subtypes. We may also decide the degree corresponding to the type of disease using the levels of the nine subtypes. Thus, table 4 is obtained dynamically.
Table 4 of Disease Types
attribute  conl  vl  al  coal  hl  eal  rcrl  bpvl  bpgl  bhs  bls  bhls  blhs  Nor
Patient1   ?     ?   ?   ?     ?   ?    ?     ?     ?     ?    ?    ?     ?     ?
Patient2   ?     ?   ?   ?     ?   ?    ?     ?     ?     ?    ?    ?     ?     ?
In table 4, the condition attribute set is C = {ConL, VL, AL, CoaL, HL, EAL, RCRL, BPVL, BPGL} and the decision attribute set is D = {BHS, BLS, BHLS, BLHS, Nor}. The '?' entries in the table are generated by computation during system execution. The following introduces the theories of reasoning in the system.
4
Knowledge Base and Reasoning in It
We can create approximate reasoning based on granular computing. Hence we define the following concepts. Definition 1 (degree of inclusion and closeness). Let G = (ϕ, m(ϕ)) and G' = (ϕ', m(ϕ')) be two granules [2],[3]; their degree of inclusion and closeness are defined as follows, respectively:
(1) G is included in G' in degree at least p, denoted by Vp(G, G'), where p ∈ [0,1]. Formally,
V(G, G') = Card(G ⊗ G')/Card(G) for G ≠ ∅; V(G, G') = 1 otherwise,
where ⊗ is a symbol of granular operations. For a given positive real number p ∈ [0,1], if V(G, G') ≥ p, then the relation of G being included in G' in degree at least p is regarded as an extraction or a satisfaction;
(2) G is close to G' in degree at least p, namely the closeness relation Clp(G, G'), which holds iff Vp(G, G') ∧ Vp(G', G).
Proposition 1. |= (ϕ → ψ, m(ϕ → ψ)) is derivable iff |= ϕ → ψ is derivable and m(ϕ) ⊆ m(ψ) holds.
Proposition 2. ∀ϕ, ψ ∈ RLIS:
(1) m(av) = {x ∈ U : a(x) = v ∈ V}, where V is the set of attribute values and m is the semantic function of common logical formulas;
(2) (∼ϕ, m(∼ϕ)) = U − (ϕ, m(ϕ));
(3) (ϕ ∨ ψ, m(ϕ ∨ ψ)) = (ϕ, m(ϕ)) ⊕ (ψ, m(ψ));
(4) (ϕ ∧ ψ, m(ϕ ∧ ψ)) = (ϕ, m(ϕ)) ⊗ (ψ, m(ψ));
(5) ((∀x)ϕ(x), m(ϕ)) = (ϕ(e1) ∧ ... ∧ ϕ(en), m(ϕ(e1) ∧ ... ∧ ϕ(en))) = (ϕ(e1), m(ϕ)) ⊗ ... ⊗ (ϕ(en), m(ϕ)), assuming here that the universe U of all objects is finite. In fact, IS is finite, and for ∀x ∈ VAR, ur ∈ VAL, there exists an entity ei ∈ U such that ur(x) = ei, i = 1, ..., n, where ur is an assignment symbol for object variables [2],[4]. Formulas with the connectives → and ↔ can be rewritten using ∼ and ∨ or ∧.
Proposition 3. Let G = ((ϕ, m(ϕ)), (ψ, m(ψ))) and G' = ((ϕ', m(ϕ')), (ψ', m(ψ'))) be two granules. If they are close in degree at least p in IS, written Clp(G, G'), then we have (1) Clp(m(ϕ), m(ϕ')); (2) Clp(m(ψ) − m(ϕ), m(ψ') − m(ϕ')); (3) Clp(U − m(ϕ), U − m(ϕ')).
Proposition 4. If G = ((ϕ, m(ϕ)), (ψ, m(ψ))) and G' = ((ϕ', m(ϕ')), (ψ', m(ψ'))) are granules defined by decision rules [2],[3], and Clp((ϕ, m(ϕ)), (ϕ', m(ϕ'))), then Clp((ψ, m(ψ)), (ψ', m(ψ'))) holds.
This proposition is an important criterion for search reasoning in Artificial Intelligence. The search reasoning is performed in the rule base of an expert system. A traditional rule base is a set of rules of the form ϕ1 → ψ1, ..., ϕi → ψi, ..., ϕn → ψn. The granulations corresponding to them are of the form ((ϕ1, m(ϕ1)), (ψ1, m(ψ1))), ..., ((ϕi, m(ϕi)), (ψi, m(ψi))), ..., ((ϕn, m(ϕn)), (ψn, m(ψn))), respectively. The set of these granulations is the granulation base, of the form ((ϕi, m(ϕi)), (ψi, m(ψi))), where i = 1, 2, ..., n. Thus, inference in a traditional rule base is transformed into reasoning in the granulation base, namely matching the granule (ϕ, m(ϕ)) gathered in a situation against the condition granule (ϕi, m(ϕi)) of each granulation in the granulation base. If Clp((ϕ, m(ϕ)), (ϕi, m(ϕi))) holds, then the decision (ψi, m(ψi)) of the granulation ((ϕi, m(ϕi)), (ψi, m(ψi))) is chosen. The following is the procedure of inference:
(1) Gathering a group of data related to the goal of the search.
(2) The group of data is denoted by a rough logical formula ϕ and constructed as an elementary granule (ϕ, m(ϕ)) on IS.
(3) Computing the closeness degree pi between this granule and the condition granule (ϕi, m(ϕi)) of each granulation ((ϕi, m(ϕi)), (ψi, m(ψi))) in the granulation base; if pi ≥ p, then {pi} ∪ List, where p is a given threshold and List is a table storing the closeness degrees pi, until matching against every granulation in the granulation base has finished. That is:
For (i=1; !eof(f); i++) { pi = V(G, Gi); pi' = V(Gi, G); pi'' = min(pi, pi'); if (pi'' ≥ p) List = {pi''} ∪ List; } \* f is the file of the granulation base, p is a given threshold, List is a table storing the closeness degrees that satisfy the threshold *\
(4) pi = max(List). The goal is then the decision granule (ψi, m(ψi)) of the granulation ((ϕi, m(ϕi)), (ψi, m(ψi))) corresponding to the closeness degree pi. If the maximal pi is attained more than once, say by {pi1, ..., pij, ..., pim}, then take ij = min{i1, ..., ij, ..., im}; namely, the decision part (ψij, m(ψij)) of the earliest granulation ((ϕij, m(ϕij)), (ψij, m(ψij))) corresponding to the closeness degree pij is our needed goal.
For example, let the data set gathered in the clinic for a patient be P = {wang, Male, 65, 3.5, 4.5, 7.2, 16.1, 1.2, 1.0, 0.35, 5.5, 32.2, 2.8, 3.5778, 50.0, 1.5, 8.5, 6.5, 2.5, 10.2, 28.0}, which is written as a rough logical formula ϕ = name_wang ∧ Sex_male ∧ Age_65 ∧ TV_3.5. The granulation corresponding to it is written G = (ϕ, m(ϕ)), where the granule corresponding to a sub-formula ϕi is Gi = (ϕi, m(ϕi)); for instance, G2 = (ϕ2, m(ϕ2)) = (male, m(male)).
1. Computing the average value AV and standard deviation SD of each index item interval by the rough-fuzzy and fuzzy-rough approach. G2 is matched with the granulation (male, m(male)) of the first row in table 0, so the decision part m(male) is chosen. Hence [4.42,4.97], [5.41,6.33], ..., are granulated by rough-fuzzy and fuzzy-rough respectively, giving:
AV1=4.70, SD1=0.28; AV2=5.92, SD2=0.38; AV3=9.58, SD3=0.59; AV4=20.49, SD4=1.54; AV5=1.50, SD5=0.07; AV6=7.71, SD6=4.61; AV7=0.47, SD7=0.07; AV8=8.50, SD8=1.02; AV9=43.46, SD9=5.15; AV10=5.16, SD10=0.92; AV11=3.46, SD11=0.13; AV12=36.77, SD12=21.40; AV13=3.14, SD13=0.75; AV14=17.45, SD14=4.98; AV15=16.11, SD15=4.96; AV16=3.25, SD16=0.30; AV17=13.35, SD17=1.22; AV18=37.61, SD18=8.55.
2. Computing the index value IV of each index item from the clinical testing value TV, AV and SD:
IV1=-4.29, IV2=-3.74, IV3=-4.03, IV4=-2.85, IV5=-4.29, IV6=-1.46, IV7=-1.77, IV8=-2.94, IV9=-2.19, IV10=-2.57, IV11=0.98, IV12=-0.62, IV13=-2.19, IV14=-1.80, IV15=-1.94, IV16=-2.50, IV17=-2.58, IV18=-1.12.
value of level sum of relative subtypes and the correct value of case history. Diagnosing type of disease is BLHS ,the degree is –3.22(Low) and 1.00(High) 6. Printing out the recipe(omitting) using table 5 (omitting) and offering the explanation about type of disease for patient
5
Features of the System
(1) Automatically analyzing and diagnosing the type of disease in BVS and computing the degree value of disease corresponding to the type. (2) Automatically printing out the diagnosis result, treatment measures, suggestions and related explanation for the patient. (3) Automatically saving the case history, supporting searches in further consultations with a doctor, realizing learning from examples, and realizing dynamic clinical observation as well as adding, subtracting, adjusting and updating the original treatment measures. (4) Realizing analysis and diagnosis of BVS by combining traditional Chinese medicine and Western medicine. (5) Formulating the experience of experts via the rough-fuzzy and fuzzy-rough approach in the theory of system design. (6) Dividing the system into 4 levels and computing the related data of each level dynamically in the art of programming. In summary, the design of the medical skill in the system is original, and the system uses approximate reasoning based on granular computing in its programming. Acknowledgements. This study is supported by the State Natural Science Fund (#60173054) and the Natural Science Fund of Jiangxi Province. Thanks are due to Professor Guoxian Li for offering the original medical skill.
References
1. Guoxian Li, Hypothesis on Blood Viscosity Syndrome (Clinical Blood Viscosity Syndrome), International Workshop of Clinical Blood Viscosity Syndrome, USA, 1995.
2. Qing Liu and Qun Liu, Approximate Reasoning Based on Granular Computing in Granular Logic, Proceedings of ICMLS 2002, Nov. 4–6, 2002.
3. A. Skowron and J. Stepaniuk, Extracting Patterns Using Information Granules, Proceedings of the International Workshop on Rough Set Theory and Granular Computing (RSTGC 2001), Bulletin of the International Rough Set Society, Vol. 5, No. 1/2, May 20–22, 2001, 135–142.
4. Qing Liu, Neighborhood Logic and Its Data Reasoning in Information Table of Neighborhood Values, Journal of Computer (in Chinese), Vol. 24, No. 4, 2001.
Granular Reasoning Using Zooming In & Out
Part 1. Propositional Reasoning (An Extended Abstract)
T. Murai1, G. Resconi2, M. Nakata3, and Y. Sato1
1 Graduate School of Engineering, Hokkaido University, Sapporo 060-8628, Japan, {murahiko, ysato}@main.eng.hokudai.ac.jp
2 Dipartimento di Matematica, Università Cattolica, 25128 Brescia, Italy, [email protected]
3 Faculty of Management & Information Sciences, Josai International University, Togane, Chiba 283-8555, Japan, [email protected]
Abstract. The concept of granular computing is applied to propositional reasoning. Such a kind of reasoning is called granular reasoning in this paper. For this purpose, two operations called zooming in & out are introduced to reconstruct granules of possible worlds. Keywords: Granular reasoning, granular computing, zooming in & out, filtration.
1
Introduction
Recently, much attention has been paid to rough set theory [6,7] and granular computing [2,9]. Using filtration in modal logic [1], we showed in [3] a possible way of granular reasoning, which means a mechanism of reasoning using granular computing. This paper, as its next step, aims to formulate some aspects of granularity in propositional reasoning. For this purpose, we use the two operations called zooming in & out, proposed in [4], to give a logical foundation to local and global worlds in semantic fields [8].
2
Granules of Possible Worlds
Brief review of Kripke semantics (cf. [1]). Given a countably infinite set of atomic sentences P, a language LBL(P) for the propositional logic of belief is formed as the least set of sentences from P with the well-known set of connectives and a modal operator B (belief), by the usual formation rules. A sentence is non-modal when it does not contain any occurrence of B. A Kripke model is a tuple M = ⟨W, R, V⟩, where W is a non-empty set of possible worlds, R is a binary relation on W, and V is a valuation for every atomic sentence p at every world x. Define M, x |= p iff V(p, x) = 1. The relationship |= is extended to every compound sentence in the usual way. The truth set of p in M is defined as ∥p∥M = {x ∈ W | M, x |= p}.
Filtration and granules of possible worlds. Given P, we define the set UP of possible worlds as the power set of P: UP = 2^P. We call any subset of P an elementary world. Then, for an atomic sentence p in P and an elementary world x in UP, a valuation VP is naturally defined by VP(p, x) = 1 iff p ∈ x. When a binary relation R is given on UP, we have a Kripke model MP = ⟨UP, R, VP⟩. When we are concerned only with finitely many sentences, the set UP is, in general, too huge for us. Hence we need some way of granularizing UP. Our proposal in a series of papers [3,4,5] is to make a quotient set whose elements we regard as granules of possible worlds. Suppose we are concerned with a set Γ of non-modal sentences. Let PΓ = P ∩ sub(Γ), where sub(Γ) is the union of the sets of subsentences of each sentence in Γ. Then we can define an agreement relation ∼Γ by x ∼Γ y iff ∀p ∈ PΓ [V(p, x) = V(p, y)]. The relation is an equivalence relation and induces the quotient set UΓ = UP/∼Γ. Its elements are regarded as the granules of possible worlds under Γ. A new valuation is given by VΓ(p, X) = 1 iff p ∈ ∩X, for p in PΓ and X in UΓ. There is a bijection ψ : UΓ → 2^PΓ, defined by ψ(X) = ∩X. According to [1], when a relation R is given on UP, assume we have an accessibility relation R′ on UΓ satisfying (1) if xRy then [x]∼Γ R′ [y]∼Γ, (2) if [x]∼Γ R′ [y]∼Γ then M, x |= Bp ⇒ M, y |= p, for every sentence Bp in Γ, and (3) if [x]∼Γ R′ [y]∼Γ then M, y |= p ⇒ M, x |= ¬B¬p, for every sentence ¬B¬p in Γ; then the model MΓ^R′ = ⟨UΓ, R′, VΓ⟩ is called a filtration through sub(Γ). Note that R′, and thus MΓ^R′, is not uniquely determined by the above conditions.
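A minimal sketch of the agreement relation follows, with elementary worlds represented as frozensets of atoms (an illustrative encoding); each equivalence class is keyed by its ψ-image ∩X ∈ 2^PΓ.

```python
from itertools import combinations

# Quotient U_Γ = U_P / ~Γ: worlds agree under Γ when they contain the same
# atoms of P_Γ; each granule is keyed by ψ(X) = ∩X, a subset of P_Γ.
def worlds(P):
    P = list(P)
    return [frozenset(c) for r in range(len(P) + 1) for c in combinations(P, r)]

def quotient(P, P_gamma):
    classes = {}
    for x in worlds(P):
        classes.setdefault(frozenset(x & P_gamma), set()).add(x)
    return classes

U = quotient({"p", "q", "r"}, frozenset({"p", "q"}))
assert len(U) == 4                            # |U_Γ| = 2^{|P_Γ|}
assert all(len(X) == 2 for X in U.values())   # each granule: 2^{|P − P_Γ|} worlds
```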
3
Zooming In & Out
In [4], we introduced two operations of zooming in & out on sets of worlds. For our purpose here, we need to extend these two operations of zooming in & out to models, to describe propositional reasoning as a dynamic process.
Zooming in & out on sets of worlds. A set Γ of non-modal sentences we are concerned with at a given time is called a focus, and its elements focal sentences at that time. When we move our viewpoint from one focus to another along time, we must reconstruct the set of granularized possible worlds. Let Γ be the current focus and let ∆ be the next focus we will move to. First we consider the following two simpler, nested cases. (a) When PΓ ⊇ P∆, we need granularization, which is a mapping I_Γ^∆ : UΓ → U∆, where, for any X in UΓ, I_Γ^∆(X) =df {x ∈ UP | x ∩ P∆ = (∩X) ∩ P∆}. We call this mapping a zooming in from Γ to ∆. (b) When PΓ ⊆ P∆, we need the inverse operation of granularization, which is a mapping O_Γ^∆ : UΓ → 2^U∆, where O_Γ^∆(X) =df {Y ∈ U∆ | (∩Y) ∩ PΓ = ∩X}. We call this mapping a zooming out from Γ to ∆. Next we consider non-nested cases. When the two sets Γ and ∆ are not nested, the movement from Γ to ∆ can be represented by the combination of zooming out and in, as I_{Γ∪∆}^∆ ◦ O_Γ^{Γ∪∆} : UΓ → 2^U∆.
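Using the ψ-representation from above (granules as subsets of the focus's atoms), the two operations reduce to a few lines; this is a sketch under that illustrative encoding, not the paper's own code.

```python
from itertools import combinations

def zoom_in(X, P_delta):            # I_Γ^Δ, requires P_Δ ⊆ P_Γ
    return frozenset(X & P_delta)   # restrict the granule to the smaller focus

def zoom_out(X, P_gamma, P_delta):  # O_Γ^Δ, requires P_Γ ⊆ P_Δ
    extra = list(P_delta - P_gamma) # all refinements agreeing with X on P_Γ
    return {frozenset(set(X) | set(c))
            for r in range(len(extra) + 1) for c in combinations(extra, r)}

X = frozenset({"p"})                # granule "p true, q false" for Γ = {p, q}
assert zoom_in(X, frozenset({"p"})) == frozenset({"p"})
assert zoom_out(X, frozenset({"p", "q"}), frozenset({"p", "q", "r"})) == \
       {frozenset({"p"}), frozenset({"p", "r"})}
```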
Extending zooming in & out to models. We extend the two operations of zooming in & out so that they can be applied to models. Again let Γ be the current focus and let ∆ be the next focus we will move to. Given a model MΓ = ⟨UΓ, RΓ, VΓ⟩, we can define B_MΓ(X) =df {X′ ∈ UΓ | X RΓ X′}, for X in UΓ. We abbreviate it to B_MΓ when B_MΓ(X) = B_MΓ(X′) for any X, X′ in UΓ.
(a) When Γ ⊇ ∆, a zooming in from Γ to ∆ is extended as I_Γ^∆(MΓ) =df ⟨U∆, R∆, V∆⟩, where Yi R∆ Yj iff Yj ∈ I_Γ^∆(∪{B_MΓ(X) | X ∈ UΓ and I_Γ^∆(X) = Yi}). We call I_Γ^∆(MΓ) a zooming in of MΓ.
(b) When Γ ⊆ ∆, a zooming out from Γ to ∆ is extended as O_Γ^∆(MΓ) =df ⟨U∆, R∆, V∆⟩, where Yi R∆ Yj iff Yj ∈ O_Γ^∆(B_MΓ((O_Γ^∆)^{−1}(Yi))). We call O_Γ^∆(MΓ) a zooming out of MΓ.
(c) Non-nested cases are described using the following merging of two models MΓ and M∆. When PΓ = P∆, their merging MΓ ◦ M∆ is ⟨UΓ, R, VΓ⟩, where Xi R Xj iff Xj ∈ B_MΓ(Xi) ∩ B_M∆(Xi). The merging ◦ is extended to the cases where PΓ ≠ P∆: if PΓ ⊃ P∆, then MΓ ◦ M∆ =df MΓ ◦ O_∆^Γ(M∆); else if PΓ ⊂ P∆, then MΓ ◦ M∆ =df O_Γ^∆(MΓ) ◦ M∆; else MΓ ◦ M∆ =df O_Γ^{Γ∪∆}(MΓ) ◦ O_∆^{Γ∪∆}(M∆). The last case of merging is used for the non-nested cases.
4
Granular Reasoning as a Dynamic Process Using Zooming In & Out
According to Chellas [1], the general form of the rule of inference in propositional logic is RPL: p1, ..., pn / p, where p is a tautological consequence of p1, ..., pn. Let Γ = {p1, ..., pn}. Define M_P^Γ =df ⟨UP, R, VP⟩, where B_{M_P^Γ} = ∩_{i=1}^n ∥pi∥^{MP}, by which we can recover an accessibility relation by x R x′ iff x′ ∈ B_{M_P^Γ}. The necessary and sufficient condition for a sentence p to be a conclusion of RPL is of course that ((p1 ∧ ... ∧ pn) → p) is a tautology in PL, which means B_{M_P^Γ} = ∩_{i=1}^n ∥pi∥^{MP} ⊆ ∥p∥^{MP}. Thus p is a conclusion of RPL iff M_P^Γ |= Bp. Hence, a sentence p is a conclusion of RPL just in case p is believed in the model. Unfortunately, we do not have a finitary way of checking this, because truth sets are, in general, infinite. Next, we make a zooming in MΓ =df I_P^Γ(M_P^Γ) = ⟨UΓ, RΓ, VΓ⟩. For a given sentence p, knowing whether p is a conclusion of RPL or not is reduced to checking whether ∥p∥^{MΓ} contains B_{MΓ} or not. Every truth set is now finite, hence we can check this in a finite number of steps. Finally, we can describe propositional reasoning as a dynamic process using zooming in & out. Consider the problem whether a sentence p is a conclusion of the premises p1, ..., pn.
Step 1: Zoom in from M_P^{{p1}} to M_{{p1}} = I_P^{{p1}}(M_P^{{p1}}) and let M ← M_{{p1}}.
Step 2: Zoom in from M_P^{{p2}} to M_{{p2}} = I_P^{{p2}}(M_P^{{p2}}) and let M ← M ◦ M_{{p2}}.
(Now M = O_{{p1}}^{{p1,p2}}(I_P^{{p1}}(M_P^{{p1}})) ◦ O_{{p2}}^{{p1,p2}}(I_P^{{p2}}(M_P^{{p2}})).)
······
Step n: Zoom in from M_P^{{pn}} to M_{{pn}} = I_P^{{pn}}(M_P^{{pn}}) and let M ← M ◦ M_{{pn}}.
(Now M = O_{Γ∖{pn}}^Γ(M_{Γ∖{pn}}) ◦ O_{{pn}}^Γ(M_{{pn}}).)
Step n+1: If Pp ⊄ PΓ then zoom out to M_{Γ∪{p}} and let M ← M_{Γ∪{p}}.
Step n+2: Zoom in from M to M_{{p}} = I_Γ^{{p}}(M).
Step n+3: Check whether p is believed or not in M_{{p}}, that is, whether ∥p∥^{M_{{p}}} contains B_{M_{{p}}} or not.
Thus, we can find M_{{p}} = I_Γ^{{p}}(O_{Γ∖{pn}}^Γ(M_{Γ∖{pn}}) ◦ O_{{pn}}^Γ(M_{{pn}})) or M_{{p}} = I_{Γ∪{p}}^{{p}}(O_Γ^{Γ∪{p}}(O_{Γ∖{pn}}^Γ(M_{Γ∖{pn}}) ◦ O_{{pn}}^Γ(M_{{pn}}))).
5
Concluding Remarks
In this paper, we described an aspect of granularity in propositional reasoning as a dynamic process using the two operations of zooming in & out. Each step in the reasoning process contains only those possible worlds, as granules, that are needed to perform the step. In a forthcoming paper [5], we will apply these operations to Aristotle's categorical syllogism. Acknowledgments. The first author was partially supported by Grant-in-Aid No. 14380171 for Scientific Research (B) of the Japan Society for the Promotion of Science.
References
1. Chellas, B.F. (1980): Modal Logic: An Introduction. Cambridge University Press.
2. Lin, T.Y. (1998): Granular Computing on Binary Relations, I & II. L. Polkowski et al. (eds.), Rough Sets in Knowledge Discovery 1: Methodology and Applications, Physica-Verlag, 107–121, 122–140.
3. Murai, T., M. Nakata, Y. Sato (2001): A Note on Filtration and Granular Reasoning. T. Terano et al. (eds.), New Frontiers in Artificial Intelligence, LNAI 2253, Springer, 385–389.
4. Murai, T., G. Resconi, M. Nakata, Y. Sato (2002): Operations of Zooming In & Out on Possible Worlds for Semantic Fields. E. Damiani et al. (eds.), Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies, IOS Press, 1083–1087.
5. Murai, T., G. Resconi, M. Nakata, Y. Sato (2003): Granular Reasoning Using Zooming In & Out: Part 2. Aristotle's Categorical Syllogism. Proceedings of the International Workshop on Rough Sets in Knowledge Discovery, to appear.
6. Pawlak, Z. (1982): Rough Sets. Int. J. Computer and Information Sciences, 11, 341–356.
7. Pawlak, Z. (1991): Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer.
8. Resconi, G., T. Murai, M. Shimbo (2000): Field Theory and Modal Logic by Semantic Fields to Make Uncertainty Emerge from Information. Int. J. of General Systems, 29(5), 737–782.
9. Skowron, A. (2001): Toward Intelligent Systems: Calculi of Information Granules. T. Terano et al. (eds.), New Frontiers in Artificial Intelligence, LNAI 2253, Springer, 251–260.
A Pure Mereological Approach to Roughness
Bo Chen and Mingtian Zhou
Microcomputer Institute, School of Computer Science & Engineering, University of Electronic Science & Technology of China, Chengdu, 610054
[email protected], [email protected]
Abstract. The present paper aims to establish an approach to roughness based on mereological relations between information granules rather than common set-theoretic relations. With the Atomic Granule, the minimum information unit of an information system is encapsulated as a triple encoding the semantics "an entity has the relevant attribute of specified value". Based on it, a granular calculus is built up. Using the granular calculus, lower and upper granule approximations are defined according to the underlying mereological relations, achieving roughness over the granular representation of an Information System. Keywords. Granular Calculus, Mereological Relation, Granular Rough Theory
1 Introduction Lesniewski’s Mereology is a formal theory dealing with the “part-to-whole” relation between individual objects [1, 2]. Its full capabilities to be an alternative to Set Theory inspire us to build up a system for roughness over pure mereological relation among information granules of an information system.
2 A Granular Calculus for Information System An Information System (IS) I = (U, A) is the fundamental notion in Rough Theory, which is often given as an Information Table to represent the universe U of discourse, describing entities in U by their aspects qualified by the attribute set A. Each tabular data cell contains a value v for an attribute c of a specific entity u, expressing the finest unit of significant information in an IS that “u has the attribute c with value v”, formalized as a triple (u, c, v). We name such a triple as Atomic Granule, denoted by ξ (u , c, v) . Any tabular representation of IS can be transformed into collection of individual atomic granules corresponding to the data cells. Over atomic granules, we define Aggregation operation to synthesize more complex granule, namely Compound Granule, Θ . Aggregation does nothing but to put things together, which uses parentheses to mark multiple ingredient granules as a whole and colons for delimitation purpose. Aggregation of atomic granules: (ξ1 , ξ 2 , ..., ξ n ) ≡ def (ξ1 : ξ 2 : ... : ξ n ) G. Wang et al. (Eds.): RSFDGrC 2003, LNAI 2639, pp. 425–429, 2003. © Springer-Verlag Berlin Heidelberg 2003
(1)
For compound granules, two similar operations can be defined: Aggregation and Fusion. Aggregation over compound granules has the same sense as over atomic granules, in that it puts separate compound granules into a new compound granule. Fusion of compound granules generates a new compound granule embodying all the atomic granules constituting each of the operands. Aggregation of compound granules:
(Θ1, Θ2, ..., Θn) ≡def (Θ1 : Θ2 : ... : Θn)
(2)
Fusion of two compound granules Θa = (ξa1 : ξa2 : ... : ξam) and Θb = (ξb1 : ξb2 : ... : ξbn):
Θa ⊕ Θb ≡def (ξa1 : ξa2 : ... : ξam : ξb1 : ξb2 : ... : ξbn)
(3)
Two identical atomic granules aggregate into a compound granule with only one atomic granule, viz. (ξ1, ξ1) = (ξ1), referred to as a Trivial Compound Granule. According to their ingredients, compound granules can also be classified into two categories: Plain Compound Granules and Higher Order Compound Granules. Granules of the former type have only atomic ingredients, whereas those of the latter have only non-trivial ones. Such a distinction is defined to encode the structural information of parts. An internal operation, Self-Fusion (ι), is applied to a higher order compound granule to transform it into a plain compound granule comprising all the atomic granules playing the sub-part role in the higher order compound granule. Three cases of mereological relations exist between two granules: part-of (P), partial overlap (PO) and disjoint (DR). Recognition of these cases is based on the commonality of the self-fused ingredient atomic granules in either compound granule. Two granules are disjoint when they have no common atomic ingredients; one granule is part-of the other when all its atomic ingredients are common to both. When both cases fail, they partially overlap each other. In order to find the overlapped part of multiple granules, we define the Wrap operation. Wrap of two compound granules:
Θa ◊ Θb ≡def (ξc1, ξc2, ..., ξcl), where P(ξci, Θa) ∧ P(ξci, Θb)
(4)
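A minimal sketch of this calculus follows, under the illustrative assumption that a compound granule is represented by the set of its self-fused atomic ingredients (triples); with that encoding, Fusion (3), Wrap (4) and the three mereological cases reduce to set operations.

```python
# Granules as sets of atomic triples (u, c, v); the mereological case is
# decided by the common atomic ingredients, as described above.
def fusion(a, b):
    return a | b                      # formula (3)

def wrap(a, b):
    return a & b                      # formula (4): the overlapped part

def mereo_relation(a, b):
    common = a & b
    if not common:
        return "DR"                   # disjoint
    if common == a or common == b:
        return "P"                    # one is part-of the other
    return "PO"                       # partial overlap

g1 = {("u1", "c1", 0), ("u1", "c2", 1)}   # an aspect granule of u1
g2 = {("u1", "c1", 0), ("u3", "c1", 0)}   # a cluster granule of (c1, 0)
assert mereo_relation(g1, g2) == "PO" and wrap(g1, g2) == {("u1", "c1", 0)}
```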
When special relationships exist amongst the three elements of the ingredient triples of a compound granule, it can suggest particular connotations. When the entity identity of every ingredient is identical, the granule stands for different aspects of a given entity and is named an Aspect Granule, Ψ(ui, (cj1, cj2, ..., cjn)); identical attribute types with identical values mean the triples are indiscernible and can be clustered into an equivalent collection, called a pre-Cluster Granule; when the attribute values are identical so as to form an equivalence class, i.e., all the triples with the same attribute value are ingredients of such a compound granule, it is referred to as a Cluster Granule, Ξ(cj, vt). So far these special types of compound granules only represent a single column or row in the Information Table. When compound granules are further aggregated into more intricate ones, a very significant compound granule can be generated which has aspect granules as its ingredients, each of which takes the same set of attribute types with equal valuation. Such a granule leads to the formulation of an equivalent collection or class of entities based on the given set of attribute types and valuation, expressing the meaning that these information granules are indiscernible over multiple aspects. When it is an equivalence class, it is termed an
Aspect Cluster Granule, ℑ((cj1, cj2, ..., cjn), (v1, v2, ..., vn)). It is obvious that cluster granules and aspect granules are both degenerate cases of aspect cluster granules. Any aspect cluster granule has ((ui1, ui2, ..., uin), (cj1, cj2, ..., cjn), (v1, v2, ..., vn)) as its abbreviated written form. Another class of internal operations, Convergence, can convert plain compound granules to higher order ones according to a given internal structural specification. Horizontal Convergence (τ), Vertical Convergence (υ) and Full Convergence (ρ) aggregate the internal atomic granules of a plain compound granule into all possible maximal aspect granules, cluster granules and aspect cluster granules, respectively. Given two aspect cluster granules, each describing a class of entities with respect to its respective aspects, we define the Merge (⊗) operation to extract the class of entities satisfying the aspect specifications in both granules. This operation is performed in three steps, for Ξ(cj, vj) ⊗ Ξ(ck, vk) for instance: 1) Fusion: Ξ(cj, vj) ⊕ Ξ(ck, vk) = Θ; 2) Full Convergence: ρ(Θ); 3) Keep ℑ((cj, ck), (vj, vk)) as the result. By nature, an information granule encapsulates extensional properties of a set of entities, which produce intensional connections between different granules. From a mereological viewpoint, granules depicting different attributes of the same entities can be disjoint, whereas the intensional connection exists to make our granular calculus more semantically viable for representing intricate ISs with topological issues. Here we only define a Shift (δAj) operation based on the intensional connection. An aspect cluster granule ℑ((cj1, cj2, ..., cjn), (v1, v2, ..., vn)) represents a class of entities indiscernible with respect to the relevant attribute collection, but if we investigate the entities involved for another attribute collection Aj = (cj′1, cj′2, ..., cj′n), it would be useful to know how the same class of entities splits into different classes under Aj. The operation δAj(ℑ) aggregates all atomic granules about attributes in Aj for the entities from ℑ; then it applies full convergence to construct the internal aspect cluster granules of Aj.
3 Roughness over Granular Representation
A crucial semantics of roughness is to present a way of approximately representing a class of entities characterized by some aspects in terms of entity collections described by other aspects; granularly speaking, "to roughly approximate one aspect cluster granule by some other aspect cluster granules". Given a conditional attribute collection B, according to all of its possible valuations, we can aggregate a series of aspect cluster granules relevant to B in the IS. Let $\Im_k(B, V_{j_k})$ be an arbitrary aspect cluster granule in the series. By $\delta_d(\Im_k(B, V_{j_k}))$, we shift $\Im_k$ to the decisional attribute d and obtain $\Theta_k = (\Xi_1(d, v_1) : \Xi_2(d, v_2) : \ldots : \Xi_t(d, v_t))$, which includes all decisional cluster granules for the entities related to $\Im_k$. According
to the mereological relation between $\Theta_k$ and a specified decisional cluster granule $\Xi(d, d_i)$, $\Im_k$ can be classified into three classes:
- If $\Theta_k$ is part of $\Xi(d, d_i)$, viz. $P(\Theta_k, \Xi(d, d_i))$, $\Im_k$ is called a Regular B-Granule with respect to $d_i$, denoted by $\Re_B(d_i)$.
- If $\Theta_k$ partially overlaps $\Xi(d, d_i)$, viz. $PO(\Theta_k, \Xi(d, d_i))$, $\Im_k$ is called an Irregular B-Granule with respect to $d_i$, denoted by $\hat{\Re}_B(d_i)$.
- If $\Theta_k$ is disjoint from $\Xi(d, d_i)$, viz. $DR(\Theta_k, \Xi(d, d_i))$, $\Im_k$ is called Irrelevant with respect to $d_i$.
Analogous to their set-theoretic counterparts [3], we can define the notions of lower and upper granule approximation as below:
- Definition 1. The compound granule aggregated from all the regular B-granules with respect to $d_i$, $\Re_B(d_i)$, is called the Kernel Granule of $\Xi(d, d_i)$, written $\Lambda_B(d_i)$, and is interpreted as the lower approximation of the decisional granule.
- Definition 2. The compound granule aggregated from all the irregular B-granules with respect to $d_i$, $\hat{\Re}_B(d_i)$, is called the Hull Granule of $\Xi(d, d_i)$, written $\Delta_B(d_i)$, standing for the boundary between the lower and upper approximations.
- Definition 3. The Kernel and Hull Granules of $\Xi(d, d_i)$ together aggregate the upper approximation of $\Xi(d, d_i)$, which we call the Corpus Granule, written $\Omega_B(d_i)$.
Quantitative aspects like the rough membership of rough set theory can be defined by employing the notion of granule cardinality; these are left as open issues.
Table 1. An arbitrary example information system I* in tabular form. For demonstration purposes we leave attribute reduction issues open; the numbers in the table stand for specific discrete values of the corresponding conditional attribute $c_i$, and the same numbers in different columns have different meanings in their concrete semantic contexts
        c1   c2   c3   c4   c5    d
  u1     0    1    0    2    4   d1
  u2     1    0    1    2    3   d2
  u3     0    1    0    2    3   d1
  u4     2    0    2    1    2   d2
  u5     2    1    1    1    4   d1
  u6     2    1    1    0    1   d1
  u7     1    0    2    0    0   d1
  u8     0    1    1    2    3   d3
For I*, let $B = (c_1, c_2)$, $d_i = d_1$. $\Xi(d, d_1) = ((u_1, u_3, u_5, u_6, u_7), d, d_1)$. $\Im_1(B, (0,1)) = ((u_1, u_3, u_8), B, (0,1))$, $\Im_2(B, (1,0)) = ((u_2, u_7), B, (1,0))$, $\Im_3(B, (2,0)) = ((u_4), B, (2,0))$, $\Im_4(B, (2,1)) = ((u_5, u_6), B, (2,1))$. For each of them, by the shift operation, the resulting
decisional granules are: $\Theta_1 = \delta_d(\Im_1) = (((u_1, u_3), d, d_1) : (u_8, d, d_3))$, $\Theta_2 = \delta_d(\Im_2) = ((u_2, d, d_2) : (u_7, d, d_1))$, $\Theta_3 = \delta_d(\Im_3) = (u_4, d, d_2)$, $\Theta_4 = \delta_d(\Im_4) = ((u_5, u_6), d, d_1)$. The mereological relations between each of the shift results and $\Xi(d, d_1)$ are: $PO(\Theta_1, \Xi(d, d_1))$, $PO(\Theta_2, \Xi(d, d_1))$, $DR(\Theta_3, \Xi(d, d_1))$ and $P(\Theta_4, \Xi(d, d_1))$. Then $\Re_B(d_1) = \Im_4(B, (2,1))$, $\hat{\Re}_{B_1}(d_1) = \Im_1(B, (0,1))$, $\hat{\Re}_{B_2}(d_1) = \Im_2(B, (1,0))$; $\Im_3(B, (2,0))$ is irrelevant. Finally, the Kernel Granule of $\Xi(d, d_1)$: $\Lambda_B(d_1) = (\Re_B(d_1)) = \Re_B(d_1)$; the Hull Granule: $\Delta_B(d_1) = (\hat{\Re}_{B_1}(d_1) : \hat{\Re}_{B_2}(d_1))$; the Corpus Granule: $\Omega_B(d_1) = (\Lambda_B(d_1) : \Delta_B(d_1))$.
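The worked example above can be reproduced with a short script. The sketch below (our own naming, under the same set reading of granules as earlier) partitions Table 1 on B = (c1, c2) and classifies each B-granule against the decision class Ξ(d, d1):

from collections import defaultdict

# Table 1, rows (c1..c5, d) keyed by entity.
table = {
    "u1": (0, 1, 0, 2, 4, "d1"), "u2": (1, 0, 1, 2, 3, "d2"),
    "u3": (0, 1, 0, 2, 3, "d1"), "u4": (2, 0, 2, 1, 2, "d2"),
    "u5": (2, 1, 1, 1, 4, "d1"), "u6": (2, 1, 1, 0, 1, "d1"),
    "u7": (1, 0, 2, 0, 0, "d1"), "u8": (0, 1, 1, 2, 3, "d3"),
}
B = (0, 1)  # column indices of c1, c2
target = {u for u, row in table.items() if row[-1] == "d1"}  # entities of Ξ(d, d1)

# Aspect cluster granules: entities indiscernible on B.
granules = defaultdict(set)
for u, row in table.items():
    granules[tuple(row[i] for i in B)].add(u)

kernel, hull = set(), set()
for val, ents in granules.items():
    common = ents & target
    if common == ents:        # P: regular B-granule
        kernel |= ents
    elif common:              # PO: irregular B-granule
        hull |= ents
    # else DR: irrelevant granule
corpus = kernel | hull
print(kernel)   # {'u5', 'u6'}                       -> Λ_B(d1)
print(hull)     # {'u1', 'u3', 'u8', 'u2', 'u7'}     -> ∆_B(d1)
print(corpus)   # upper approximation                -> Ω_B(d1)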
4 Related Works and Conclusions
As far as the Adaptive Calculi of Information Granules [4] of A. Skowron, L. Polkowski, et al. are concerned, they are defined on the basis of Rough Mereology, whereas we use the notion of granule as the primitive for a mereological interpretation and representation of roughness. Another active direction of granular computing in rough set theory is due to Y.Y. Yao et al. [5]. Their approach is mainly concerned with information granules in Rough Set Theory, adhering to the canonical definition of fuzzy granular computing from Zadeh. In common approaches, a triple-form information granule is a second-order object that exists only on the ontological level to commit to some conception of entities. In our granular calculus, such a second-order object is promoted to a primitive individual available for further operations. By granules, extensional aspects of entities are well bound to the entities they describe, viz. all semantic contexts required in the process of rough approximation are explicitly encapsulated, including entity identity, attribute set and valuation of the attributes. Such explicitness eliminates intertwined external references to semantic contexts when multiple simpler constructs are used to model more complex ones. As the present paper indicates, an alternative approach to roughness via mereological relations among information granules is viable, with intrinsic power of knowledge representation and approximation. By this naïve system, we intend to call for more effort on issues of Mereology, so as to take advantage of its intrinsic relationship with ontology to build brand-new methodologies and theories.
References
1. Luschei, E.C.: The Logical Systems of Leśniewski. North-Holland Publishing Company, Amsterdam (1962)
2. Tarski, A.: Foundations of the Geometry of Solids. In: Logic, Semantics, Metamathematics. Clarendon Press, Oxford (1956) 24-29
3. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11 (1982) 341-356
4. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate Reasoning. International Journal of Approximate Reasoning 15(4) (1996) 333-365
5. Yao, Y.Y.: Information Granulation and Rough Set Approximation. International Journal of Intelligent Systems 16(1) (2001) 87-104
Knowledge Based Descriptive Neural Networks
J.T. Yao
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
[email protected]
Abstract. This paper presents a study of knowledge based descriptive neural networks (DNN). A DNN is a neural network that incorporates rules extracted from trained neural networks. One of the major drawbacks of neural network models is that they cannot explain what they have done. Extracting rules from trained neural networks is one of the solutions. However, little attention has been paid to how to use the extracted rules effectively. This paper addresses effective ways of using these extracted rules. With the introduction of DNN, we not only keep the good feature of nonlinearity in neural networks but also gain an explanation of the underlying reasoning mechanisms, for instance, how a prediction is made.
1 Introduction
Artificial neural networks are computer software that emulates biological neural networks. A neural network model is a learning system made up of simple units configured in a highly interconnected network. Neural networks are normally classified as one of the soft computing techniques. Soft computing is a collection of techniques spanning many fields that fall under various categories in computational intelligence. Soft computing methodologies, including neural networks, genetic algorithms, granular computing, fuzzy sets, rough sets and wavelets, are widely applied in data mining and the knowledge discovery process [8]. Recently, a granular computing model for knowledge discovery and data mining was proposed by combining results from formal concept analysis and granular computing [18]. The essence of soft computing is that, unlike the traditional, hard computing approach, it aims at an accommodation with the pervasive imprecision of the real world. The guiding principle of soft computing is: "exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, low solution cost, and better rapport with reality" [19]. Knowledge discovery in databases, or KDD, has been defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining, as one of the processes in KDD, is the application of data analysis and discovery algorithms that, under acceptable computational efficiency limitations,
produce a particular enumeration of patterns over the data. However, researchers often do not differentiate these two terms. Generally speaking, there are two types of data mining approaches, namely descriptive data mining and predictive data mining. Descriptive data mining explores interesting patterns to describe the data, while predictive data mining forecasts the behavior of the model based on the available data set. Due to the black box nature of neural networks, they are sometimes not classified as data mining tools for discovering interesting and understandable patterns in the sense of the data mining definition. Information or knowledge embedded in trained neural networks is hard for human beings to verify or interpret. There are two conventional types of forecasting models, namely qualitative and quantitative approaches. Qualitative methods are incapable of separating an individual's biases from objective facts. Quantitative approaches are weak in applications with nonlinearity in the data set. Artificial neural networks can learn from examples and exhibit some capability for generalization beyond the training data. Indeed, they have been used in such diverse applications as handwriting recognition, medical diagnosis, exchange rate prediction, stock market assessment and prediction, and many more. For instance, they have been shown [11, 16] to outperform traditional forecasting models on numerous business classification and prediction problems. In addition, neural networks have been shown mathematically to be universal approximators of functions and their derivatives [15]. Hence the potential benefits to various applications may be unlimited. Taking financial forecasting as an example, there are two major problems in the use of neural networks. One is that the underlying structural factors are not static, and thus there is pressure to find the trends quickly while they are valid, as well as to recognize the time when the trends are no longer effective. The second problem is that, while neural networks have proved to be far more effective at forecasting than more conventional linear techniques like regression analysis, their decision processes are not easily understandable in terms of rules that human experts can verify. Businesses are reluctant to invest in forecasting techniques, however effective, if there is not a clearly understandable causal base to the model. In other words, these neural networks operate as "black boxes". In many applications, it is desirable to extract knowledge from trained neural networks for users to gain a better understanding of the problem at hand [12]. If the prediction performed by a neural network could be understood and explained, then there would be much more widespread acceptance of this technology. This would then pave the way for even more applications of this powerful technology. This paper addresses the issue of using the underlying rules discovered by neural networks. It presents an ongoing project of incorporating rules extracted from trained neural networks to form a knowledge based descriptive neural network (DNN). The organization of this article is as follows. The next section gives a brief review of techniques for rule extraction from neural networks. A section discussing the construction of descriptive neural networks follows. Finally, a conclusion section ends the paper.
2 Rule Extraction from Trained Neural Networks
Neural networks achieve high accuracy in classification, prediction and many other applications, as suggested in the literature. As mentioned above, being unable to explain the knowledge embedded in trained neural networks is one of the major drawbacks of this technology. Much attention has been paid to solving this problem by extracting rules from trained neural networks. According to the taxonomy of Tickle et al. [14], neural network rule extraction techniques may be classified along five dimensions, namely: (1) the "expressive power" of the extracted rules (the format or type of rules extracted); (2) the "quality" (accuracy, fidelity, consistency and comprehensibility) of the extracted rules; (3) the "translucency" of the view taken within the rule extraction technique of the underlying network units (i.e., decompositional or pedagogical); (4) the complexity of the algorithms; (5) the portability to network architectures and training regimes. A sixth dimension, the treatment of linguistic (i.e., binary, discretized and continuous) variables, has been considered by Duch et al. [4]. Many researchers have been working on extracting rules from trained networks, as reasoning with logical rules is more acceptable to users than black box systems [1]. Setiono's work [12] is one of the examples addressing this problem and opening the black box of neural networks. Andrews et al. [1] classified techniques for the extraction of symbolic logical rules from neural networks into two groups. In the decompositional approach [13], rules are extracted at the level of individual hidden and output layer units. In the pedagogical approach [1, 12], the network is treated as a 'black box'; extracted rules describe global relationships between inputs and outputs, and no analysis of the detailed characteristics of the neural network itself is undertaken. The fifth criterion of the taxonomy of Tickle et al. [14] measures the generalization ability of the technique. This dimension can be extended to applications as well. For instance, in the area of financial forecasting with neural networks, the rules should be as simple as possible from the practitioner's point of view. In fact, although there are dozens of technical indicators available, most traders stick to only a couple of indicators. A trader may only look at the 10-day and 30-day moving averages. The rules he uses might be "when the 10-day moving average crosses above the 30-day moving average and both moving averages are in an upward direction, it is time to buy" and "when the 10-day moving average crosses below the 30-day moving average and both moving averages are directed downward, it is time to sell." If the inputs of a neural network include the 10-day and 30-day moving averages and the output is buy or sell, similar rules may be extracted from the trained neural networks. However, neural networks are more complex than this in most cases, and therefore the rules are more complex than the cited examples. In any case, the rules extracted from neural networks should be easily interpreted and transferred into actions by traders. If we only use these discovered rules, we treat neural networks as merely one of many data mining tools. As many researchers have suggested that the neural network technique is one of the best candidates for financial forecasting, we had better use neural networks as the forecasting tool instead of just using the forecasting rules discovered from trained networks.
This research is trying to incorporate discovered rules into neural network models.
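As a hedged sketch of the pedagogical approach (the moving-average feature names and all parameter values are our illustrative assumptions, not the paper's system): a network is trained, then treated as a black box whose predictions are re-described by a shallow decision tree, yielding readable if-then rules:

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical features: 10-day and 30-day moving averages and their slopes.
X = rng.normal(size=(2000, 4))
# Synthetic "buy" target: short MA above long MA, both slopes rising.
y = ((X[:, 0] > X[:, 1]) & (X[:, 2] > 0) & (X[:, 3] > 0)).astype(int)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)

# Pedagogical extraction: query the black box on probe inputs, then
# describe its global input-output behavior with a small surrogate tree.
X_probe = rng.normal(size=(5000, 4))
surrogate = DecisionTreeClassifier(max_depth=3).fit(X_probe, net.predict(X_probe))
print(export_text(surrogate,
                  feature_names=["ma10", "ma30", "ma10_slope", "ma30_slope"]))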
3 Construction of Descriptive Neural Networks
Following the above discussion, we present our proposed descriptive neural network, DNN, in this section. The DNN is a neural network embedded with business rules that have been discovered from previously trained networks. The architecture of a DNN is decided not only by the training examples but also by hidden rules extracted from trained networks. One of the aims of our descriptive technique is to create and use an innovative second layer of rule extraction techniques to explicate the hidden rules from previously trained neural networks. This will enable us to explain the mechanism of a neural network forecasting model.
Fig. 1. Procedures of DNN Construction (figure omitted; labeled components: Data, NN Construction System, NN, Rule Extraction Mechanism, DNN Construction, DNN, New Data, Explainable Prediction Rules)
An additional aim is to be able to extract the factors that actually drive, for instance, business decision-making. The DNN system is to a traditional neural network what econometrics is to regression analysis. One of the advantages of neural networks applied in the forecasting domain is that high forecasting accuracy may be achieved without knowing domain knowledge. Armstrong [2] stated in his principle 10.2 that "let the data speak for themselves" is ill-suited for forecasting. It may not be acceptable to practitioners if only regression or neural network models are used as forecasting tools. Econometrics is the technique of using statistical analysis combined with economic theory to analyze economic data. It presents more economic knowledge than a single mathematical formula and is thus more popular with and acceptable to practitioners. In fact, it is also more accurate than regression models.
There are three steps involved in the construction of the proposed DNN networks, as depicted in Figure 1. The first step is to build a neural network forecasting model with the neural network construction system [17]. The mechanism of this subsystem is like an expert system shell. It assists users in building a forecasting model by analyzing the available data and some domain knowledge. The neural network construction system contains all the manual procedure components involved in the construction of neural networks, such as data preprocessing, input/output selection, sensitivity analysis, data organization, model construction, post analysis and model recommendation. It decides the neural network forecasting models used in a particular financial application domain, performs the tedious trial-and-error training procedures, and relieves the human from most of the analysis jobs. The system also contains a knowledge base that enables an inexperienced user to construct a neural network forecasting model without worrying about technical details such as trading strategies, data sampling, and data partitioning. Self-adapting facilities are included in the model construction system. Time series and related fundamental series (if any) are fed to the system when it is used to develop financial forecasting models, for instance. Different market information is then analyzed by the system to automatically determine the model parameters, network size, data frequency, etc. Statistical and R/S analyses are also conducted by the system to gauge the behavior of markets. To minimize the recency problem, some technical indicators, such as smoothing indicators, are also derived from the original time series. There is an Existing Knowledge Base that initially stores information on human experience from manual training procedures. The system then proceeds to collect all the related information through its own experience, as it keeps track of the performance of each model. After a forecasting model is constructed, the system analyzes and generalizes the information into a new knowledge base called the Current Record. The Existing Knowledge Base is updated according to the new knowledge extracted from the current construction. The knowledge base contains two parts representing two kinds of knowledge, namely general knowledge and case-specific knowledge. A piece of specific knowledge is stored as it is; it simply states the fact that in a certain kind of situation a certain profit was gained. This kind of information is recorded in the Current Record and will be converted into the Existing Knowledge Base. The second step is to extract rules from the trained neural networks. The neural network architecture and weight space are used by the rule extraction mechanism to mine the business rules that govern the forecasting. Many rule extraction techniques, such as decompositional, pedagogical and Boolean rule techniques, have appeared in the literature [5, 12, 13]. We are particularly interested in the business rules that could be applied to financial forecasting in our preliminary study, as little research has been done in this area. We need further study on the specific features of rule extraction for financial forecasting models. If we can explicate the hidden forecasting rules from previously trained neural networks, we will thus be able to explain the mechanism of a neural network forecasting model.
In the third step, the hidden forecasting rules extracted in the previous step are incorporated into the network generated by the neural network construction system to form a descriptive neural network, DNN. Most researchers extract if-then type association rules, as they are more understandable for humans than other
representations [4]. The rules created from neural networks can then be converted into decision trees or used in other expert systems. However, using the rules alone limits the neural network's ability to model nonlinear data, especially in financial applications. A DNN is an artificial neural network with descriptions of the domain knowledge of the applied area, so that not only can predictions be made but the reasons for the predictions can also be explained. DNN is to conventional neural networks what econometrics is to regression. It incorporates uncovered rules, such as propositional rules [3, 9] and fuzzy rules [5], as well as domain knowledge, into traditional neural networks. We expect that DNN networks can make more accurate and explainable predictions. Some issues in the construction of DNN include knowledge base management, architecture enhancement, rule measurement criteria, threshold adjustment, fuzzy representation, etc. Fuzzy neural networks can be considered the simplest DNN. A neural network ensemble can be considered a second type of DNN. A neural network ensemble is a set of separately trained neural networks that are combined to form one unified prediction model [10]. Each trained neural network in the ensemble can serve a different role in modeling. Suppose we use three neural networks in financial forecasting. They are used to forecast different movements, such as the primary trend, secondary movements and daily fluctuations, as in Dow Jones theory. These networks serve as committee members [17] in the forecasting and decision making processes. A more complex DNN incorporates the rules extracted in the previous step. The rules are used in the construction of the neural networks in terms of architecture and weights. The weights to or from the most influential factors are set highest in retraining, and some unimportant factors can be dropped. For instance, if we find that the daily movement has the least influence on long-term forecasting, we can eliminate the node or the neural network that serves as the daily movement component.
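A minimal sketch of the ensemble reading of a DNN (the member functions and weights below are illustrative stand-ins for separately trained networks, not the paper's system): three members model different movement components, and the extracted rules' assessment of each component's influence sets the committee weights.

import numpy as np

def committee_predict(members, weights, x):
    """Weighted committee forecast; each member maps a feature vector
    to a score. A member judged uninfluential by the extracted rules
    can simply be given weight 0.0, i.e. eliminated from the ensemble."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(sum(wi * m(x) for wi, m in zip(w, members)))

# Hypothetical trained members for primary trend, secondary movement,
# and daily fluctuation.
primary   = lambda x: 0.8 * x[0]
secondary = lambda x: 0.5 * x[1]
daily     = lambda x: 0.1 * x[2]

# Suppose the rules found daily movement least influential long-term.
print(committee_predict([primary, secondary, daily], [0.6, 0.4, 0.0],
                        np.array([1.0, 0.5, 2.0])))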
4 Concluding Remarks
Data mining covers most soft computing techniques. Neural networks are widely used in applications such as forecasting, pattern recognition, and classification. Although research has shown that neural networks are more effective and accurate in many areas than, for instance, traditional statistical models, industry people are reluctant to invest in such a good technology due to the lack of explanation of its modeling mechanism. Extracting rules from trained neural networks is one of the solutions to this problem. However, just using the rules extracted from a neural network does not take full advantage of the neural network. This paper addresses the issue of using discovered rules by incorporating them into neural networks. It presents a descriptive neural network model, DNN, that is expected to make more accurate and explainable forecasts. Further research on this topic includes theoretical extension and enhancement of the techniques by integration with other soft computing techniques such as fuzzy logic, rough sets, and genetic algorithms; refining the techniques of DNN construction and rule extraction; and further investigation of the extracted rules in order to take full advantage of them.
References
[1] R. Andrews, J. Diederich, and A. Tickle, "Survey and critique of techniques for extracting rules from trained artificial neural networks", Knowledge Based Systems, 8(6), 373-389, 1995.
[2] J. S. Armstrong, Principles of Forecasting: A Handbook for Researchers and Practitioners, Norwell, MA: Kluwer Academic Publishers, 2001.
[3] G. A. Carpenter and A. W. Tan, "Rule Extraction: From Neural Architecture to Symbolic Representation", Connection Science, 7(1), pp. 3-27, 1995.
[4] W. Duch, R. Adamczak, and K. Grabczewski, "A new methodology of extraction, optimization and application of crisp and fuzzy logical rules", IEEE Trans. Neural Networks, 11(2), 1-31, 2000.
[5] L. M. Fu, "Rule Generation from Neural Networks", IEEE Transactions on Systems, Man and Cybernetics, 28(8), 1114-1124, 1994.
[6] S. Horikawa et al., "On Fuzzy Modeling Using Fuzzy Neural Networks with the Backpropagation Algorithm", IEEE Trans. Neural Networks, 3, 801-806, 1992.
[7] N. Kasabov, "On-line learning, reasoning, rule extraction and aggregation in locally optimised evolving fuzzy neural networks", Neurocomputing, 41(1-4), 25-45, 2001.
[8] S. Mitra, S. K. Pal, and P. Mitra, "Data mining in soft computing framework: A survey", IEEE Trans. Neural Networks, 13, 3-14, 2002.
[9] F. Maire, "A Partial Order for the M-of-N Rule Extraction Algorithm", IEEE Trans. Neural Networks, 8(6), 1542-1544, 1997.
[10] D. Opitz and J. Shavlik, "Actively searching for an effective neural-network ensemble", Connection Science, 8, 337-353, 1996.
[11] A. N. Refenes, A. Zapranis, and G. Francis, "Stock Performance Modeling Using Neural Networks: A Comparative Study with Regression Models", Neural Networks, 5, 961-970, 1994.
[12] R. Setiono, "Extracting M-of-N Rules From Trained Neural Networks", IEEE Trans. Neural Networks, 11(2), 512-519, 2000.
[13] S. Thrun, "Extracting rules from artificial neural networks with distributed representations", in G. Tesauro, D. Touretzky, and T. Leen, eds., Advances in Neural Information Processing Systems (NIPS) 7, Cambridge, MA: MIT Press, pp. 505-512, 1995.
[14] A. Tickle, R. Andrews, M. Golea, and J. Diederich, "The Truth Will Come To Light: Directions and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural Networks", IEEE Trans. Neural Networks, 9(6), 1057-1068, 1998.
[15] H. White, "Learning in Artificial Neural Networks: A Statistical Perspective", Neural Computation, 1, 425-465, 1989.
[16] J. T. Yao and C. L. Tan, "A Case Study on Using Neural Networks to Perform Technical Forecasting of Forex", Neurocomputing, 34(1-4), 79-98, 2000.
[17] J. T. Yao, "Towards a Better Forecasting Model for Economic Indices", The 6th International Conference on Computer Science and Informatics, USA, pp. 299-303, 2002.
[18] Y. Y. Yao, "On Modeling data mining with granular computing", Proceedings of COMPSAC 2001, pp. 638-643, 2001.
[19] L. A. Zadeh, "The roles of fuzzy logic and soft computing in the conception, design and deployment of intelligent systems", in H. S. Nwana and N. Azarmi (eds.), Software Agents and Soft Computing: Towards Enhancing Machine Intelligence, Concepts and Applications, Springer Verlag, 1997, pp. 183-190.
Genetically Optimized Rule-Based Fuzzy Polynomial Neural Networks: Synthesis of Computational Intelligence Technologies
Sung-Kwun Oh¹, James F. Peters², Witold Pedrycz³, and Tae-Chon Ahn¹
¹ Department of Electrical Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea
² Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, R3T 5V6, Canada
³ Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G6, Canada, and Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Abstract. In this study, we introduce the concept of rule-based fuzzy polynomial neural networks (RFPNN), a hybrid modeling architecture combining rule-based fuzzy neural networks (RFNN) and polynomial neural networks (PNN), and we discuss their comprehensive design methodology. The development of the RFPNN dwells on the technologies of Computational Intelligence (CI), namely fuzzy sets, neural networks, and genetic algorithms. The architecture of the RFPNN results from a synergistic usage of RFNN and PNN. The RFNN contributes to the formation of the premise part of the rule-based structure of the RFPNN; the consequence part of the RFPNN is designed using PNN. We discuss two kinds of RFPNN architectures and propose a comprehensive learning algorithm. In particular, it is shown that this network exhibits a dynamic structure. The experimental results include well-known software data such as the Medical Imaging System (MIS) dataset.
Keywords: Rule-based fuzzy polynomial neural networks (RFPNN), rule-based fuzzy neural networks (RFNN), polynomial neural networks (PNN), computational intelligence (CI), genetic algorithms (GAs), design methodology
1 Introduction
Empirical studies in software engineering employ experimental data to gain insight into the software development process and assess its quality. Data concerning software products and software processes are crucial to their better understanding and, in the sequel, to developing effective ways of producing high-quality software. However, data have no meaning per se; their semantics arises in the context of a conceptual model of the phenomenon under study [1]. The mechanism describing the software development process is not understood sufficiently well, or becomes too complicated, to allow an exact model to be postulated from theory. Accordingly, bearing these in mind, we are vitally interested in the development of adaptive and highly nonlinear models that are capable of handling the efficacies of software processes. This calls for models anchored in the framework of fuzzy sets and neural networks. In particular, we are concerned with a hybridization of these two technologies, giving rise to so-called fuzzy neural network systems. Moreover, efficient modeling techniques should allow for a selection of pertinent variables and a formation of highly representative datasets. The models should be able to take advantage of the existing domain knowledge (such as the prior experience of human observers or operators) and augment (calibrate) it with available numeric data to form a coherent data-knowledge modeling entity. Lately, the omnipresent modeling tendency is the one that exploits techniques of Computational Intelligence (CI) by embracing fuzzy modeling, neurocomputing, and genetic optimization [2-6, 22-24]. In this study, we develop a hybrid modeling architecture, called Rule-based Fuzzy Polynomial Neural Networks (RFPNN). In a nutshell, RFPNN is composed of two main substructures, namely a rule-based fuzzy neural network (RFNN) and a polynomial neural network (PNN). The role of the RFNN, based on fuzzy inference and the Back-Propagation (BP) algorithm, is to interact with input data and granulate the corresponding input spaces (viz. converting the numeric data into representations at the level of fuzzy sets). In this network, the number of layers and the number of nodes in each layer are not predetermined (unlike in the case of most neural networks) but can be generated in a dynamic fashion. To assess the performance of the proposed model, we exploit the well-known Medical Imaging System (MIS) data [16,17], widely used in software engineering. The MIS data have provided a basis for numerous studies of software change prediction [18-21].
2 The Architecture and Development of RFPNN
In this section, we elaborate on the architecture and the design process of the RFPNN. These networks emerge as a synergy between two other general constructs, such as RFNN and PNN [6].
2.1 Rule-Based Fuzzy Neural Networks
The premise part of the RFPNN is constructed with the aid of Rule-based Fuzzy Neural Networks (RFNN). Let us consider an extension of the network by considering the fuzzy partition realized in terms of fuzzy relations. The fuzzy partitions formed for all the variables lead us to the topology visualized in Fig. 1. The RFNN structure shows one possible connection point with the rest of the model for combination with the PNN. The "circles" denote units of the RFNN; the neuron denoted by "Π" realizes a Cartesian product, and the outputs of these neurons are taken as the product of all the incoming signals. The "N" identifies a normalization procedure applied to the outputs taken as a product of membership grades. The "Σ" neuron is described by a linear sum.
The learning algorithm in the RFNN is realized by adjusting the connection weights $w_i$ of the neurons, and as such it follows a standard Back-Propagation (BP) algorithm. In this study, we use the Euclidean error (1) as a performance measure:
$$E_p = (y_p - \hat{y}_p)^2 \qquad (1)$$
where $E_p$ is the error for the p-th data point, $y_p$ is the p-th target output and $\hat{y}_p$ stands for the p-th actual output of the model for this specific data point. For N input-output data pairs, an overall (global) performance index (2) comes as a sum of the errors:
$$E = \frac{1}{N}\sum_{p=1}^{N}(y_p - \hat{y}_p)^2 \qquad (2)$$
As far as learning is concerned, the connections change as in (3):
$$w(\text{new}) = w(\text{old}) + \Delta w \qquad (3)$$
where the update formula follows the gradient descent method (4):
$$\Delta w_i = \eta \cdot \left(-\frac{\partial E_p}{\partial w_i}\right) = -\eta \cdot \frac{\partial E_p}{\partial \hat{y}_p} \cdot \frac{\partial \hat{y}_p}{\partial f_i} \cdot \frac{\partial f_i}{\partial w_i} \qquad (4)$$
Each part of the right side of (4) is expressed in the form (5):
$$-\frac{\partial E_p}{\partial \hat{y}_p} = -\frac{\partial}{\partial \hat{y}_p}(y_p - \hat{y}_p)^2 = 2(y_p - \hat{y}_p), \qquad \frac{\partial \hat{y}_p}{\partial f_i} = 1, \qquad \frac{\partial f_i}{\partial w_i} = \bar{\mu}_i \qquad (5)$$
Overall, $\Delta w_i$ reads in the form (6):
$$\Delta w_i = 2 \cdot \eta \cdot (y_p - \hat{y}_p) \cdot \bar{\mu}_i \qquad (6)$$
with $\eta$ being a positive learning rate. Quite commonly the learning is augmented by a momentum term (that is, $w_i(t) - w_i(t-1)$) with the objective of accelerating convergence.
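A minimal sketch of update (6) with the momentum term, assuming the normalized firing strengths of layer 4 are already available:

import numpy as np

def rfnn_weight_update(w, mu_bar, y_target, y_hat, eta=0.05, alpha=0.9,
                       prev_delta=None):
    """One BP step for the RFNN connection weights (eqs. (3)-(6)):
    delta_w_i = 2 * eta * (y_p - y_hat_p) * mu_bar_i, plus a momentum
    term alpha * (w_i(t) - w_i(t-1)) to accelerate convergence."""
    delta = 2.0 * eta * (y_target - y_hat) * mu_bar
    if prev_delta is not None:
        delta += alpha * prev_delta
    return w + delta, delta

w = np.zeros(4)
mu_bar = np.array([0.4, 0.3, 0.2, 0.1])   # normalized rule activations
w, d = rfnn_weight_update(w, mu_bar, y_target=1.0, y_hat=0.2)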
2.2 A Genetic Optimization of RFNN
The main features of genetic algorithms concern individuals viewed as strings, population-based optimization (search through the genotype space) and a stochastic search mechanism (such as selection and crossover). In order to enhance the learning of the RFPNN and augment its performance, we use genetic algorithms to adjust the learning rate, the momentum coefficient and the parameters of the membership functions of the antecedents of the rules. Here, the GAs use a serial method of binary type, roulette-wheel selection, one-point crossover, and an invert operation as the mutation operator [9].
2.3 Polynomial Neural Networks
We use the PNN in the consequence structure of the RFPNN. Each neuron of the network realizes a polynomial type of partial description (PD) of the mapping between input and output variables. The input-output relation formed by the PNN algorithm can be described by (7):
$$y = f(x_1, x_2, \cdots, x_n) \qquad (7)$$
The estimated output $\hat{y}$ of the actual output $y$ is
$$\hat{y} = \hat{f}(x_1, x_2, \cdots, x_n) = c_0 + \sum_{k_1} c_{k_1} x_{k_1} + \sum_{k_1 k_2} c_{k_1 k_2} x_{k_1} x_{k_2} + \sum_{k_1 k_2 k_3} c_{k_1 k_2 k_3} x_{k_1} x_{k_2} x_{k_3} + \cdots \qquad (8)$$
where the $c_k$'s are the coefficients of the model to be optimized. Depending on the number of input variables, two types of generic PNN architectures are discussed. The structure of the PNN is selected on the basis of the number of input variables and the order of the PD in each layer. The basic RFPNN architectures are shown in Fig. 1. In the advanced type of Fig. 1, the "NOP" node means the A-th node of the current layer that is the same as the corresponding node of the previous layer (NOP denotes no operation). An arrow to the NOP node shows that the corresponding node moves unchanged from the previous layer to the current layer.
2.4 RFPNN Topologies: Combination of RFNN and PNN
The RFPNN is an architecture combining the RFNN and the PNN, as shown in Fig. 1. These networks result as a synergy between the two general constructs RFNN and PNN. The topologies of the RFPNN depend on those of the PNN used for the consequence part of the RFPNN. Let us recall that the RFNN is constructed with the aid of the space partitioning realized by fuzzy relations. We identify two types:
• Generic type of RFPNN: combination of the RFNN and the generic PNN
• Advanced type of RFPNN: combination of the RFNN and the advanced PNN
2.5 The Algorithmic Framework of RFPNN
The design procedure for each layer in the premise and consequence parts of the RFPNN is as follows. We discuss the architecture in detail by considering the functionality of the individual layers (refer to Fig. 1).
Fig. 1. Configuration of the basic RFPNN architecture (combination of RFNN and PNN)
The premise of RFPNN: RFNN
[Layer 1] Distributing the signals to the nodes in the next layer (input layer).
[Layer 2] Computing activation degrees of linguistic labels.
[Layer 3] Computing the fitness of the premise rule: every node in this layer is a fixed node labeled Π, whose output is the product (9) of all the incoming signals:
$$\mu_i = \mu_A(x_1) \times \mu_B(x_k), \qquad A, B = \text{small, large, etc.} \qquad (9)$$
[Layer 4] Normalization of the degree of activation (firing) of the rule, given in (10), where n is the number of rules:
$$\bar{\mu}_i = \frac{\mu_i}{\sum_{i=1}^{n} \mu_i} \qquad (10)$$
[Layer 5] Multiplying the normalized activation degree of the rule by the connection weight, as in (11):
$$f_i = \bar{\mu}_i \times w_i = \frac{\mu_i \cdot w_i}{\sum_{i=1}^{n} \mu_i} \qquad (11)$$
where the $f_i$ are given as the input variables of the PNN, which forms the consequence structure of the RFPNN.
[Layer 6] Computing the output of the RFNN: the output (12) in the 6th layer of the RFNN is inferred by the center of gravity method:
$$\hat{y} = \sum_{i=1}^{n} f_i = \sum_{i=1}^{n} \bar{\mu}_i \cdot w_i = \frac{\sum_{i=1}^{n} \mu_i \cdot w_i}{\sum_{i=1}^{n} \mu_i} \qquad (12)$$
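As a hedged sketch of layers 1-6 (the Gaussian membership family and all parameter values are our assumptions; the paper fixes only that two fuzzy sets are used per input):

import numpy as np
from itertools import product

def rfnn_forward(x, centers, sigmas, w):
    """Layers 1-6: memberships, rule products, normalization, weighted sum.
    centers/sigmas: per-input parameters of two assumed Gaussian fuzzy sets;
    w: one connection weight per rule."""
    # Layer 2: membership grades, two fuzzy sets per input variable.
    grades = [np.exp(-((x[i] - centers[i]) ** 2) / (2 * sigmas[i] ** 2))
              for i in range(len(x))]
    # Layer 3: rule firing strengths as Cartesian products (eq. 9).
    mu = np.array([np.prod(combo) for combo in product(*grades)])
    mu_bar = mu / mu.sum()            # Layer 4: normalization (eq. 10)
    f = mu_bar * w                    # Layer 5: weighting (eq. 11)
    return f, f.sum()                 # Layer 6: output (eq. 12)

x = np.array([0.2, 0.7])
centers = np.array([[0.0, 1.0], [0.0, 1.0]])   # "small"/"large" per input
sigmas = np.array([[0.5, 0.5], [0.5, 0.5]])
f, y_hat = rfnn_forward(x, centers, sigmas, w=np.ones(4))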
The consequence of RFPNN: PNN
[Step 1] Configuration of input variables: $x_1 = f_1, x_2 = f_2, \ldots, x_n = f_j$ ($n = j$, j: rule number).
[Step 2] Forming a PNN structure: we select input variables and fix the order of a PD.
[Step 3] Estimating the coefficients of a PD: the vector of coefficients of the PDs ($C_i$) in each layer is produced by the standard least-squares method and expressed in the form (13):
$$C_i = (X_i^T X_i)^{-1} X_i^T Y \qquad (13)$$
where i denotes the node number. This procedure is implemented repeatedly for all nodes of the layer and also for all layers of the consequence part of the RFPNN.
[Step 4] Choosing PDs, taking the training and testing datasets into consideration: each PD is constructed and evaluated using the training and testing datasets, respectively. Then we compare the values of the performance index and select PDs using an aggregate performance index with a sound balance between approximation and prediction capabilities.
[Step 5] Termination condition: we take into consideration a stopping condition ($E_{min} \geq E_{min}^*$), where $E_{min}$ is the minimal identification error at the current layer while $E_{min}^*$ denotes the minimal identification error at the previous layer.
[Step 6] Determining new input variables for the next layer: the outputs of the preserved PDs serve as new inputs to the next layer. In other words, we set $x_{1i} = z_{1i}, x_{2i} = z_{2i}, \ldots, x_{Wi} = z_{Wi}$. The consequence part of the RFPNN is repeated through steps 3-6.
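Step 3 can be sketched for a hypothetical two-input, second-order PD as follows (we solve the least-squares problem of (13) with a numerical routine rather than the explicit inverse, a purely numerical choice):

import numpy as np

def fit_pd(x1, x2, y):
    """Coefficients of a 2-input, 2nd-order partial description
    z = c0 + c1*x1 + c2*x2 + c3*x1^2 + c4*x2^2 + c5*x1*x2,
    i.e. the least-squares solution C = (X^T X)^(-1) X^T Y of (13)."""
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs, X @ coeffs        # coefficients and fitted outputs

x1 = np.random.rand(60); x2 = np.random.rand(60)
y = 1.0 + 2.0 * x1 - x2 + 0.5 * x1 * x2
c, z = fit_pd(x1, x2, y)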
3 Experimental Studies
In this section, we illustrate the development of the RFPNN and show its performance on the well-known and widely used Medical Imaging System (MIS) dataset in software engineering. We consider a Medical Imaging System (MIS [16]) subset of 390 modules written in Pascal and FORTRAN for modeling. These modules consist of approximately 40,000 lines of code. To design an optimal model from the MIS, we utilize 11 system input variables: LOC, CL, TCHAR, TCOMN, MCHAR, CDHAR, N, NE, NF, VG, and BW. As the output variable, "NC" is used. We also use the 4 system inputs [TCOMN, MCHAR, CDHAR, N] among the 11 system inputs; these 4 inputs are selected via structure identification by the GMDH method.
The first 60% of the data set is used for fitting the models. The remaining 40%, the testing data set, provides for quantifying the predictive quality of the fitted models. For the RFNN structures, GAs help optimize the learning rate, the momentum coefficient, and the parameters of the membership functions. In the RFNN structure, two membership functions are used for each input variable, so the RFNN structure is represented by 16 fuzzy rules for the 4 system inputs. Regression models are constructed by a linear equation. The comparative analysis reveals that the RFPNN comes with high accuracy and improved prediction (generalization) capabilities.
4 Concluding Remarks
In this study, we have introduced a class of RFPNN regarded as a modeling vehicle for nonlinear and complex systems, studied its properties, come up with a detailed design procedure, and used these networks to model the well-known MIS dataset, experimental data widely used in software engineering. The RFPNN is constructed by combining RFNN with PNN. In this sense, we have constructed a coherent platform in which all components of CI are fully utilized. The model is inherently dynamic: the use of the PNN, which comes with a highly versatile architecture, is essential to the process of "growing" the network by expanding its depth. A comprehensive design procedure was developed.
Acknowledgements. Support from the Korean Science & Engineering Foundation (KOSEF: grant No. R05-2000-000-00284-0) and the Natural Sciences and Engineering Research Council of Canada (NSERC) is gratefully acknowledged. The authors gratefully acknowledge the work by Maciej Borkowski in preparing the LaTeX version of this article.
References
1. G.E.P. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters, John Wiley & Sons, 1978.
2. S. Horikawa, T. Furuhashi, and Y. Uchigawa, "On Fuzzy Modeling Using Fuzzy Neural Networks with the Back Propagation Algorithm," IEEE Trans. Neural Networks, Vol. 3, No. 5, pp. 801-806, 1992.
3. N. Imasaki, J. Kiji, and T. Endo, "A Fuzzy Rule Structured Neural Networks," Journal of Japan Society for Fuzzy Theory and Systems, Vol. 4, No. 5, pp. 985-995, 1992 (in Japanese).
4. S. K. Oh, D. W. Kim, and B. J. Park, "A Study on the Optimal Design of Polynomial Neural Networks Structure," The Trans. of the Korean Institute of Electrical Engineers, Vol. 49D, No. 3, pp. 145-156, 2000 (in Korean).
5. T. Ohtani, H. Ichihashi, T. Miyoshi, and K. Nagasaka, "Orthogonal and Successive Projection Methods for the Learning of Neurofuzzy GMDH," Information Sciences, Vol. 110, pp. 5-24, 1998.
6. T. Ohtani, H. Ichihashi, T. Miyoshi, and K. Nagasaka, "Structural Learning with M-Apoptosis in Neurofuzzy GMDH," Proceedings of the 7th IEEE International Conference on Fuzzy Systems, pp. 1265-1270, 1998.
7. T. Yamakawa, "A New Effective Learning Algorithm for a Neo Fuzzy Neuron Model," 5th IFSA World Congress, pp. 1017-1020, 1993.
8. A. G. Ivahnenko, "The group method of data handling: a rival of the method of stochastic approximation," Soviet Automatic Control, Vol. 13, No. 3, pp. 43-55, 1968.
9. D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, 1989.
10. C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, 1995.
11. M. Kearns and D. Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Proc. 10th Ann. Conf. Computational Learning Theory, pp. 152-162, 1997.
12. C.F. Kemerer, "An Empirical Validation of Software Cost Estimation Models," Comm. ACM, Vol. 30, No. 5, pp. 416-429, May 1987.
13. M. Shin and A.L. Goel, "Empirical Data Modeling in Software Engineering Using Radial Basis Functions," IEEE Trans. on Software Engineering, Vol. 26, No. 6, June 2000.
14. B.J. Park, S.K. Oh, and W. Pedrycz, "The Hybrid Multi-Layer Inference Architecture and Algorithm of FPNN Based on FNN and PNN," Joint 9th IFSA World Congress, pp. 1361-1366, 2001.
15. S.K. Oh, T.C. Ahn, and W. Pedrycz, "A Study on the Self-Organizing Polynomial Neural Networks," Joint 9th IFSA World Congress, pp. 1690-1695, 2001.
16. M. R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, pp. 510-514, 1995.
17. Lind, R.K., Vairavan, K. 1989. An experimental investigation of software metrics and their relationships to software development effort. IEEE Trans. on Software Engineering SE-15(5), 649-653.
18. Basili, V. and Perricone, B.T. 1984. Software errors and complexity: An empirical investigation. IEEE Transactions on Software Engineering SE-10(6), 728-738.
19. Khoshgoftaar, T.M., Munson, J.C., Bhattacharya, B.B., Richardson, G.D. 1992. Predictive Modeling Techniques of Software Quality from Software Measures. IEEE Trans. on Software Engineering 18(11), 979-986.
20. Munson, J.C. and Khoshgoftaar, T.M. 1990. Regression modeling of software quality: Empirical investigation. Information and Software Technology 32(2), 106-114.
21. Pedrycz, W., Han, L., Peters, J.F., Ramanna, S., Zhai, R. 2001. Calibration of software quality: Fuzzy neural and rough neural approaches. Neurocomputing 36, 149-170.
22. Pedrycz, W. and Peters, J.F. 1998. Computational Intelligence in Software Engineering. World Scientific Publishing Co. Pte. Ltd., Singapore.
23. Pedrycz, W. and Peters, J.F. 1997. Computational intelligence in software engineering. Proceedings of the Canadian Conf. on Electrical & Computer Engineering, 253-257.
24. Peters, J.F. and Pedrycz, W. 1999. Computational Intelligence. In J.G. Webster, Ed., Encyclopedia of Electrical and Electronic Engineering, 22 vols. John Wiley & Sons, Inc., NY.
Ant Colony Optimization for Navigating Complex Labyrinths
Zhong Yan and Chun-Wei Yuan
Department of Biomedical Engineering, Southeast University, Nanjing 210096, China
{seu_yanz, cwy}@seu.edu.cn
Abstract. Navigating complex labyrinths is difficult and time-consuming. A new way to solve this problem, called Ant Colony Optimization (ACO), is applied here. The main characteristics of ACO are positive feedback, distributed computation and the use of a constructive greedy heuristic. The object of this paper is to apply this approach to navigating complex labyrinths and finding the shortest paths in traffic networks. For these problems, different pheromone updating rules are tested and different versions of ACO are compared. The experiments indicate that ACO with a step-by-step updating rule is more efficient than the others, and the results also suggest that ACO may find wide applications in traffic management.
1 Introduction
Drawing on the mechanisms of biological evolution, many researchers have proposed new heuristic methods to solve complex optimization problems. In the early nineties an algorithm called Ant System (AS) [1] was proposed. This algorithm was derived from the foraging behavior of real ants in nature. It was found that real ants are able to find the shortest path from a food source to the nest by collective behavior mediated through pheromone. Since AS was advanced, it has inspired a great deal of research interest. Researchers have designed improved algorithms [2,3] and applied them to discrete applications [4]. Later, all versions of AS were consolidated into Ant Colony Optimization (ACO) [5], which provides a unitary view of the ongoing research. The object of this paper is to apply this algorithm to labyrinth navigation and traffic routing problems, since it provides a parallel and distributed algorithm for these complex problems. The rest of this paper is organized as follows: Section 2 briefly describes ACO; Section 3 applies ACO to complex labyrinths and reports the experimental results; finally, Section 4 provides a summary and future work.
2 Ant Colony Optimization
To describe ACO briefly, the Traveling Salesman Problem (TSP) is taken as an example. Applied to the TSP, ACO launches a colony of ants that build solutions simultaneously by visiting cities sequentially. After the ants finish their tours, they update the pheromone along the crossed edges by adding a quantity of pheromone proportional to the goodness of the solution. Finally, the algorithm determines whether the termination condition is satisfied; if so, the system exits and reports the results of the trial, otherwise it iterates until the condition becomes true. Because of limited space, more details can be found in [5].
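A minimal sketch of this loop, under the usual Ant System conventions (α and β weight pheromone versus heuristic visibility; all parameter values here are illustrative, not the authors'):

import numpy as np

def aco_tsp(dist, n_ants=20, n_iter=100, alpha=1.0, beta=2.0, rho=0.9, Q=1.0):
    """Ant System for the TSP: build tours, then deposit pheromone offline."""
    n = len(dist)
    tau = np.ones((n, n))                       # pheromone trails
    eta = 1.0 / (dist + np.eye(n))              # heuristic visibility
    best_len, best_tour = np.inf, None
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [np.random.randint(n)]
            while len(tour) < n:
                i = tour[-1]
                mask = np.ones(n, bool); mask[tour] = False
                p = (tau[i] ** alpha) * (eta[i] ** beta) * mask
                tour.append(np.random.choice(n, p=p / p.sum()))
            length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
            tours.append((length, tour))
            if length < best_len:
                best_len, best_tour = length, tour
        tau *= rho                              # evaporation
        for length, tour in tours:              # offline pheromone deposit
            for k in range(n):
                tau[tour[k], tour[(k + 1) % n]] += Q / length
    return best_tour, best_len

dist = np.random.randint(1, 10, (8, 8)).astype(float)
dist = (dist + dist.T) / 2; np.fill_diagonal(dist, 0.0)
tour, length = aco_tsp(dist)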
3 Simulation of ACO for Navigating Complex Labyrinths
For navigating the labyrinth in Fig. 1, ACO with offline pheromone update spends far too much time, so an improvement is needed. When an ant goes into a dead angle and cannot complete a tour, the system stops this ant and lets it return to the starting point, while recording the visited cells in a list. After ants return because of failure, the system's pheromone along their paths is updated and minima are assigned to the pheromone intensity of the cells where the ants were blocked, in order to remind other ants not to go into them again. Certainly, the searching process will
Fig. 1. Using ACO to navigate a complex labyrinth (a); (b)-(e) are derived from (a)

Table 1. Computational results for different versions of ACO

              ACOMZ1      ACOMZ2   ACOMZ3
  Iterations  2000        38       7
  Exit        Not found   Found    Found
terminate once one ant has discovered a valid path. The procedure is demonstrated in Fig. 1. In Fig. 1 we focus on area (b) of labyrinth (a). The bold line denotes a visited path at the end of which an ant has finally become stuck. The dead angle at the right corner is then modified by updating its pheromone intensity to the minimum, marked with the symbol 'X', which can be regarded as a virtual 'wall' in (c). Similarly, going from (c) to (d) costs another ant's life. It follows that the system sacrifices two ants from (d) to (e), and four ants from (b) to (e). Experiments show that the system with the online delayed pheromone update method (referred to as ACOMZ1) takes a long time to find the exit because it must mark all possible dead angles at the cost of the ants' lives. It is thus necessary to speed up convergence. Two mechanisms are considered. One is ACOMZ2, ACO with online step-by-step pheromone update: once an ant is blocked and cannot finish a tour, it updates the system information and marks the dead angle, so that other ants are informed early not to enter this cell. The other is ACOMZ3, based on ACOMZ2: when
an ant runs into a dead angle, it retraces the old path back until a new path appears, while marking all the relevant dead angles with 'X'. In this way only one ant is sacrificed from (b) to (e), but it plays the same role as four ants. The results in Table 1 show that the latter two mechanisms work more efficiently than ACOMZ1, and the best among them is ACOMZ3. Note that the labyrinth is depicted by different kinds of cells; the ants are assumed to start synchronously from a starting point toward an end point; the distance from one cell to another is always 1.0; MaxIter, the maximum number of iterations in one trial, is 2000; and Ants, the number of ants, is 50. To summarize, ACO with different updating rules can solve labyrinths, and the pheromone of ACO, an indirect communication medium, plays an important role in marking dead angles. In the following part, ACO is extended to the routing problem for traffic networks. Finding the shortest path from a single source to a single destination in a traffic network is much more complex than a labyrinth because of the many cycles and dead angles. When employed here, ACOMZ3 must mark all the relevant dead angles whenever an ant enters one. In Fig. 2, the numbers, randomly generated between 1 and 10, are edge weights; the hollow circles are the starting point and the end point, respectively.
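A hedged sketch of the ACOMZ3 rule (the grid encoding, constants, and probabilistic move choice are our assumptions): a stuck ant retraces its path, stamping each relative dead angle with the minimum pheromone so that it acts as a virtual wall for later ants.

import numpy as np

TAU_MIN = 1e-6   # 'X': a virtual wall in pheromone space

def neighbors(cell, maze, tau):
    r, c = cell
    for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
        if 0 <= nr < maze.shape[0] and 0 <= nc < maze.shape[1] \
           and maze[nr, nc] == 0 and tau[nr, nc] > TAU_MIN:
            yield (nr, nc)

def walk_acomz3(maze, tau, start, goal, max_steps=10000):
    """One ant; dead angles are marked during backtracking (ACOMZ3)."""
    path = [start]
    for _ in range(max_steps):
        if path[-1] == goal:
            return path
        options = [c for c in neighbors(path[-1], maze, tau)
                   if c not in path]
        while not options and len(path) > 1:
            tau[path.pop()] = TAU_MIN        # stamp dead angle with 'X'
            options = [c for c in neighbors(path[-1], maze, tau)
                       if c not in path]
        if not options:
            return None                      # no exit reachable
        probs = np.array([tau[c] for c in options])
        path.append(options[np.random.choice(len(options),
                                             p=probs / probs.sum())])
    return None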
Fig. 2. A more complex traffic network. The best path found is denoted by arrows in (b)
Table 2. Average results for Fig. 2

              ACOIR  ACOGR  ACOIP  ACOGP  ACSIR  ACSGR  ACSIP  ACSGP
  Iterations  986.4  962.5  984.5  413.7  1000   1000   1000   1000
  Best path   4.0%   4.0%   2.0%   90%    0.0    0.0    0.0    0.0
  Average     68.84  71.84  68.76  64.54  71.52  71.0   70.36  70.0
To make ACOMZ3 find the best path efficiently, we reinforce the pheromone intensity along the best path of the current iteration or along the global best path found since the beginning of the computation. Moreover, different selection approaches, namely proportional selection and roulette wheel selection as in Genetic Algorithms, are considered for selecting the next node to be visited. In general, proportional selection favors exploration but runs slower than roulette selection. As a result, different versions of ACO are assembled as follows: ACOIR, ACOIP, ACOGR, and ACOGP, in which 'ACO' denotes the algorithm, the fourth letter, 'G' or 'I', expresses the method of pheromone reinforcement, and the last letter, 'P' or 'R', the selection approach.
Ant Colony System (ACS) [2] is an improved variant of ACO. With respect to the selection and pheromone reinforcement options mentioned above, there likewise exist different versions of ACS with online delayed pheromone update, just as for ACO. The results for Fig. 2, averaged over 50 trials, are given in Table 2. In Table 2 the average number of iterations of ACOGP is nearly half that of the other ACOs and ACSs, and ACOGP found the best path in ninety percent of the trials, far more often than the others. Its average solution is 64.54, close to the best value of 64.0. Therefore ACOGP is the best. The parameters, not particularly optimized, are set as follows: α = 1.0, β = 2.0, ρ = 0.9, MaxIter = 100, Ants = 100, and q₀, which distinguishes exploitation from exploration in the pseudo-random-proportional rule of ACS, is 0.9.
4 Conclusions
A novel method using ant colony intelligence to navigate labyrinths has been proposed. In addition to successfully navigating labyrinths, it shows potential for solving the shortest path problem in traffic networks. The experimental results also clearly show that Ant Colony Optimization with online step-by-step pheromone update, pheromone reinforcement of the global best solution and proportional selection is the most efficient at solving complex labyrinths compared with the other versions. In future work, applying this algorithm to dynamic routing problems in traffic networks deserves attention.
Acknowledgments. This paper is financially supported by NSFC (No. 69831010).
References
[1] M. Dorigo, V. Maniezzo, and A. Colorni, "The ant system: optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man and Cybernetics, Part B, 1996, vol. 26, no. 1, pp. 29-41.
[2] L. M. Gambardella and M. Dorigo, "Solving Symmetric and Asymmetric TSPs by Ant Colonies", Proceedings of the IEEE International Conference on Evolutionary Computation, IEEE-EC 96, May 20-22, 1996, Nagoya, Japan, pp. 622-627.
[3] T. Stützle and H. Hoos, "The MAX-MIN ant system and local search for the traveling salesman problem", Proc. IEEE International Conference on Evolutionary Computation, April 13-16, 1997, pp. 309-314.
[4] M. Dorigo, G. Di Caro, and L. M. Gambardella, "Ant algorithms for discrete optimization", Artificial Life, 1999, vol. 5, no. 2, pp. 137-172.
[5] M. Dorigo and G. Di Caro, "Ant colony optimization: a new meta-heuristic", Proc. 1999 Congress on Evolutionary Computation, July 6-9, 1999, pp. 1470-1477.
An Improved Quantum Genetic Algorithm and Its Application
Gexiang Zhang, Weidong Jin, and Na Li
School of Electrical Engineering, Southwest Jiaotong University, Sichuan 610031, P.R. China
[email protected]
Abstract. An improved quantum genetic algorithm (IQGA) is proposed in this paper. In IQGA, the strategies of updating the quantum gate by using the best solution and of introducing a population catastrophe are used. Tests on typical functions show that the convergence speed of IQGA is faster than that of the quantum genetic algorithm (QGA) and several other GAs, and that IQGA can also remedy the prematurity of QGA. Simulations of FIR filter design demonstrate that IQGA is superior to QGA, to the methods in reference [5] and to the traditional method.
1 Introduction

Quantum genetic algorithm (QGA), a genetic algorithm based on the principles of quantum computing [1], has been used to solve combinatorial optimization problems, with good results obtained in references [2,3]. However, when QGA is applied to optimize continuous functions, especially functions with multiple peaks, premature convergence appears easily and the successful convergence rate is only 80%. An improved quantum genetic algorithm (IQGA) is therefore proposed in this paper. In IQGA, the quantum gate is updated using the best solution and a population catastrophe operation is introduced. Test results on typical functions demonstrate that IQGA is superior to QGA, the standard genetic algorithm (SGA), the maintaining-optimum genetic algorithm (MOGA), and a new genetic algorithm (NGA) [4]. IQGA is then used to design a finite impulse response (FIR) bandpass digital filter, and the simulation results show that IQGA outperforms QGA, the method in reference [5], and the traditional method (look-up table method, LUTM).
2 Improved Quantum Genetic Algorithm

In QGA, the best solution of the current generation is maintained and used to update the quantum gate. When that best solution is only a local optimum, the algorithm drops into the local optimum and premature convergence appears. The improvements of QGA are therefore:
1) Maintaining the optimal solution: the best solution is maintained throughout the search to guarantee that the algorithm can converge over the whole solution space.
2) The rotation angles of the quantum rotation gate are changed using the best solution, which accelerates the convergence of the algorithm.
3) A population catastrophe operation is applied: a violent perturbation is added to the population during evolution to make the algorithm escape from a local optimum and begin a new search. The catastrophe keeps only the optimal solution and regenerates the remaining individuals, like the rebirth of a population after all its individuals are killed off in nature. The catastrophe operation does not make the population degenerate; rather, it keeps the population out of stagnation during evolution.

The steps of IQGA are described in detail in the following.
1) Initialization: a population containing n individuals is P(t) = {p_1^t, p_2^t, ..., p_n^t}, where p_j^t (j = 1, 2, ..., n) is an individual of the population at generation t and
$$p_j^t = \begin{bmatrix} \alpha_1^t & \alpha_2^t & \cdots & \alpha_m^t \\ \beta_1^t & \beta_2^t & \cdots & \beta_m^t \end{bmatrix} \qquad (1)$$
where m is the number of qubits. In the initialization, all α_i, β_i (i = 1, 2, ..., m) are set to 1/√2.
2) According to the values of P(t), construct R(t) = {a_1^t, a_2^t, ..., a_n^t}, where a_j^t (j = 1, 2, ..., n) is a binary string of length m and each element of R(t) depends on the probability amplitudes of p_j^t (j = 1, 2, ..., n).
3) Each individual in R(t) is evaluated by the fitness function and the best solution is kept. If the expected solution is obtained, the algorithm ends; otherwise it continues.
4) If the catastrophe condition is true, the population catastrophe operation is executed; otherwise the algorithm continues.
5) Update P(t) using the quantum rotation gate U(t).
6) Set t = t + 1 and return to step (2).
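A minimal sketch of these steps follows (the variable names are ours, the rotation-angle lookup table of QGA is simplified to a single fixed angle, and the catastrophe step is indicated only by a comment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 20                                # population size, qubits
alpha = np.full((n, m), 1 / np.sqrt(2))      # amplitudes of state |0>
beta = np.full((n, m), 1 / np.sqrt(2))       # amplitudes of state |1>

def observe(alpha):
    """Step 2: collapse each qubit; P(bit = 1) = 1 - alpha^2."""
    return (rng.random(alpha.shape) >= alpha ** 2).astype(int)

def rotate_towards(alpha, beta, best_bits, bits, dtheta=0.05 * np.pi):
    """Step 5 (simplified): rotate amplitudes towards the best solution.
    A positive angle increases P(1); matching qubits are left alone."""
    direction = np.where(bits == best_bits, 0.0,
                         np.where(best_bits == 1, 1.0, -1.0))
    theta = direction * dtheta
    a = np.cos(theta) * alpha - np.sin(theta) * beta
    b = np.sin(theta) * alpha + np.cos(theta) * beta
    return a, b
# Step 4 (catastrophe): keep the stored best individual and reset all other
# amplitudes to 1/sqrt(2) when the best fitness has stagnated.
```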
3 Performance Test of IQGA

To prove the effectiveness and advantages of IQGA, a typical function from reference [4] is taken as an example in the following.

Example 1: a function with multiple peaks:

$$f(x) = e^{-0.001x}\cos^2(0.8x), \quad x \ge 0 \qquad (2)$$
The theoretical maximal value of f(x) is FM = f(xM) = max{f(x)} = 1.0 at xM = 0.0, with x ∈ [0, 18] as in reference [4]. The four local optima are f1(x1) = 0.996081 at x1 = 3.926209; f2(x2) = 0.992177 at x2 = 7.853200; f3(x3) = 0.988288 at x3 = 11.78100; and f4(x4) = 0.984415 at x4 = 15.70900. The fitness function of the GAs is F(x) = f(x). The population sizes of NGA [4], SGA and MOGA are 18, and the chromosome length is L = 40, while the population sizes of IQGA and QGA are n = 10 and the number of qubits is m = 20. The results of the 5 algorithms are given in Table 1 (g denotes generations).
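A quick numerical check (ours) confirms that the reconstructed form of equation (2) reproduces the quoted optima:

```python
import numpy as np

f = lambda x: np.exp(-0.001 * x) * np.cos(0.8 * x) ** 2
for x in (0.0, 3.926209, 7.853200, 11.78100, 15.70900):
    print(f"x = {x:9.6f}   f(x) = {f(x):.6f}")
# Yields 1.000000, 0.996081, 0.992177, 0.988288, 0.984414..., matching the
# global maximum and the four listed local maxima to about five decimals.
```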
Table 1. Parts of the results in example 1

  g    SGA xm        SGA Fm       MOGA xm       MOGA Fm      NGA xm        NGA Fm
  1    17.7150955    0.98437571   8.03904247    0.97040866   4.10280512    0.9763333
 15    11.6227932    0.97270059   3.92016493    0.99605780   0.04639857    0.9985764
 35    18.9991989    0.74851423   3.92584982    0.99608101   0.00000000    1.0000000
 45    19.8628068    0.94839894   3.92595402    0.99608106   0.00000000    1.0000000
 55    4.19896507    0.94940662   3.92595402    0.99608106   0.00000000    1.0000000

  g      QGA xm        QGA Fm       IQGA xm         IQGA Fm
  1      17.1730205    0.98437571   8.9384164222    0.4144978667
 13.65   ——            ——           0.0000000000    1.0000000000
 17.9    0.00000000    1.00000000   0.0000000000    1.0000000000

MSCR: QGA 80%, IQGA 100%.
From Table 1 we can draw the following conclusions. 1) SGA is the worst of the algorithms: its results fluctuate violently and in most cases are far from the optimal solution, even far from the second-best optimum. 2) MOGA is better than SGA; at generation 15, Fm = 0.996057 is close to the optimal solution Fm = 1.0, but xm = 3.920164 is close to the first local optimum. This state persists from generation 15 to 55, which demonstrates that MOGA drops into the local optimum (absorbed by the first peak of f(x)). 3) NGA is much better than SGA and MOGA; it obtains the maximal value Fm of f(x), and xm = 0.0000000062 approaches the theoretical optimum xm = 0. 4) The convergence of IQGA and QGA is much faster than that of SGA, MOGA and NGA. In 100 tests, whenever the algorithm converged, IQGA and QGA quickly reached the theoretical optimum, that is, Fm = 1.00000 and xm = 0.000000. But the mean successful convergence rate (MSCR) of QGA is only 80%, and when QGA fails to converge it usually drops into the first local optimum, while the MSCR of IQGA is 100%.
4 Application of IQGA in FIR Filter Design

Example 2 [5]: a bandpass filter is designed using the frequency sampling method. The requirements are: cutoff frequencies ω1s = 0.2π, ω1p = 0.35π, ω2p = 0.65π and ω2s = 0.8π; maximal ripple in the passband Ap = 0.3 dB; and minimal attenuation in the stopband As = 60 dB. For comparison, the number of frequency sample points N is set to 60 and two points, dot1 ∈ [0, 0.4] and dot2 ∈ [0.2, 0.9], are added in the transition band of the filter, the same as in reference [5]. The population size is P = 10 and the number of qubits is m = 10. The results of IQGA, LUTM, IGA [5] and QGA are shown in (a), (b), (c) and (d) of Fig. 1, respectively. We can conclude that IQGA is superior to QGA, IGA and LUTM, because its maximal passband ripple of 0.2513 dB is the smallest and its minimal stopband attenuation rises to 63.4507 dB, which is also the best.
Fig. 1. Amplitude responses (dB) of the filters in example 2
5 Conclusions and Remarks

In this paper, IQGA is proposed to solve continuous function optimization problems. The results of the function tests and the FIR filter design demonstrate that IQGA is better than SGA, MOGA, NGA, IGA and QGA in convergence speed, global search capability and optimization efficiency.
References
1. Tony Hey, Quantum computing: an introduction, Computing & Control Engineering Journal, June 1999, pp. 105–112
2. A. Narayanan and M. Moore, Quantum-inspired genetic algorithms, Proceedings of the IEEE International Conference on Evolutionary Computation, 1999, pp. 61–66
3. K.H. Han, K.H. Park, C.H. Lee and J.H. Kim, Parallel quantum-inspired genetic algorithm for combinatorial optimization problems, Proceedings of the IEEE International Conference on Evolutionary Computation, 2001, pp. 1422–1429
4. Tu Chengyuan, Tu Chengyu, A new genetic algorithm converging to the globally-optimal solution, Information and Control, 2001, 30(2), pp. 116–138
5. Cheng Xiaoping, Yu Shenglin, FIR filter design: frequency sampling technique based on genetic algorithm, Journal of Nanjing University of Aeronautics & Astronautics, 2000, 32(3), pp. 276–281
Intelligent Generation of Candidate Sets for Genetic Algorithms in Very Large Search Spaces
Julia R. Dunphy, Jose J. Salcedo, and Keri S. Murphy
Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Dr., Pasadena, California, USA
{Julia.Dunphy,Jose.Salcedo,Keri.Murphy}@jpl.nasa.gov

Abstract. We have been working on how to select safety measures for space missions in an optimal way. The main limitation on the measures that can be performed is cost. There are often hundreds of possible measures, and each measure has an associated cost and an effectiveness that describes its potential to reduce the risk to the mission goals. A computer search of such an enormous search space is not practical if every combination is evaluated, so we use an evolutionary algorithm to improve the efficiency of the search. A simple approach would yield many solution sets that are wildly expensive and thus infeasible. Preselecting candidates that meet the cost goals reduces the problem to a manageable size. Preselection is based on rough set theory, since cost goals are usually not rigid. This paper describes the methodology for ensuring that every candidate is roughly comparable in cost.
1 Introduction

For several years we, at JPL, have been working on how to select safety measures for space missions in an optimal way. The main limitation on the measures that can be performed is, of course, cost. There are often hundreds of possible measures that may be taken, ranging from simple inspections to using complex environmental simulators. Each measure has an associated cost and an effectiveness that describes its potential to reduce the risk to the mission goals. Making the trades between measures is still mostly manual, and human biases often show up in the final selection. The DDP (Defect Detection and Prevention) project is an attempt to take the human bias out of the process and make all selections of safety measures as optimal as possible. However, a computer search of such an enormous search space is not practical if every combination is evaluated. It was therefore decided to use an evolutionary algorithm to improve the efficiency of the search. A naïve approach might be to take random combinations of measures as candidates, evaluate the utility of each combination, and then apply genetic algorithm methods favoring high effectiveness and low cost. However, this would lead to many solution sets that are wildly expensive and thus infeasible. In fact, this method is quite likely to yield a best-100 solution set without any members that come in within the assigned budget. So we decided to generate only candidates that are affordable from the outset.
2 The DDP Model

A PACT (Prevention, Analysis, Control or Test) is a way of reducing the likelihood of one or more Failure Modes (FMs) occurring. Each requirement is given an arbitrary weight of importance to the project, and an FM, if it occurs, reduces the total recovered requirement weight (i.e., the percentage of the requirement weight that is likely to survive the FMs) [1][2][3][4].

2.1 Problem Statement

How do we choose the PACTs that are the most cost effective? There are typically hundreds of PACTs that might be employed, with complex internal and external interactions. Since a typical solution to the problem may involve 20 to 30 PACTs, the number of combinations is enormous. Even after imposing the constraint that the cost goal be met, the number of possible combinations that do not exceed the budget allocation is still large, but tractable.
3 Sifting the Combinations

A naïve search of all possible candidate sets for those that meet the cost constraint is not practical. We are now actively investigating evolutionary techniques such as genetic algorithms to rank the solutions. The subject of this paper, however, is the methodology for creating the candidate sets; we plan to publish our results concerning the efficiency of the algorithm later [5].

3.1 Binning Strategy

To make the problem tractable, each PACT is assigned to a cost bin that can roughly be thought of as "Low Cost, Inexpensive, Median, Expensive, and Very Expensive". The division into five bins is quite arbitrary, and later we may try a larger number of bins. The lowest and highest costs extracted from all the PACTs are used to determine the cost limits for each bin. After the bin limits are calculated, the individual PACTs are each assigned to a bin based on their cost. Candidate solutions for the evolutionary algorithm are first identified by a key such as {4,1,6,3,2} indicating how many PACTs may be chosen from each bin. This key does not define the actual candidate solution, but only the number of PACTs that can be extracted from each bin in turn to meet the cost limit. We use integer arithmetic, so the actual cost of the candidate is only roughly equal to the maximum cost. The round-off errors can produce quite significant deviations from the maximum cost if they occur in the high-value PACTs. The methodology thus ensures that we won't discard a solution that exceeds our cost maximum by some small amount; it is often possible to get a budget increase if a significant increase in safety is achieved by a relatively small cost increase. The actual candidate is then extracted by another tuple that
Intelligent Generation of Candidate Sets for Genetic Algorithms
455
includes the actual indices within each bin, such as {3,0,5,2,0}. Each value in the second tuple must, of course, be less than the value in the key. An example binning is shown below.
         b1    b2    b3    b4    b5
Ave($k)  5.0   15    25    35    45
Max/Av   10    4     2     2     1

Example keys for a $360k budget (using average values for each bin):
{0,0,0,8,2}:  8 × 35 + 2 × 45 = 370
{20,8,4,1,0}: 20 × 5 + 8 × 15 + 4 × 25 + 45 = 365
Because the costs are unrelated, it is quite possible that some bins have no members. If this is the case, we eliminate the bin but keep the cost limits of the other bins the same. It is also possible that the number of PACT combinations that meet the cost limit is very small, because the cost limit is either too high or too low. If the cost limit is so high that it can only be achieved by very few candidates, we automatically lower it. If the cost limit is too low, we declare that the evolutionary search is not possible. If the cost limit is low enough to leave about 1000 to 10000 possible candidates, we replace the evolutionary search algorithm with a simple exhaustive search using the sequential mode defined below. Each actual candidate solution is extracted by means of an iterator, as described in the next section.
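The binning and key generation described above might look as follows (a sketch with hypothetical names; the real DDP tool is not available in this form):

```python
import random

def make_bins(costs, n_bins=5):
    """Assign each PACT index to a cost bin between min and max cost."""
    lo, hi = min(costs), max(costs)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(costs):
        bins[min(int((c - lo) / width), n_bins - 1)].append(i)
    return [b for b in bins if b]          # drop empty bins

def random_key(bins, costs, budget):
    """Draw per-bin counts; with integer counts and bin-average costs the
    realized cost is only roughly equal to the budget, as noted above."""
    avg = [sum(costs[i] for i in b) / len(b) for b in bins]
    key, remaining = [], budget
    for a, b in zip(avg, bins):
        n = random.randint(0, min(len(b), int(remaining // a)))
        key.append(n)
        remaining -= n * a
    return key
```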
3.2 Iterator

New candidate solutions are created by the iterator, which is created each time a new key is employed. The iterator can work in a random or a sequential mode. In sequential mode the iterator cycles through all the lowest-cost PACTs (those in bin 1) and, when all combinations that meet the key signature have been exhausted, changes the selection from bin 2 by one member and then cycles bin 1 again. This is similar to the way normal counting works (e.g., units exceeding nine create a carry into the tens); a sketch is given after Section 3.4. This mechanism ensures that all possible candidates will eventually be generated and that each candidate has roughly (but not exactly) the same cost. A complete candidate PACT set is then composed of a set of subcandidates, one from each bin whose count in the key is not zero. In random mode, the iterator generates subcandidates randomly in each bin; it is programmed to make each candidate different by keeping track of those previously created. A sequential iterator can completely cover every combination within the bins if called a sufficient number of times.

3.3 Genetic Algorithm

The candidates are assembled into a population. Because of the selection method, they are guaranteed to have roughly the same cost. They are rated by their total recovered-requirement ability (as a percentage of total requirements in a failure-free environment).

3.4 Mutations

The generation-to-generation transformation is based on the work of Colin Williams at JPL on quantum computing circuits. Normal genetic algorithm mutations, crossovers, etc., are employed, along with some new ideas. We use the tuple idea once more to improve the efficiency of the algorithm: in our application it is possible to substitute an entire subcandidate at mutation time if desired. We therefore support both normal (single-PACT) substitution and subcandidate substitution. In either case, we are assured that the cost boundary will be loosely obeyed.
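The sequential mode promised in Section 3.2 can be sketched with standard-library iterators (hypothetical names; `bins` holds PACT indices per bin and `key` the per-bin counts):

```python
from itertools import combinations, product

def sequential_candidates(bins, key):
    """Odometer-style enumeration of all candidates matching a key."""
    per_bin = [list(combinations(b, k)) for b, k in zip(bins, key)]
    # product varies its last factor fastest, so reverse the bin order to
    # make bin 1 (lowest cost) cycle fastest, like units before tens.
    for combo in product(*reversed(per_bin)):
        yield tuple(i for group in reversed(combo) for i in group)
```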
4 Testing We are now testing but have few if any results to report yet.
References 1. [Cornford 2001] S.L. Cornford, M.S. Feather and K.A. Hicks: “DDP – A tool for life-cycle risk management”, Proceedings, IEEE Aerospace Conference, Big Sky, Montana, March 2001, pp. 441–451.
2. [Cornford 2002] S.L. Cornford, J. Dunphy and M.S. Feather, 2002, "Optimizing the Design of end-to-end Spacecraft Systems using failure modes as a currency", IEEE Aerospace Conference, Big Sky, Montana.
3. [Feather 1999] M.S. Feather, B. Sigal, S.L. Cornford & P. Hutchinson: "Incorporating Cost-Benefit Analyses into Software Assurance Planning", Proceedings, 26th IEEE/NASA Software Engineering Workshop, Greenbelt, Maryland, November 27–29.
4. [Feather 2000] M.S. Feather, S.L. Cornford and M. Gibbel: "Scalable Mechanisms for Requirements Interaction Management", Proceedings, 4th IEEE International Conference on Requirements Engineering, Schaumburg, Illinois, June 19–23, 2000, IEEE Computer Society, pp. 119–129.
5. [Rasheed 1997] Khaled Rasheed: Using Case-Based Learning to Improve Genetic-Algorithm-Based Design Optimization. Proceedings of the 1997 International Conference on Genetic Algorithms.
Fast Retraining of Artificial Neural Networks
Dumitru-Iulian Nastac^1 and Razvan Matei^2
1 Turku Centre for Computer Science and Åbo Akademi University, Lemminkäisenkatu 14 B, FIN-20520 Turku, Finland, [email protected]
2 Nokia Oy, Takomotie 1, 00380 Helsinki, Finland, [email protected]
Abstract. In this paper we propose a practical mechanism for extracting information directly from the weights of a reference artificial neural network (ANN). We use this information to train a structurally identical ANN whose global input–output transformation function has changed somewhat. To do so, we reduce the reference network's weights by a scaling factor. An evaluation of the computing effort involved in retraining several ANNs shows that a good choice of the scaling factor can substantially reduce the number of training cycles, independent of the learning method. The retraining mechanism is analyzed for feedforward ANNs with two inputs and one output.
1 Introduction

The ability of artificial neural networks (ANNs) to extract significant information from an initial set of data allows both interpolation at the a priori defined points and extrapolation outside the range bordered by the extreme points of the training set [2]. It is well known that the training process of an ANN requires a large number of processing cycles [1,3], which can occasionally reach and even exceed hundreds of thousands. Suppose that after a while the input–output function of the ANN has to be reconsidered. If we restart from scratch, we will train the network exactly as in the first phase, so the computing effort remains the same. As an alternative, we propose to use the knowledge acquired in the first phase, while designing the network, in order to reduce the re-design effort. The structure of this paper is as follows. In Section 2 we introduce the retraining procedure and explain the working strategy. The main features of our experimental results are given in the next section, where we discuss specific aspects. Our conclusions are formulated in the final section of the paper.
2 Retraining Strategy

The main purpose is to establish how a viable ANN structure from a previous moment in time could be retrained efficiently in order to support modifications of the input–output function. Our proposed procedure reduces the reference network weights
by a scaling factor g (0 < g < 1). These reduced weights are used as initial weights for the training sequence of the second network with the same structure. At the end we compare the network convergence speed (i.e. the number of cycles required until some arbitrary imposed error is reached) obtained in both cases. Note that, before the training phase, the reference network had its weights initialized to some random uniformly distributed values. The retraining mechanism was analyzed in the following cases: same function with same training set (identical with the one used in the first training phase, which preceded the scale reduction process), same function with different training set, and different function with a new training set. In order to properly evaluate the proposed procedure, we tested its effects on various training models such as: BP (back-propagation), momentum, ALR (Adaptive Learning Rate) and the ALR-momentum combination. This way, we try to emphasize that our procedure has similar results regardless of the learning algorithm used. Our main intent is to reduce the number of the learning cycles. The retraining procedure needs the completion of two major steps. Firstly, we have to get the set of weights by inheriting them from the reference ANN, and secondly we have to reduce these weights by a scale factor g (e.g. in range (0.1,0.9)). The newly obtained weights represent the base for the new learning process. In order to decide the optimum scale factor and the consequences of the scale reduction procedure itself, the retraining phase should be performed for more than one value of g.
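In code, the core of the procedure is a one-liner on top of any trainable network; a minimal sketch (ours), assuming the reference weights are available as a list of arrays:

```python
import numpy as np

def scaled_initial_weights(reference_weights, g=0.5):
    """Inherit the reference ANN's weights and shrink them by g (0 < g < 1)
    to form the initial weights of the retraining run."""
    assert 0.0 < g < 1.0
    return [g * np.asarray(w) for w in reference_weights]

# Typical experiment loop (train() is a placeholder for BP, momentum, ALR...):
# for g in np.arange(0.02, 0.92, 0.02):
#     w0 = scaled_initial_weights(ref_weights, g)
#     cycles[g] = train(network, w0, new_training_set, target_error)
```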
3 Results and Discussion

We tested our procedure on a number of three-dimensional functions (see for example Fig. 1). Functions of this kind were approximated by a two-input, one-output network architecture using training sets of 400 points.
Fig. 1. An example of 3D input-output function
Analysis of the simulation data shows that, regardless of the training procedure, as the scale value varies, the number of retraining cycles starts from a value higher than the reference cycle count (marked with a horizontal line) and then progressively decreases as g increases, falling well below the
reference value, especially for g ≥ 0.3 (Fig. 2.a). Sometimes, for values of g higher than 0.6, we see significant jumps in the number of learning cycles that are associated with network paralysis or the over-learning phenomenon (Fig. 2.b).
Fig. 2. The number of training cycles as function of g, with decreasing aspect (a) and overlearning aspect for g > 0.6 (b)
This remark remains valid even if some graphs might surprise us (see Fig. 3.a). At first sight, our observation seems not to hold there. This was due to the fact that we used a relatively small number of scales, L = 9, for g_j (j = 1, ..., L). When we repeated the retraining sequence with L = 45 values of g, from 0.02 to 0.9 (with 0.02 as the increment), we obtained the graph in Fig. 3.b, which confirmed the previous observation. We noticed that a small L leads in some cases to irrelevant conclusions.
Fig. 3. Irrelevant situation for L=9 values of g (a) and an improvement for L=45 (b)
However, our method may not always provide spectacular reductions in the number of training cycles; in particular, the retraining procedure may not be efficient in limiting situations.
4 Conclusions

In this paper, we have proposed a procedure for retraining ANNs that require modifications of their input–output functions. We described a mechanism for extracting information directly from the weights of a reference ANN that was already functional. These weights were reduced by a scale factor g and used as initial weights for the new training sequence. In this way, we obtained a significant decrease in the number of training cycles compared with the classical approach. Based on the simulations performed for multiple combinations of the input parameters, we conclude that the optimal g has its value around 0.5. Increasing this coefficient further is not justified, because of the possibility of over-learning and the implicit paralysis of the neural network; values below 0.5 lead to behaviour very similar to a loss of the memory that scaling is meant to retain. We conclude that the ANN behaviour under the retraining procedure is almost the same regardless of the learning method used. Even if some of the techniques used are more efficient than others, applying the scaling method leads to a roughly similar ratio of decrease in the training cycles. This ratio depends on the analyzed case, more precisely on the neural architecture, the imposed error limit, etc. The results and the graphs were selected in a non-preferential manner from more than 140 retraining simulation sessions. In 31% of the analyzed cases, for g > 0.6, we noticed that the number of training cycles has an ascending trend, which was associated with the over-learning phenomenon. This percentage remained relatively stable across almost all the situations given by modifications of the learning parameters, i.e., the number of layers, the number of cells in each hidden layer, etc. The similarity between the training methods for networks with an arbitrary number of inputs and/or outputs implies that the results are also valid for the case of a highly dimensional space, which allows us to generalize the conclusions regarding the values of g to networks of any dimension.
References
1. Hagan, M.T., Demuth, H.B., Beale, M.: Neural Network Design, PWS Publishing, Boston, MA (1996)
2. Hassoun, M.H.: Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA (1995)
3. Nastac, D.I.: Contributions in Technical Systems Quality Modelling through Artificial Intelligence Methods, Ph.D. dissertation, Polytechnic University of Bucharest (2000)
4. Zhang, Y., Peng, P.Y., Jiang, Z.P.: Stable neural controller design for unknown nonlinear systems using backstepping, IEEE Trans. Neural Networks 6 (2000) 1347–1360
Fuzzy-ARTMAP and Higher-Order Statistics Based Blind Equalization
Dong-kun Jee^1, Jung-sik Lee^1, and Ju-Hong Lee^2
1 School of Electronic & Information Eng., Kunsan National University, Korea, {Enigma64,Leejs}@kunsan.ac.kr
2 School of Computer Science and Engineering, Inha University, Korea, [email protected]
Abstract. This paper discusses a blind equalization technique for FIR channel systems, which may or may not be minimum phase, in digital communication. The proposed technique consists of two parts: estimating the original channel coefficients from fourth-order cumulants of the channel output, and employing a fuzzy-ARTMAP neural network to model an inverse system for the original channel. In simulation studies, the performance of the proposed blind equalizer is compared with both linear and other neural-network-based equalizers.
Keywords: Channel, Blind Equalization, Higher-Order Statistics (HOS), Cumulants, Neural Network, Fuzzy-ARTMAP.
1 Introduction

Over the last three decades, many blind equalizers that do not use a known training sequence have been proposed in the literature, beginning with Sato [1], because there are practical situations in which conventional adaptive algorithms are not suitable for wireless communications, e.g., during an outage caused by severe fading. Most current blind equalization techniques use higher-order statistics (HOS) of the received sequences, directly or indirectly, because they are efficient tools for identifying channels that may be nonminimum phase [2,3]. This paper presents a new method for blind equalization that combines the advantages of HOS and a fuzzy-ARTMAP neural network [5].
2 Blind Fuzzy-ARTMAP Equalizer

Assume that the received signal {y_k} is generated by an FIR system described by

$$y_k = \sum_{i=0}^{p} h_i s_{k-i} + n_k = \hat{y}_k + n_k \qquad (1)$$

where {h_i}, 0 ≤ i ≤ p, is the impulse response of the channel and {s_k} is i.i.d. and non-Gaussian. Here, {s_k} could be a two-level symmetric PAM sequence. The additive noise {n_k} is zero mean, Gaussian, and statistically independent of {s_k}.
Firstly, the autocorrelation technique in [4] is used to estimate the channel order, which is required to specify the number of centers in an RBF equalizer. Secondly, using the cumulant properties, the channel parameters are obtained as

$$\mathbf{H} = (\mathbf{C}^H \mathbf{C})^{-1} \mathbf{C}^H \mathbf{c} \qquad (2)$$

where C and c are the matrix and vector consisting of the estimated fourth-order cumulants, and H is the unknown coefficient vector. Fig. 1 shows the block diagram of the blind fuzzy-ARTMAP equalizer system.
Fig. 1. The structure of blind fuzzy-ARTMAP equalizer system
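Equation (2) is an ordinary least-squares solution; a sketch (ours) of that step, assuming the cumulant matrix C and vector c have already been built from the received data as in the cited HOS literature:

```python
import numpy as np

def estimate_channel(C, c):
    """Solve H = (C^H C)^{-1} C^H c, i.e. the least-squares channel
    estimate from the fourth-order cumulant equations."""
    H, *_ = np.linalg.lstsq(C, c, rcond=None)
    return H
```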
3 Simulation Results

Firstly, the channel order is estimated with two different channel models. Autocorrelations of channel observations were computed, and the results are illustrated in Fig. 2. The following channel is assumed in the training procedure:

$$H(z) = 0.5 + 1.0\,z^{-1} \qquad (3)$$
Fig. 3 shows the error-rate comparison of three kinds of neural network equalizers. As shown in the graph, the performance of the blind fuzzy-ARTMAP equalizer is superior to that of the blind MLP equalizer, while producing results as favorable as those of the blind RBF equalizer.
Fig. 2. Channel order estimation: normalized autocorrelations vs. autocorrelation lag (1000 training sequences), with the confidence interval marked. (a) H(z) = 0.348 + 0.87z^{-1} + 0.348z^{-2}; (b) H(z) = 0.227 + 0.46z^{-1} + 0.668z^{-2} + 0.46z^{-3} + 0.227z^{-4}.
Fig. 3. Error rate comparison: log10(error probability) vs. signal-to-noise ratio (SNR) for the fuzzy-ARTMAP, RBF, and MLP equalizers.
4 Conclusion

In this paper, a blind equalization technique based on higher-order statistics and fuzzy-ARTMAP is discussed. The blind fuzzy-ARTMAP equalizer is fast and easy to train, and it offers capabilities not found in other neural network approaches: a small number of parameters, no need to choose initial weights, automatic growth of hidden units, and the ability to add new input data without retraining previously trained data. Throughout the simulation studies, it was found that the blind fuzzy-ARTMAP equalizer performed noticeably better than the blind
MLP equalizer, while requiring relatively fewer computation steps in training. The superiority of fuzzy-ARTMAP over other neural networks makes the blind fuzzy-ARTMAP equalizer feasible to implement.

Acknowledgements. This paper was partially supported by the Brain Korea 21 Project at Kunsan National University.
References
1. Y. Sato, "A Method of Self-Recovering Equalization for Multilevel Amplitude Modulation Systems," IEEE Trans. Commun., vol. COM-23, pp. 679–682, Jun. 1975.
2. F. B. Ueng and Y. T. Su, "Adaptive Blind Equalization Using Second and Higher Order Statistics," IEEE J. Select. Areas Commun., vol. 13, pp. 132–140, Jan. 1995.
3. S. Mo and B. Shafai, "Blind Equalization Using Higher Order Cumulants and Neural Network," IEEE Trans. Signal Processing, vol. 42, pp. 3209–3217, Nov. 1994.
4. S. Chen, B. Mulgrew, and P. M. Grant, "A Clustering Technique for Digital Communications Channel Equalization Using Radial Basis Function Networks," IEEE Trans. Neural Networks, vol. 4, pp. 570–579, 1993.
5. G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen, "Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps," IEEE Trans. Neural Networks, vol. 3, pp. 698–713, Sep. 1992.
Comparison of BPL and RBF Network in Intrusion Detection System
Chunlin Zhang, Ju Jiang, and Mohamed Kamel
Pattern Analysis and Machine Intelligence Research Group, Systems Design Department, Faculty of Engineering, University of Waterloo, Canada
{c9zhang,j4jiang,mkamel}@uwaterloo.ca
Abstract. In this paper, we present the performance comparison results of the backpropagation learning (BPL) algorithm in a multilayer perceptron (MLP) neural network and the radial basis functions (RBF) network for intrusion detection. The results show that RBF network improves the performance of intrusion detection systems (IDSs) in anomaly detection with a high detection rate and a low false positive rate. RBF network requires less training time and can be optimized to balance the detection and the false positive rates.
1 Introduction
An IDS is a system for monitoring and detecting data traffic or user behaviors in order to identify intruders. There are two primary approaches to detecting intrusions: misuse detection and anomaly detection. Misuse detection attempts to model intrusions to a system by statistical or soft computing methods and then scans the system for their occurrences; it is used to detect well-known intrusions. Anomaly detection works by creating a profile of typical normal network traffic activities or user behaviors and then comparing the profile with the input activities to decide whether an input instance is normal or not. It usually uses a deviation threshold to indicate when a certain established deviation has been reached, and it directly addresses the problem of detecting novel intrusions [1][2]. Artificial neural networks (ANN) have been successful in solving many complex practical problems, but their application to detecting computer intrusions has been very limited [3][4][5][6]. BPL and RBF are two common ANN algorithms. Mainly because of the differences in their activation functions, BPL and RBF perform differently in applications. Some of the drawbacks of BPL include reaching local minima, slow convergence, having to determine the number of hidden layers and nodes, and initializing weights. Furthermore, it is hard to tune the network by analyzing the input data, because there is no intuitive relationship between the input data and the BPL structure. Theoretically, RBF
has some advantages over BPL. RBF can model any nonlinear function using a single hidden layer, which eliminates the need to determine the number of hidden layers and nodes. The simple linear transformation in the output layer can be optimized fully using traditional linear modelling techniques, which are fast and less susceptible to the local-minima problem. The number of hidden nodes and the function parameters of an RBF network can be preset in accordance with prior understanding of the training data or requirements on the output.
2 Experimental Setup and Results
The dataset used in this paper is "KDD Cup 1999 Data", a version of the DARPA 1998 dataset. Intrusions in the dataset fall into four main categories: denial-of-service (DoS), unauthorized access from a remote machine (R2L), unauthorized access to local super-user privileges (U2R), and surveillance and other probing (PROBE) [7]. Both the training data and the testing data of the experiments contain 1000 data instances selected from the dataset. Each sample is unique, with 34 numerical features and 7 symbolic features. The symbolic features were converted to ASCII numbers before being used as training or testing data, and a binary coding method was used to convert these ASCII numbers into small values.

2.1 Experiment 1: Misuse Detection
The objective of experiment 1 is to evaluate the abilities of RBF in misuse detection and to compare them with the results of BPL. The outputs are defined as [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], representing the NORMAL, PROBE, DoS, and R2L patterns respectively. The training of BPL is quite simple, and it is not necessary to normalize the training and testing data; compared with RBF, one advantage of BPL is the simplicity of data preparation. Some parameters need to be set before training, such as the number of training epochs, the number of hidden nodes and layers, the training goal, and the type of activation function. The limitations of BPL are that training the neural network takes a long time and that adding more hidden nodes for new classes is not as easy as with RBF. In the RBF misuse-detection experiment, a C-means clustering algorithm is used in the first phase of the RBF network; in this phase the parameters, such as the number of hidden nodes and the center and covariance of every hidden node, are obtained. In the second phase, the weights between the output nodes and the hidden nodes are trained. Since the outputs are not exactly "1"s or "0"s, the largest element of the output was set to "1". Table 1 shows that BPL performs slightly better than RBF, with a higher detection rate (DR) and a lower false positive (FP) rate.
Table 1. Results of BPL and RBF in Misuse Detection

        OVERALL         PROBE           DoS             R2L
        DR      FP      DR      FP      DR      FP      DR      FP
BPL     99.2%   1.2%    99.2%   1.2%    99.6%   1.2%    98.8%   1.2%
RBF     98%     1.6%    98%     1.6%    98.8%   1.6%    97.2%   1.6%
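The two-phase RBF training described in this experiment can be sketched as follows (our illustration, with k-means standing in for C-means and a simple global spread heuristic; the names are hypothetical):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def train_rbf(X, Y, n_hidden=20, seed=0):
    """Phase 1: cluster to get centers; phase 2: solve output weights."""
    centers, _ = kmeans2(X, n_hidden, minit='++', seed=seed)
    sigma = np.mean(np.linalg.norm(X[:, None] - centers, axis=2))
    Phi = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2
                 / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return centers, sigma, W

def predict(X, centers, sigma, W):
    Phi = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2
                 / (2 * sigma ** 2))
    return (Phi @ W).argmax(axis=1)   # largest output element set to "1"
```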
2.2 Experiment 2: Anomaly Detection
The objective of this experiment is to evaluate whether RBF can overcome the limitations of BPL in anomaly detection. There is only one output node, and a deviation threshold is used to tune the anomaly detection. In the BPL anomaly-detection experiment, the number of hidden nodes is 83 and the maximum number of training epochs is 200; training took two hours. In the RBF structure, only NORMAL instances were used to train the RBF neural network. The RBF network in the experiment has 41 input nodes, 1 hidden node and 1 output node. The range of the deviation threshold of RBF, which separates "1"s from "0"s, is wider than the range in BPL, and the detection and false positive rates changed little as the threshold changed. The reason is that the Gaussian activation function of RBF ensures that instances of the same class are clustered together and magnifies the output difference when instances belong to different classes, so the boundary of the hypersphere of the RBF classifier can be found easily. From Fig. 1, it is obvious that BPL suffers from a low detection rate and a high false positive rate. For example, when the threshold is 0.31, the overall performance of BPL is a 94% detection rate with an 8.8% false positive rate. In R2L intrusion detection, the BPL classifier achieved only an 88% detection rate with a 7.2% false positive rate; in DoS intrusion detection, it achieved a 90% detection rate with an 11% false positive rate. Such results would not be acceptable in a critical network environment. In contrast to BPL, RBF shows excellent performance: its detection rate is even close to the result of misuse detection. However, we cannot ignore the fact that the training and testing data are pre-processed as in [7], which contributes to the results. Another observation is that RBF has a flat slope and a broad threshold range (see Fig. 1). The slopes of FP and DR determine the stability of the performance: a flatter slope means more stable performance. For example, the threshold of RBF can vary from 0.1 to 0.9 while the curves of FP and DR stay almost constant, so performance changes only slightly when the threshold changes. In contrast, if the threshold of BPL varies by even 0.1, its performance changes greatly. Table 2 displays the detailed performance of BPL and RBF when the best-optimized threshold is set for each kind of intrusion. The overall detection rate of BPL is 93.7%, much lower than the 99.2% of RBF, and the false positive rate of BPL is 7.2%, much higher than the 1.2% of RBF. RBF has a very broad threshold range, from 0.1 to 0.9, and an optimized threshold of 0.5 can be chosen for all of the intrusion detections, including novel intrusion detections.
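How the deviation threshold translates the single output into the DR and FP figures can be sketched as follows (ours; it assumes `deviation` already measures each instance's departure from the learned normal profile):

```python
import numpy as np

def dr_fp(deviation, is_intrusion, threshold):
    flagged = deviation > threshold
    dr = np.mean(flagged[is_intrusion])    # detection rate on intrusions
    fp = np.mean(flagged[~is_intrusion])   # false positives on normal data
    return dr, fp

# Sweeping the threshold (e.g. np.linspace(0, 2, 41)) traces the curves of
# Fig. 1; a flat stretch, as with RBF, marks a stable operating region.
```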
Fig. 1. Comparison of BPL and RBF in Anomaly Detection. Left: overall anomaly-detection ROC (detection rate vs. false positive rate) for BPL and RBF. Right: overall anomaly detection, (1−DR) and FP vs. threshold, for BPL and RBF.
It is impractical for BPL to find one threshold suitable for all intrusions. For example, in BPL, when the threshold is set to 0.28, the intrusion detection system achieves the best performance only for DoS intrusions; it then has 12% FP in "Guess Password" intrusion detection and 13% FP in PROBE detection.

Table 2. Detail Results of BPL and RBF in Anomaly Detection

        OVERALL           PROBE             DoS               R2L
        BPL      RBF      BPL      RBF      BPL      RBF      BPL      RBF
TR(a)   0.2-0.4  0.1-0.9  0.3-0.5  0.1-0.9  0.2-0.4  0.1-0.9  0.2-0.4  0.1-0.9
TGP(b)  0.34     0.5      0.43     0.5      0.28     0.5      0.34     0.5
DR      93.7%    99.2%    97%      98.4%    90.4%    100%     88%      100%
FP      7.2%     1.2%     2.8%     1.2%     13.2%    1.2%     7.2%     1.2%

(a) TR - Threshold Range; (b) TGP - Threshold Golden Point
3 Conclusion
Both BPL and RBF can be used in an intrusion detection system. The use of neural networks in IDS results in acceptable detection and misclassification rates for identifying intrusions. ANNs act like black boxes that classify different kinds of instances purely in accordance with what they learn from the training data. This is a significant advantage for IDS because it does not need to construct intrusion patterns mathematically. Both BPL and RBF perform well in identifying well-known intrusion patterns. The results of experiment 1 show that a neural network is an excellent tool for identifying well-known intrusions. The performance of both
BPL and RBF for misuse detection is comparable with other methods, such as rule-based expert-system IDSs. The use of neural networks has the advantage of detecting intrusions that are variants of well-known signatures. Compared with BPL, RBF achieved better performance: RBF-based anomaly detection has an overall 99.2% detection rate and less than a 2% false positive rate. The flat slope and broad threshold range make the intrusion detection system reliable, and the training time is cut to five minutes. One direction that we are pursuing after this study is the use of a hierarchical neural network in IDS, intended as a hybrid intrusion detection system that detects both misuse instances and anomaly instances and adapts to learn new intrusions.
References 1. Jones, A.K., Sielken, R.S.: Computer system intrusion detection: A survey. Technical report, University of Virginia Computer Science Department (1999) 2. Cannady, J.: Next generation intrusion detection: Autonomous reinforcement learning of network attacks. In: Proceedings of the 23rd National Information Systems Security Conference (NISSC 2000). (2000) 3. Cannady, J.: Artificial neural networks for misuse detection. In: Proceedings of the 1998 National Information Systems Security Conference (NISSC'98), October 5-8 1998, Arlington, VA. (1998) 443–456 4. Ryan, J., Lin, M.J., Miikkulainen, R.: Intrusion detection with neural networks. In Jordan, M.I., Kearns, M.J., Solla, S.A., eds.: Advances in Neural Information Processing Systems. Volume 10., The MIT Press (1998) 5. Ghosh, A., Wanken, J., Charron, F.: Detecting anomalous and unknown intrusions against programs. In: Proceedings of the 1998 Annual Computer Security Applications Conference (ACSAC'98), December 1998, Los Alamitos, CA, USA: IEEE Comput. Soc. (1998) 259–267 6. Fan, W., Miller, M., Stolfo, S., Lee, W., Chan, P.: Using artificial anomalies to detect unknown and known network intrusions. In: IEEE Intl. Conf. Data Mining. (2001) 7. Stolfo, S., Fan, W., Lee, W., Prodromidis, A., Chan, P.: Cost-based modeling for fraud and intrusion detection: Results from the JAM project. In: DARPA Information Survivability Conference and Exposition. Volume II., IEEE Computer Press (2000) 130–144
Back Propagation with Randomized Cost Function for Training Neural Networks
H.A. Babri^1, Y.Q. Chen^2, and Kamran Ahsan^3
1 Lahore University of Management Sciences, Computer Science Department, DHA Lahore, Pakistan, Tel: 092-42-5722670, {haroon}@lums.edu.pk, http://www.lums.edu.pk
2 Nanyang Technological University, School of EEE, Block S1, Nanyang Avenue, Singapore 639798, {eyqchen}@ntu.edu.sg
3 Stanford University, Department of Management Science and Engineering, Terman Engineering Center, 3rd Floor, Stanford, California 94305-4026, {kahsan}@stanford.edu
Abstract. A novel method to improve both the generalization and convergence performance of the back propagation algorithm (BP) by using multiple cost functions with a randomizing scheme is proposed in this paper. Under certain conditions, the randomized technique will converge to the global minimum with probability one. Experimental results on benchmark Encoder-Decoder problems and the NC2 classification problem show that the method is effective in enhancing BP’s convergence and generalization performance.
1 Introduction

The backpropagation (BP) algorithm for training multilayered feedforward neural networks is based on gradient-descent minimization of a cost function. Gradient descent techniques are slow and prone to local minima [1], and much research has addressed these problems. QuickProp [2] and its variants [3] use entropy measures to improve learning speed; the delta-bar-delta (DBD) algorithm [4] and the RPROP algorithm [5] reported improvements in convergence times. However, these methods do not necessarily increase the probability of the network escaping from local minima. The incremental or stochastic gradient descent method [6] is better at avoiding local minima but is slow. In this paper, we propose a novel method, Back Propagation with Randomized Cost Functions (BPRCF), to help escape from local minima and improve convergence and generalization performance. BPRCF is described in Section 2, and its ability to find the global minimum is discussed in Section 3. Experimental results comparing BPRCF with BP and delta-bar-delta are given in Section 4.
472
H.A. Babri, Y.Q. Chen, and K. Ahsan
2 Randomizing the Cost Function The cost function ε of the widely used BP algorithm, with tn and xn the desired and th the actual outputs for the n output neuron of the network, is given by:
ε =
N
∑ (t
n
− xn ) 2
(1)
n =1
To minimize the cost function, weights in the network are adapted as:
$$w_{ji}^{(l)}(t+1) = w_{ji}^{(l)}(t) + \eta\,\delta_j^{(l)} x_i^{(l-1)} \qquad (2)$$

with η the learning rate and δ_j^{(l)} the error signal at the j-th neuron in the l-th layer. The above cost function can be generalized to a class of functions of the following form:
$$\varepsilon^* = \sum_{n=1}^{N} c_n (t_n - x_n)^2 \qquad (3)$$
where c_n > 0 are weighting factors on the individual components of the cost function. Each vector (c_1, c_2, ..., c_N) defines one cost function ε*, and different vectors give rise to different cost functions. As long as c_n > 0, all the resulting cost functions pursue the generally desired objective of minimizing a distance metric between the desired and actual outputs. Randomized BP is obtained by choosing a sequence of such vectors that converges to the unit vector over the learning phase. Since each member cost function corresponds to a distinct error landscape, this process uses a randomly changing error surface to provide opportunities to change the search direction and avoid the local minima of the previous landscape. This is different from simulated annealing, which uses noise as a mechanism to disturb the search process on a fixed error surface. To make the c_n approach 1 as learning progresses, each c_n can be drawn from a time-varying probability distribution over [a(t), 2 − a(t)], an interval that shrinks towards 1, where a(t) can be:
$$a(t) = 1 - e^{-k_e t} \qquad (4)$$

Here k_e is a positive parameter which determines the span of the randomization process. The random variable c_n is then obtained as follows:
$$c_n = a(t) + 2\,(1 - a(t))\,x \qquad (5)$$
where x is a random number uniformly distributed on [0,1]. If the landscape changes too fast, the search process might not be able to settle down and exploit the successive landscapes, so each landscape is kept fixed for a finite period C_cp (in epochs). The computational complexity of BPRCF (N random numbers every C_cp epochs and N extra multiplies for the weight updates each epoch) is comparable to BP and significantly less than that of the DBD algorithm [4].
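A sketch (ours) of the randomization machinery in equations (3)–(5); the surrounding training loop and the period C_cp are assumed:

```python
import numpy as np

rng = np.random.default_rng()

def draw_cost_weights(t, N, ke=0.00125):
    """Draw c_1..c_N for epoch t; they concentrate around 1 as t grows."""
    a = 1.0 - np.exp(-ke * t)                    # equation (4)
    return a + 2.0 * (1.0 - a) * rng.random(N)   # equation (5)

def randomized_error(targets, outputs, c):
    return np.sum(c * (targets - outputs) ** 2)  # equation (3)

# In BP the weights c_n simply scale the output-layer error signals:
# delta_n = c_n * (t_n - x_n) * f'(net_n); redraw c every C_cp epochs.
```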
3 Convergence Considerations

Consider a one-dimensional cost function f2 with global minimum g, local minimum l2 and local maximum m2 (see Fig. 1). If the initial weight w0 ∈ (m2, b), then gradient search will find g; otherwise it will be stuck in l2. Assume w0 has a probability
density function (pdf) p(x). Then the probability P(G) of finding the global minimum g is given by:

$$P(G) = \int_{m_2}^{b} p(x)\,dx \qquad (6)$$
If there are two cost functions f1 and f2, BPRCF utilizes them alternately. If f1 and f2 have their global minimum g at the same value of w, two situations arise. If l1 and l2 are on opposite sides of g, then alternate use of f1 and f2 will lead to g. If l1 and l2 are on the same side of g and w0 ∈ (a, m1), the search can alternate between l1 and l2.
Fig. 1. The cost functions f1 and f2 share the same location g for their global minima.
However, if a large number of independent cost functions is used, the probability that all the local minima fall on one side of g is negligible, and in the limit, switching among cost functions ensures that g is found with probability 1. For n independent cost functions {f1, f2, ..., fn}, the probability Pn(G) of finding g satisfies

$$P_n(G) \ge 1 - \left[\,\prod_{i=1}^{n} \int_a^{g} p_{l_i}(x)\,dx + \prod_{i=1}^{n} \int_g^{b} p_{l_i}(x)\,dx\right] \qquad (7)$$
where p_{l_i}(x) is the pdf of the local minimum for the i-th cost function. If all the cost functions have the same pdfs for their local minima, equation (7) reduces to

$$P_n(G) \ge 1 - \left[\left(\int_a^{g} p_l(x)\,dx\right)^{n} + \left(\int_g^{b} p_l(x)\,dx\right)^{n}\right] \qquad (8)$$

Since each product term on the right side is strictly smaller than 1,

$$\lim_{n \to \infty} P_n(G) = 1$$

(see [7] for proof). For example, if ∫_a^g p_l(x) dx = ∫_g^b p_l(x) dx = 0.5, then P_10(G) ≥ 0.998, which indicates that the probability of reaching the global minimum is close to 1 if 10 independent cost functions are used.
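The quoted figure is easy to verify:

```python
# Check of the P_10(G) example: both tails equal 0.5, n = 10.
print(1 - (0.5 ** 10 + 0.5 ** 10))   # 0.998046875, i.e. about 0.998
```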
4 Experimental Results and Future Work

We implemented BP, Delta-Bar-Delta (DBD) and the BPRCF algorithms on a three-layered network and evaluated their performance on the m-n-m Encoder-Decoder problems (Table 1) and NC2, a difficult non-convex pattern recognition problem [6, Chapter 6] (Table 2). For BPRCF, ke and Ccp were set to 0.00125 and 10, respectively, for these problems. BPRCF's performance is not very sensitive to the choice of these parameters, which determine the span of randomization and the interval over which the landscape is stable. Examination of the learning curves showed that BP usually got stuck at a local minimum, while BPRCF avoided local minima by using different cost functions. DBD often avoided some of the local minima, but its behavior was highly unstable as the problem size increased. Hence BPRCF gave the best performance in terms of the mean squared error obtained and the average classification accuracy on the test set (Table 2). More work is being done on BPRCF's performance on other benchmark problems and on heuristics for selecting the randomization and stability parameters.

Table 1. Summary of Results for the Encoder-Decoder Problems (25 trials)

Problem    BP (epochs)          DBD (epochs)         BPRCF (epochs)
           Average    SD        Average    SD        Average    SD
10-5-10    490        20        419        85        125        11
8-3-8      1200       168       601        204       265        55
16-4-16    >2000      >2000     847        500       277        36
32-5-32    >2000      >2000     >2000      >2000     308        34

Table 2. NC2 Problem Results (50 trials)

Algorithm         Average CCR    Mean Error    SD
BP                79.24%         0.038         0.01
Delta-Bar-Delta   90.33%         0.02          0.005
BPRCF             93.14%         0.002         0.001
References
[1] Baldi, P. (1995) "Gradient descent learning algorithm overview: a general dynamical systems perspective", IEEE Trans. Neural Networks, Vol. 6, pp. 182–195
[2] Solla, S.A., Levin, E. and Fleisher, M. (1988) "Accelerated learning in layered neural networks", Complex Systems, Vol. 2, pp. 625–640
[3] M. Joost and W. Schiffmann (1998) "Speeding up backpropagation algorithms by using cross-entropy combined with pattern normalization", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS), 6(2):117–126
[4] Jacobs, R.A. (1988) "Increased rates of convergence through learning rate adaptation", Neural Networks, Vol. 1, pp. 295–307
[5] M. Riedmiller and H. Braun (1993) "A direct adaptive method for faster backpropagation learning: The RPROP algorithm", in Proceedings of the IEEE International Conference on Neural Networks, IEEE Press
[6] Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice Hall, Second Edition
[7] H. Babri and K. Ahsan (2002) "Randomized Cost Functions in Back Propagation Learning", Technical Report CS-002-018, CS Department, LUMS.
Selective Ensemble of Decision Trees
Zhi-Hua Zhou and Wei Tang
National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
[email protected] [email protected]
Abstract. An ensemble is generated by training multiple component learners for the same task and then combining their predictions. In most ensemble algorithms, all the trained component learners are employed in constituting the ensemble. Recently, however, it has been shown that when the learners are neural networks, it may be better to ensemble some instead of all of the learners. In this paper, this claim is generalized to situations where the component learners are decision trees. Experiments show that ensembles generated by a selective ensemble algorithm, which selects some of the trained C4.5 decision trees to make up an ensemble, may be not only smaller in size but also stronger in generalization than ensembles generated by non-selective algorithms.
1 Introduction Ensemble is a learning paradigm where multiple component learners are trained for a same task by a same learning algorithm, and the predictions of the component learners are combined for dealing with future instances. Since an ensemble is often more accurate than its component learners [4, 20, 25], such a paradigm has become a hot topic in recent years and has already been successfully applied to optical character recognition [9, 17], face recognition [13, 16], scientific image analysis [1, 6], medical diagnosis [7, 26], etc. In general, an ensemble is built in two steps, that is, training multiple component learners and then combining their predictions. According to the styles of training the component learners, current ensemble algorithms can be roughly categorized into two classes, that is, algorithms where component learners must be trained sequentially, or algorithms where component learners could be trained in parallel. The representative of the first category is AdaBoost [11], which sequentially generates a series of component learners where the training instances that are wrongly predicted by a component learner will play more important role in the training of its subsequent learner. Other representatives of this category include Arc-x4 [5], fBoost [14], MiniBoost [21], MultiBoost [24], etc. The representative of the second category is Bagging [4], which utilizes bootstrap sampling [10] to generate multiple training sets from the original training set and then trains a learner from each generated training set. Other representatives of this category include SEQUEL [1], Wagging [2], pBagging [2], GASEN [27], etc.
It is worth mentioning that, after obtaining multiple learners, most ensemble algorithms employ all of them to constitute an ensemble. Although such a scheme works well [2, 8, 18], some researchers [27] have recently shown that when the component learners are neural networks, it may be better to ensemble some instead of all of them. In this paper, this claim is generalized to situations where the component learners are decision trees. Experiments show that ensembles generated by a selective ensemble algorithm may have not only smaller size but also stronger generalization ability than those generated by non-selective ensemble algorithms.

The rest of this paper is organized as follows. In Section 2, a practically feasible selective ensemble algorithm, i.e. GASEN-b, is proposed. In Section 3, experiments comparing GASEN-b against some non-selective ensemble algorithms are reported. Finally, in Section 4, the main contribution of this paper is summarized and several issues for future work are indicated.
2 GASEN-b

Since it is difficult to identify the component learners that should be excluded from the ensemble, Zhou et al. [27] proposed an algorithm named GASEN to build selective ensembles. At first, GASEN assigns a random weight to each of the available component learners. Then, it employs a genetic algorithm to evolve those weights so that they can characterize, to some extent, the fitness of the learners for joining the ensemble. Finally, it selects the learners whose weight is bigger than a preset threshold to constitute the ensemble. Each individual in the evolving population is a weight vector w = (w_1, w_2, ..., w_T), where w_i is the weight for the i-th component learner. In order to evaluate the goodness of the individuals, a validation data set is employed. Let E_w^V be the validation error of the ensemble corresponding to the individual w on the validation set V. It is obvious that E_w^V can express the goodness of w, in the sense that the smaller E_w^V is, the better w is. So, GASEN uses f(w) = 1/E_w^V as the fitness function.

Here GASEN is modified so that, instead of the weight representation, i.e. assigning a weight to each component tree and then selecting the trees according to the evolved weights, bit strings are used, where "1" denotes a tree appearing in the ensemble and "0" denotes its absence. Such a bit representation removes the need to manually set the threshold for selecting the component learners according to their evolved weights. Moreover, since evolving shorter strings is faster than evolving longer ones, GASEN-b may be faster, because the strings generated in the bit representation are significantly shorter than those generated in the weight representation. The modified algorithm is called GASEN-b, i.e. GASEN with bit representation. This algorithm is shown in Fig. 1: T bootstrap samples S_1, S_2, ..., S_T are generated from the original training set, a component learner C_t is trained from each S_t, and an ensemble C* is built from C_1, C_2, ..., C_T whose output is the class label receiving the largest number of votes.
Input: training set S, learner L, trials T, validation set S^V with size m
Output: ensemble C*
Process:
1. for t = 1 to T {
2.   S_t = bootstrap sample from S
3.   C_t = L(S_t)
4. }
5. generate a population of bit strings
6. evolve the population, where the fitness of a string b is measured as
     f(b) = 1/E_b = m / Σ_{x_i ∈ S^V} 1( argmax_{y∈Y} Σ_{b_t=1: C_t(x_i)=y} 1 ≠ y_i )
7. b* = the evolved best bit string
     C*(x) = argmax_{y∈Y} Σ_{b_t*=1: C_t(x)=y} 1

Fig. 1. The GASEN-b algorithm.
Note that, if there are enough training data, employing a separate validation set, i.e. S^V, may help obtain better performance; otherwise S^V may be generated by bootstrap sampling from the training set.

It is worth mentioning that, besides S, L, T, and S^V, there are several other parameters that should be set in GASEN-b. These mainly include the size of the population, the maximum number of generations, the crossover probability, and the mutation probability. Although the setting of these parameters may affect the efficiency of GASEN-b, we believe that optimizing them is a matter for the genetic algorithm community. Fortunately, it was shown [27] that even without finely tuning these parameters, GASEN could already obtain good results by utilizing publicly available GA tools to implement the genetic algorithm and simply setting all the parameters to their default values.
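For concreteness, the following is a minimal Python sketch of the GASEN-b selection step of Fig. 1; it is our illustration, not the authors' implementation. It assumes the T component trees have already been trained, that class labels are integer-encoded, and that the trees' predictions on the validation set are precomputed in a T x m array preds. The GA settings follow those reported later in Section 3.2 (population 30, 150 generations, crossover 0.6, mutation 0.1).

import numpy as np

def subensemble_error(bits, preds, yv):
    """Validation error E_b of the trees whose bit is 1, under majority vote."""
    chosen = preds[bits.astype(bool)]
    if chosen.size == 0:
        return 1.0                                    # empty ensembles get worst fitness
    maj = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, chosen)
    return float(np.mean(maj != yv))

def gasen_b(preds, yv, pop=30, gens=150, pc=0.6, pm=0.1, seed=0):
    rng = np.random.default_rng(seed)
    T = preds.shape[0]
    P = rng.integers(0, 2, (pop, T))                  # random initial bit strings
    for _ in range(gens):
        err = np.array([subensemble_error(b, preds, yv) for b in P])
        fit = 1.0 / (err + 1e-8)                      # f(b) = 1 / E_b
        parents = P[rng.choice(pop, pop, p=fit / fit.sum())]
        children = parents.copy()
        for i in range(0, pop - 1, 2):                # one-point crossover
            if rng.random() < pc:
                cut = int(rng.integers(1, T))
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        children[rng.random(children.shape) < pm] ^= 1    # bit-flip mutation
        P = children
    err = np.array([subensemble_error(b, preds, yv) for b in P])
    return P[err.argmin()]                            # evolved best bit string b*

Here gasen_b returns the evolved best bit string b*; the final ensemble C* predicts by majority vote over the trees whose bit is 1, as in Fig. 1.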
3 Experiments

3.1 Compared Algorithms

In our experiments, GASEN-b is compared against Bagging [4], AdaBoost [11], and Arc-x4 [5], all of which are non-selective ensemble algorithms, i.e. algorithms that use all the component learners to constitute an ensemble.

Bagging (Bootstrap aggregating) was proposed by Breiman [4]. It employs bootstrap sampling to generate several training sets from the original training set, and then trains a component learner from each generated training set. The component predictions are often combined via majority voting.
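As an illustration, here is a minimal Python sketch of Bagging with majority voting. It uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 (an assumption; the paper uses C4.5 trees) and assumes NumPy arrays with integer-encoded class labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=20, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(X), len(X))         # bootstrap sample S_t
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    votes = np.array([t.predict(X) for t in trees])   # one row of votes per tree
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)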
AdaBoost (Adaptive Boosting) was proposed by Freund & Schapire [11]. It sequentially generates a series of component learners, where the training instances that are wrongly predicted by a learner play a more important role in the training of its subsequent learner. The component predictions are combined via weighted voting, where the weights are determined by the algorithm itself.

Arc-x4 belongs to the family of algorithms called Arcing (Adaptively resample and combine) proposed by Breiman [5]. It is similar to AdaBoost in that it also sequentially generates a series of component learners. However, the weight of an instance is proportional to the number of misclassifications made on it by all the previous component learners, raised to the fourth power, plus one. Moreover, the component predictions are combined via majority voting instead of weighted voting.

Note that AdaBoost requires a weak learner whose error is bounded by a constant strictly less than 0.5. In practice, this requirement cannot be guaranteed, especially when dealing with multi-class tasks. It is also worth mentioning that if the successive training sets are generated by re-weighting [20], then, when the error of a component learner reaches zero (which often occurs in decision tree induction), all the following component learners will be replications, because no instance weight changes. To overcome these problems, in our experiments, if the error of a component learner exceeds 0.5, the learner is aborted and a new learner is trained from a bootstrap sample of the original training set; if the error of a component learner reaches zero, the next learner is trained from a bootstrap sample of the original training set. In both cases, the total number of bootstrap samplings is at most equal to the number of trials. Moreover, in order to prevent floating-point overflow, as suggested by Webb [24], the smallest instance weight is bounded below by 10^{-8}.

3.2 Methodology

In our experiments, for Bagging, AdaBoost, and Arc-x4, each ensemble contains twenty C4.5 decision trees. But for GASEN-b, the component learners are selected from twenty trees, i.e. the number of C4.5 decision trees contained in an ensemble generated by GASEN-b is often far less than twenty. The parameters of the genetic algorithm utilized by GASEN-b are set as follows: the maximum number of generations is 150, the crossover probability is 0.6, the mutation probability is 0.1, and the population size is 30. Note that the population size is the number of bit strings kept in each generation; each string contains 20 bits, each of which corresponds to a component decision tree. The validation data set used by GASEN-b has the same size as its training data set, from which it is bootstrap sampled.

Fifteen data sets from the UCI machine learning repository [3] are used. The number of classes ranges from 2 to 7, the number of attributes from 8 to 38, and the number of examples from 214 to 7,200. 10-fold cross validation is performed on each data set, where ten runs are performed in each fold and the result of the fold is the average result of those ten runs. The training sets of the ensembles are bootstrap sampled from the training set of the fold. In order to increase the diversity of the ensembles, the size of their training sets is roughly half of that of the fold.
In detail, suppose we have a data set comprising 1,000 instances. For 10-fold cross validation, the training set in each fold comprises 900 instances. Then, for each fold, 10 data sets with 450 instances are sampled as the training sets for 10 runs. In each run, all the compared ensemble algorithms build an ensemble from the same 450 instances. The validation set used by GASEN-b is also sampled from the 450 instances. The reason why we do not use the original entire data set as the training set directly is that, in order to compare the average performance of the algorithms, we hope to perform 10 runs. In order to make the average of the 10 runs meaningful, random samples should be used for the 10 runs. However, since all the data are from the same data set, the samples are opportunity samples instead of random samples. Using half of the volume of the original data set improves the randomness of the opportunity samples, so that the average performance may be more reliable. Such a scheme was adopted by Bauer & Kohavi [2] before. For comparison, we also test a single C4.5 decision tree in each fold, whose training data set is generated in the same way as those of the ensembles.

3.3 Results

The experimental results are shown in Table 1. Note that for single C4.5 decision trees, we record their predictive error.

Table 1. Error ratios against a single C4.5 tree for Bagging, AdaBoost, Arc-x4, and GASEN-b

Data Set     single  Bagging       AdaBoost       Arc-x4        GASEN-b
             tree    ratio  std    ratio   std    ratio  std    ratio  std   size
Glass        .371    .867   .072   .833    .057   .773   .094   .802   .093   7.43
Ionosphere   .117    .868   .086   .770    .228   .865   .183   .861   .100   7.95
WDBC         .075    .698   .156   .621    .210   .691   .201   .694   .172   7.30
Credit       .260    .918   .038   .955    .060   .891   .065   .907   .039   7.71
Diabetes     .288    .869   .059   .938    .061   .874   .058   .856   .060   8.38
Annealing    .143    .894   .053   .773    .073   .844   .077   .820   .093   7.29
Vehicle      .313    .921   .047   .802    .057   .830   .078   .910   .060   8.08
Heart        .213    .922   .022   .975    .054   .923   .042   .920   .026   8.31
German       .345    .898   .044   1.012   .053   .897   .041   .894   .047   7.38
Image        .054    .759   .096   .493    .080   .625   .106   .701   .089   7.83
Hypothyroid  .014    .873   .089   1.296   .431   .882   .215   .855   .124   8.75
Allhypo      .025    .904   .122   1.809   .563   1.102  .154   .846   .089   7.18
Waveform21   .257    .742   .028   .674    .026   .666   .023   .754   .029  12.39
Page         .036    .864   .048   .893    .062   .790   .058   .836   .042   7.89
Ann-thyroid  .018    .859   .056   .952    .089   .902   .066   .844   .071   7.12
Ave.         .189    .857   .068   .920    .140   .837   .097   .833   .076   8.07
Since we are interested in the relative performance of the ensemble algorithms, the predictive errors of Bagging, AdaBoost, Arc-x4, and GASEN-b have been normalized according to that of the single C4.5 decision trees. In other words, Table 1 shows the error ratios against single C4.5 trees for the ensemble algorithms, obtained by dividing the predictive error of each ensemble algorithm by that of the single trees. The standard deviations of those error ratios are shown in the columns titled "std". It is also worth mentioning that the sizes of the ensembles generated by Bagging, AdaBoost, and Arc-x4 are always twenty, that is, each ensemble contains twenty C4.5 decision trees, whereas the sizes of the ensembles generated by GASEN-b are shown in Table 1.

Table 1 reveals that the performance of GASEN-b is significantly better than that of the compared ensemble algorithms, because it achieves stronger generalization with smaller ensembles. In detail, GASEN-b is better than Bagging on nine data sets, comparable to Bagging on five data sets, and worse than Bagging on only one data set; it is better than AdaBoost on nine data sets and worse than AdaBoost on six data sets; it is better than Arc-x4 on six data sets, comparable to Arc-x4 on four data sets, and worse than Arc-x4 on five data sets. Moreover, the average size of the ensembles generated by GASEN-b is roughly 40% (8.07/20.0) of that of the ensembles generated by the other algorithms. These results explicitly support the claim that when the component learners are decision trees, selective ensembles may be superior to non-selective ones.

Table 1 also shows that Bagging and GASEN-b can always improve the generalization. As for AdaBoost, there are cases such as Image where it greatly improves the generalization, but there are also cases such as Allhypo where it seriously deteriorates the generalization. This reflects that AdaBoost is not a stable algorithm, which has already been observed in previous empirical studies [2, 8, 19, 27]. As for Arc-x4, although there are cases such as Allhypo where it deteriorates the generalization, it is more stable than AdaBoost, and its average error ratio is only slightly worse than that of GASEN-b.
4 Conclusion

At present, most ensemble algorithms utilize all the trained learners to make up an ensemble. This paper shows that when the learners are decision trees, it may be better to build selective ensembles, that is, ensembles containing some instead of all of the trained decision trees. In order to show the feasibility of selective ensembles of decision trees, this paper presents the GASEN-b algorithm. Experiments show that ensembles generated by GASEN-b may be not only smaller in size but also stronger in generalization than non-selective ensembles.

Some researchers [18, 22] have investigated the problem of pruning boosted decision trees to reduce the complexity of the final ensemble. However, it has been proved that the boosting pruning problem is NP-complete and is even hard to approximate [22], and the pruning may sacrifice the generalization ability of the final ensemble [18, 22]. On the other hand, through employing rough set reducts, Hu [15] has demonstrated that ensembles better than those generated by boosting or bagging can be built with a limited number of decision trees. In fact, in the context of ensemble
pruning, GASEN-b could be regarded as a method for pruning bagged decision trees. It is interesting that, from the analysis and experimental results presented in this paper, it is evident that pruning bagged decision trees may improve the generalization ability of the final ensemble. Thus, the different effects of pruning on the generalization ability of boosted versus bagged decision trees may serve as evidence that boosting and bagging have different natures.

An advantage of GASEN-b that has not been put forth in this paper is that, although it is a general algorithm, it has the potential to incorporate domain-specific knowledge in a way investigated by the genetic algorithm community for a long time [12], that is, by employing coding schemes, fitness functions, and genetic operators devised according to domain-specific knowledge. This is an interesting issue to be explored in real-world applications in the future.

It has been found that combining component class probabilities may be better than combining component class labels [23]. So, powerful selective ensemble algorithms may be developed from the aspect of combining component class probabilities, which is also an interesting issue to be explored in the future.

Acknowledgements. The comments and suggestions from the anonymous reviewers greatly improved this paper. This work was supported by the National Natural Science Foundation of China under grant number 60273033, and the Natural Science Foundation of Jiangsu Province, China, under grant number BK2001202.
References
1. Asker L., Maclin R.: Ensembles as a sequence of classifiers. In: Proceedings of the 15th International Joint Conference on Artificial Intelligence (1997) 860–865
2. Bauer E., Kohavi R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36 (1999) 105–139
3. Blake C., Keogh E., Merz C.J.: UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine, CA (1998)
4. Breiman L.: Bagging predictors. Machine Learning 24 (1996) 123–140
5. Breiman L.: Arcing classifiers. Annals of Statistics 26 (1998) 801–849
6. Cherkauer K.J.: Human expert level performance on a scientific image analysis task by a system using combined artificial neural networks. In: Proceedings of the AAAI-96 Workshop on Integrating Multiple Models for Improving and Scaling Machine Learning Algorithms (1996) 15–21
7. Cunningham P., Carney J., Jacob S.: Stability problems with artificial neural networks and the ensemble solution. Artificial Intelligence in Medicine 20 (2000) 217–225
8. Dietterich T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 (2000) 139–157
9. Drucker H., Schapire R., Simard P.: Improving performance in neural networks using a boosting algorithm. In: Hanson S.J., Cowan J.D., Giles C.L. (eds.): Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA (1993) 42–49
10. Efron B., Tibshirani R.: An Introduction to the Bootstrap. Chapman & Hall, New York (1993)
11. Freund Y., Schapire R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Proceedings of the 2nd European Conference on Computational Learning Theory (1995) 23–37
12. Goldberg D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
13. Gutta S., Wechsler H.: Face recognition using hybrid classifier systems. In: Proceedings of the International Conference on Neural Networks (1996) 1017–1022
14. Harries M.: Boosting a strong learner: evidence against the minimum margin. In: Proceedings of the 16th International Conference on Machine Learning (1999) 171–179
15. Hu X.: Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications. In: Proceedings of the IEEE International Conference on Data Mining (2001) 233–240
16. Huang F.J., Zhou Z.-H., Zhang H.-J., Chen T.H.: Pose invariant face recognition. In: Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (2000) 245–250
17. Mao J.: A case study on bagging, boosting and basic ensembles of neural networks for OCR. In: Proceedings of the International Joint Conference on Neural Networks (1998) 1828–1833
18. Margineantu D., Dietterich T.G.: Pruning adaptive boosting. In: Proceedings of the 14th International Conference on Machine Learning (1997) 211–218
19. Opitz D., Maclin R.: Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11 (1999) 169–198
20. Quinlan J.R.: Bagging, boosting, and C4.5. In: Proceedings of the 13th National Conference on Artificial Intelligence (1996) 725–730
21. Quinlan J.R.: Miniboosting decision trees. http://www.cse.unsw.edu.au/~quinlan/miniboost.ps
22. Tamon C., Xiang J.: On the boosting pruning problem. In: Proceedings of the 11th European Conference on Machine Learning (2000) 404–412
23. Ting K.M., Witten I.H.: Issues in stacked generalization. Journal of Artificial Intelligence Research 10 (1999) 271–289
24. Webb G.I.: MultiBoosting: a technique for combining boosting and wagging. Machine Learning 40 (2000) 159–196
25. Wolpert D.: Stacked generalization. Neural Networks 5 (1992) 241–259
26. Zhou Z.-H., Jiang Y., Yang Y.-B., Chen S.-F.: Lung cancer cell identification based on artificial neural network ensembles. Artificial Intelligence in Medicine 24 (2002) 25–36
27. Zhou Z.-H., Wu J., Tang W.: Ensembling neural networks: many could be better than all. Artificial Intelligence 137 (2002) 239–263
A Maximal Frequent Itemset Algorithm¹
Hui Wang, Qinghua Li, Chuanxiang Ma, and Kenli Li
Computer School, Huazhong University of Science and Technology, Wuhan 430074, P.R. China
[email protected]
Abstract. We present MinMax, a new algorithm for mining maximal frequent itemsets (MFI) from a transaction database. It is based on iterative depth-first traversal, and combines a vertical tidset representation of the database with effective pruning mechanisms. MinMax removes all the non-maximal frequent itemsets to obtain the exact set of MFI directly, without enumerating all the frequent itemsets from smaller ones step by step, and it backtracks to the proper ancestor directly rather than level by level. We found MinMax to be more effective than GenMax, a state-of-the-art algorithm for finding maximal frequent itemsets, at pruning the search space to obtain the exact set of MFI.
1 Introduction

Mining frequent itemsets is a fundamental and essential problem in many data mining applications such as consumer market-basket analysis, inferring patterns from web page access logs, network intrusion detection, and many other important data mining tasks. The problem can be formulated as follows: given a large database of transactions, each a set of items, find all frequent itemsets, where a frequent itemset is one that occurs in at least a user-specified percentage of the database.

Many of the proposed itemset mining algorithms are variants of Apriori [1], which employs a bottom-up, breadth-first search that enumerates the frequent k-itemsets to obtain the (k+1)-itemsets, k = 1, 2, ..., step by step, until all the maximal frequent itemsets are obtained. Counting a k-itemset means counting all its 2^k − 2 possible proper subsets. In many applications k can easily be 30 or 40 or longer, so this approach is not scalable for long itemsets. Since the maximal frequent itemsets contain all the frequent ones, it is wise to mine the maximal frequent itemsets directly. Thus, there has been recent interest in mining maximal frequent patterns [2,3,4,5].

DepthProject [2] finds long itemsets using a depth-first search of a lexicographic tree of itemsets, uses a counting method based on transaction projections along its branches, and also uses the look-ahead pruning method with item reordering. It returns a superset of the set of MFI and requires post-pruning to eliminate non-maximal patterns.
¹ This paper is supported by the National Natural Science Foundation of China under Grant No. 60273075.
MaxMiner [3] employs a breadth-first traversal of the search space; it reduces database scanning by employing a lookahead pruning strategy, i.e., if a node together with all its extensions can be determined to be frequent, there is no need to further process that node. It also employs an item (re)ordering heuristic to increase the effectiveness of superset-frequency pruning. Since MaxMiner uses the original horizontal database format, it can perform the same number of passes over the database as Apriori does.

Mafia [4] uses several pruning strategies to remove non-maximal sets, a vertical bit-vector data format, and compression and projection of bitmaps to improve performance. Mafia mines a superset of the set of MFI, and also requires a post-pruning process to get the exact set of MFI.

GenMax [5] is the most recent method for mining the MFI. It is a backtracking-search-based algorithm, uses a number of optimizations to prune the search space, and is good at returning the exact set of MFI. But GenMax has to backtrack level by level, and does some unnecessary exploration of the search space.

In this paper we introduce MinMax, a new algorithm for mining maximal frequent itemsets from a transaction database. It is based on iterative depth-first traversal, and combines a vertical tidset representation of the database with effective pruning mechanisms. MinMax removes all the non-maximal frequent itemsets to obtain the exact set of MFI directly, without enumerating all the frequent itemsets from smaller ones step by step, and it backtracks to the proper ancestor directly rather than level by level. We found MinMax to be more effective than GenMax at pruning the search space to obtain the exact set of MFI.
2 Preliminaries

The problem of mining maximal frequent patterns can be formally stated as follows. Let I = {i_1, i_2, ..., i_m} be a set of m distinct items. A set X ⊆ I is called an itemset; an itemset with k items is called a k-itemset. Let D denote a database of transactions, where each transaction has a unique identifier (tid) and contains a set of items. The set of all tids is denoted T = {t_1, t_2, ..., t_n}. The set tidset(X) ⊆ T, consisting of the tids of all transactions that contain X as a subset, is called the tidset of X. For convenience we write an itemset {x,y,z} as {xyz}, and its tidset {1,2,3,4} as {1234}. The support of an itemset X, denoted σ(X), is the number of transactions in which the itemset occurs as a subset; thus σ(X) = |tidset(X)|. An itemset is frequent if its support is greater than or equal to some threshold minimum support (minsup) value, i.e., if σ(X) ≥ minsup. We denote by F_k the set of frequent k-itemsets, and the set of all frequent itemsets by FI. A frequent itemset is called maximal if it is not a subset of any other frequent itemset. The set of all maximal frequent itemsets is denoted MFI; it is orders of magnitude smaller than FI. Given a user-specified minsup value, our goal is to efficiently enumerate all patterns in MFI.
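To make the tidset notation concrete, the following Python sketch (our illustration, not part of the paper) builds tidsets from a horizontal database and computes σ(X) by intersecting the items' tidsets; the names D, tids, and support are illustrative.

def tidsets_from_db(D):
    """D maps each tid to its set of items; returns item -> tidset."""
    tids = {}
    for tid, items in D.items():
        for item in items:
            tids.setdefault(item, set()).add(tid)
    return tids

def support(itemset, tids):
    """sigma(X) = |tidset(X)|: intersect the tidsets of X's items (X non-empty)."""
    it = iter(itemset)
    common = set(tids.get(next(it), set()))
    for item in it:
        common &= tids.get(item, set())
    return len(common)

For example, with the paper's convention of writing the itemset {x,y,z} as {xyz}, support({'D','E','C'}, tids) >= minsup tests whether DEC is frequent.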
3 Algorithm Descriptions

The basic idea of the algorithm MinMax is to find maximal frequent itemsets as soon as possible and to use them to prune away the non-maximal frequent itemsets, i.e., those that have a superset in the set of MFI. This narrows the search space quickly and strongly. MinMax employs several techniques to make this process more efficient. Firstly, it orders F1 with the aim of making the leftmost branches as slim as possible, which helps to get a maximal frequent itemset as soon as possible. Secondly, when a maximal frequent itemset has been found, MinMax backtracks to its oldest ancestor that has the same set of items as the newest maximal frequent itemset, and goes back to the brother of that ancestor directly, without backing up level by level as GenMax does. MinMax prunes away the other branches rooted at that ancestor, which avoids unnecessary exploration. Thirdly, it makes full use of infrequent itemsets to avoid extending them further, which also helps to narrow down the search space, because any itemset that has an infrequent subset must be infrequent. Finally, it prunes any frequent itemset that has a superset in the set of MFI, because any subset of a maximal frequent itemset is also frequent and needs no further consideration.

MinMax is a depth-first traversal algorithm, and an iterative one. It keeps a stack recording the search trace, which makes backtracking possible. Let P be a stack whose records, accessed through the stack pointer P, include three parts: head, tail and flag. Head is a frequent k-itemset; tail is the extension of head; flag corresponds to tail, with one bit of flag for each element in tail: if the corresponding element x in tail has been considered, flag(x) = 1; otherwise, flag(x) = 0. The initial status of stack P is: P.head = ∅, P.tail = F1 (ordered), and all the flag bits are 0. The algorithm has a main while loop with the ending condition: only one record in the stack and all the flag bits are 1. It returns the exact set of MFI. The pseudocode is shown in Appendix 1.

3.1 Representation of the Database

The database representation can be very important for efficiently generating and counting itemsets. Most previous algorithms use a horizontal row layout, with the database organized as a set of rows, each row representing a transaction. The alternative vertical column layout associates with each item x the set tidset(x) of transaction identifiers (tids). The vertical representation allows simple and efficient support counting. Mafia chose to use a vertical bitmap representation for the database. In a vertical bitmap, there is one bit for each transaction in the database: if item i appears in transaction j, then bit j of the bitmap for item i is set to 1; otherwise, the bit is set to 0. Let bitmap(X) be the vertical bitmap that represents the transaction set for itemset X, and bitmap(Y) that for itemset Y; then σ(X ∪ Y) is the number of 1-bits in bitwise-AND(bitmap(X), bitmap(Y)). But a vertical bitmap representation can be very sparse, especially at lower support levels. Since every transaction has a bit in vertical bitmaps, there are many zeros
since both the absence and the presence of an itemset in a transaction need to be represented. MinMax instead uses a tidset for each item, i.e., each item keeps the set of tids of all transactions in which it occurs. This makes it more efficient to compute the support of itemsets, since doing so involves only intersections of tidsets. In this way, σ(x) = |tidset(x)|. MinMax counts F1 and F2 efficiently as follows: F1 = {x : x ∈ I and |tidset(x)| ≥ minsup}, and F2 = {{x,y} : x, y ∈ F1, x ≠ y, and |tidset(x) ∩ tidset(y)| ≥ minsup}.

For x1, x2 ∈ F1, if σ(x1) > σ(x2), it means x1 would make more frequent itemsets than x2. Let us consider the capability of an element of F1 to make infrequent itemsets.

Definition 1: λ(x) is the capability of x to make infrequent itemsets: λ(x) = |{y : y ∈ F1 and {x,y} is infrequent}|.

For x1, x2 ∈ F1, if λ(x1) > λ(x2), it means x1 would make fewer frequent itemsets than x2. So it is more effective to sort F1 in decreasing order of λ to get narrower left branches in the search tree, and it is also helpful to sort in increasing order of σ. MinMax orders F1 by λ↓ σ↑ (λ decreasing, ties broken by σ increasing) to get slim left branches. This heuristic was first used in MaxMiner, and has been used in other methods since then.
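This ordering can be sketched in Python as follows (an illustrative fragment, reusing the tids mapping from the earlier sketch): F1 is computed from tidset sizes, λ from pairwise intersections, and F1 is then sorted by λ decreasing with ties broken by σ increasing.

from itertools import combinations

def order_f1(tids, minsup):
    F1 = [x for x in tids if len(tids[x]) >= minsup]
    lam = {x: 0 for x in F1}                     # lambda(x) counts
    for x, y in combinations(F1, 2):
        if len(tids[x] & tids[y]) < minsup:      # {x,y} is infrequent
            lam[x] += 1
            lam[y] += 1
    # sort by lambda decreasing, breaking ties by sigma increasing
    return sorted(F1, key=lambda x: (-lam[x], len(tids[x])))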
4 Comparison with GenMax

Consider our example database in Fig. 1. There are five different items, I = {A,B,C,D,E}, and six transactions, T = {1,2,3,4,5,6}; minsup = 3. The vertical representation and the corresponding support σ and capability of making infrequent itemsets λ are shown in Fig. 2. Thus F1 is ordered as {D,A,B,E,C}, and the result is MFI = {DEC, ABEC}. The search tree produced by MinMax is shown in Fig. 3, and that produced by GenMax in Fig. 4.

Fig. 1. Our example database
Fig. 2. The vertical layout
Fig. 3. The search tree by MinMax
Fig. 4. The search tree by GenMax

Let X be a maximal frequent itemset with |X| = n. GenMax explores Σ_{i=1}^{n} i nodes. In the best case, i.e., when MinMax backtracks n levels at once, MinMax explores only Σ_{i=1}^{n} i − Σ_{i=1}^{n−2} i = 2n − 1 nodes, so GenMax explores Σ_{i=1}^{n−2} i more nodes than MinMax. In the worst case, MinMax explores the same number of nodes as GenMax. In the average case, i.e., when MinMax backtracks n/2 levels, GenMax explores Σ_{i=1}^{n/2−2} i more nodes than MinMax; MinMax explores Σ_{i=1}^{n} i − Σ_{i=1}^{n/2−2} i = (3/8)n² + (5/4)n − 1 nodes.
5 Conclusions

We presented MinMax, an algorithm for mining maximal frequent itemsets. Our analysis shows that MinMax prunes the search space more strongly and efficiently than GenMax.
References
1. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A.I. Verkamo: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining, Chapter 12, AAAI/MIT Press, 1995.
2. R. Agrawal, C. Aggarwal, and V. Prasad: Depth First Generation of Long Patterns. In ACM SIGKDD Conf., Aug. 2000.
3. R.J. Bayardo: Efficiently mining long patterns from databases. In ACM SIGMOD Conf., June 1998.
4. D. Burdick, M. Calimlim, and J. Gehrke: MAFIA: a maximal frequent itemset algorithm for transactional databases. In Intl. Conf. on Data Engineering, Apr. 2001.
5. K. Gouda, M.J. Zaki: Efficiently Mining Maximal Frequent Itemsets. In 1st IEEE Intl. Conf. on Data Mining, Nov. 2001.
Appendix 1: The Pseudocode of MinMax

// MinMax mines the exact set of maximal frequent itemsets hidden in a transactional database
procedure MinMax(F1, MFI)
{ while NOT (only one record in P and all the flag bits are 1) do
  begin
    if P.flag still has some 0-bits left then
      x = the element of tail which corresponds to the leftmost 0 flag;
      currentnode = P.head ∪ {x};
      possibleextensions = {y : y ∈ P.tail and y > x};
      if currentnode ∪ possibleextensions has a superset in MFI then
        flag(x) = 1    // prune away the entire subtree rooted at currentnode
      else
        extensions = {x : x ∈ possibleextensions and currentnode ∪ {x} is frequent};
                       // avoid infrequent itemsets being extended
        if extensions == empty then
          if currentnode has no superset in MFI then
            MFI = MFI ∪ currentnode;    // got a new maximal frequent itemset
            x0 = x;
            while currentnode == P.head ∪ P.tail do    // back to the oldest ancestor
              { pop; x0 = P.head.last; }
            flag(x0) = 1
          else
            flag(x) = 1
          endif
        else
          push(currentnode, extensions, 0)
        endif
      endif
    else
      { x0 = P.head.last; pop; flag(x0) = 1 }    // P.head.last means the last element in head
    endif
  enddo
  return MFI
}
On Data Mining for Direct Marketing
Chuangxin Ou¹, Chunnian Liu¹, Jiajing Huang¹, and Ning Zhong²
¹ Beijing Municipal Key Lab. of Multimedia & Intelligent Software Tech., School of Computer Science, Beijing University of Technology, China
² Dept. of Information Eng., Maebashi Institute of Technology, Japan
Abstract. Direct marketing is a new business model based on interactive one-to-one communication between marketer and customer. There is great potential for data mining to make useful contributions to the marketing discipline for business intelligence. This paper provides an overview of recent developments in data mining applications for direct marketing.
1 Introduction
Direct marketing aims at obtaining and maintaining direct relations between suppliers and buyers within one or more product/market combinations. In marketing there are two main approaches to communication: mass marketing and direct marketing [4]. Mass marketing uses mass media such as print, radio and television to address the public without discrimination, while direct marketing identifies customers with potential market value by studying their characteristics and needs (past or future) and selects certain customers to promote. Direct marketing has become more and more popular because of increased competition and cost pressure. It is an important area of application for data mining, data warehousing, statistical pattern recognition, and artificial intelligence [6]. Although standard data mining methods may be applied for the purpose of direct marketing, many specific algorithms need to be developed and applied for direct marketers to make decisions effectively. This paper investigates the recent developments in data mining applications for direct marketing. Section 2 gives the process of direct marketing. Section 3 discusses the main problems of applying traditional data mining techniques to direct marketing. Section 4 introduces the target selection algorithms used in direct marketing. Section 5 provides the evaluation methods for direct marketing data mining methods. Finally, we conclude the paper in Section 6.
2 The Process of Direct Marketing
The process of data mining for direct marketing is a more specific version of the general data mining process, which is usually a multi-phase process involving numerous steps like data preparation, preprocessing, search for hypothesis generation, pattern formation, knowledge evaluation, representation, refinement, and
management. We observed that the process of data mining for direct marketing should have at least the following steps (see Fig. 1; a sketch follows the figure caption).

Data preparation and preprocessing: Collect data related to potential customers from multiple data sources. Such customer data includes all sales, promotion, and customer service activities that have occurred as a result of the customer's relationship with a company, as well as personal information such as name, address, age, sex, hobby, income, occupation, employment status, and marital status. The data are then transformed into the format a data mining system needs.

Finding patterns: Split the potential customer data into a training dataset and a testing dataset. Apply learning algorithms to the training dataset to find patterns, and evaluate the patterns on the testing dataset. If the patterns are not satisfactory, the process may iterate back to the previous step.

Promoting clients: Use the patterns to predict and promote the likely buyers.
Fig. 1. The process of direct marketing.
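As a rough illustration of these three steps, the following Python sketch wires them together. The learn routine and the scikit-learn-style score/predict_proba interface are assumptions, and the 70/30 split and top-20% cutoff are arbitrary illustrative choices.

import numpy as np

def direct_marketing_pipeline(X, y, prospects, learn, top_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    split = int(0.7 * len(X))                        # 70/30 train/test split
    train, test = idx[:split], idx[split:]
    model = learn(X[train], y[train])                # find patterns on training data
    test_score = model.score(X[test], y[test])       # evaluate before deployment
    scores = model.predict_proba(prospects)[:, 1]    # rank potential customers
    cutoff = np.quantile(scores, 1 - top_fraction)
    return np.where(scores >= cutoff)[0], test_score # promote the top scorers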
3 Problems in Direct Marketing
Although the traditional algorithms of data mining may be applied to direct marketing and provide a good background for our research, no satisfactory solutions have yet been found to real-world problems such as the following:
– The traditional algorithms cannot be used in the very low response rate situations of marketing databases.
– We observed that the class distribution of many real-world datasets is unbalanced, and the coverage of rules mined from such data is usually very low. Therefore, most existing learning algorithms do not work well on such datasets, and the mined results cannot be used to predict likely buyers.
– The predictive accuracy cannot be used as a suitable evaluation criterion for direct marketing. One reason is that classification errors must be dealt with differently. Another reason is that predictive accuracy is too weak for the purpose of customer targeting [4]: it provides no flexibility in choosing a certain percentage of likely buyers for promotion.
– Most of the methods only consider response. In practice, a direct marketer is more interested in maximizing profit than in maximizing response, and maximizing response is not always equal to maximizing profit: we may get more profit from low-probability responders than from high-probability ones.
– Most of the methods do not consider the different actions of responders. It is not realistic to expect every customer to respond in the same way. In practice, responses to the same promotion material differ: some responders may buy cheap goods, while a few may choose luxury goods, so the direct marketer gets different profits from the two kinds of responders.
– Existing learning algorithms may not work well on direct marketing datasets, since such data have a large number of variables (features) and enormous amounts of records.
– The features are difficult to select when they are numerous, because of the extremely unbalanced class distribution. For instance, we cannot get preferable features if we use information gain to select them: the original entropy is very low if the positive rate is only 1%, so the information gain values of the features are very close to one another, and it is difficult to find the key features. The RFM (Recency of the last purchase, Frequency of purchase, Monetary value) related features are always important ones, but they cannot be used to predict non-buyers, because we do not have RFM data for such customers.
4 Target Selection Algorithms for Direct Marketing
In order to solve the problems stated above, researchers have proposed target selection algorithms as the core technique for direct marketing. Such target selection algorithms can be divided into two major categories: segmentation models and response models [6].

4.1 Segmentation Models
Segmentation models divide individuals into groups or clusters on the basis of similarities in the characteristics or attributes that describe them [1]. Segments are as homogeneous as possible internally and as heterogeneous as possible between one another in response behavior [10]. The group with the highest probability of responding is selected for promotion. The most often-used segmentation techniques are cluster analyses such as AID, CHAID and CART. The result of these three methods is a decision tree, and
each node in the tree is a group in which every individual is homogeneous. In practice, the segmentation is often performed by calculating the RFM score and dividing the total list into different segments. Segmentation does not work well on datasets with extremely unbalanced class distributions, because the rate of positive instances is always low in marketing databases: it is difficult to construct a suitable tree, and such models may discover only unusable rules or patterns. Moreover, segmentation models cannot rank individuals: all clients in the same segment are treated alike.
4.2 Response Models
Response models calculate the response probability of each individual and then choose the highest-probability responders for promotion. In the following, we introduce several response models, including the Rough Set Model, the Logit/Probit Model, the Genetic Model, the Neural Network Model, and the Market Value Functions Model.
Rough Set Model. Rough set theory constitutes a sound basis for data mining. It offers useful tools to discover patterns hidden in data in many respects [5]. It can be used in different phases of the knowledge discovery process, such as attribute selection, attribute extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules). The ProbRough system, which combines the basic principles of rough-set-based rule induction systems with the flexibility of statistical techniques for classification, is a data mining system for rough classifier generation [9]. It can be used to analyze potential customers and predict purchases. The algorithm of rough classifier generation consists of two phases. The first phase is the global segmentation of the attribute space; it tries to minimize the average global cost of the decision-making. The second phase reduces the number of decision rules; in this phase the number of rules of the resultant classifier is minimized. The ProbRough algorithm behaves well on training data that involve redundant attributes. It can deal with two key problems: prior probabilities and unequal misclassification costs. The resultant rough classifiers are not sensitive to outliers in the data and can accept databases with noisy and inconsistent information. Users of the ProbRough system are not required to adhere to assumptions about the data or the model that is produced.
Logit/Probit Model. The logit and probit models can handle discrete response data and can be used to calculate the target score of each individual in marketing databases. The model assumes that every individual has a certain tendency to respond, r_t*, to a mailing received at time t (Eq. (1)). This tendency is influenced by the covariates X_t:

    r_t* = X_t β + ε_t    (1)
If the tendency r_t* is larger than 0, then we assume the individual will respond; otherwise the individual will not respond. The individuals whose predicted response probability is sufficiently high are selected for promotion. However, this formulation of the model has some drawbacks. Firstly, it assumes customers will spend the same amount of money on the same promotion material according to the response probability. Secondly, the two models only model response: they can predict the maximum response, but cannot predict the maximum profit. In order to solve this problem, Bult and Wansbeek added a cutoff point to the logit model so that only those individuals are selected for which profit is maximized [2]. The formula is defined as follows:

    Π_i = r·R_i − c    (2)
where Π_i is the profit for the firm generated by individual i, r is the revenue from a positive reply, c is the mailing cost, and R_i is a binary variable indicating response by individual i.

Genetic Model. The genetic model in direct marketing is used to build models that maximize the expected response and profit from solicitations. The genetic model referred to in this paper is the GMAX Model [8], which is a hybrid AI-statistical method. The GMAX Model uses genetic modeling as the optimization technique for direct marketing. Each model in the genetic modeling has an associated fitness value: a model with a higher fitness value solves the problem better than a model with a lower fitness value, and survives and reproduces at a higher rate. The advantage of the genetic model is that genetic models are assumption-free, robust, nonparametric models and perform well on both large and small samples [8]. They can be used to learn complex relationships. However, genetic models have some problems that need to be solved. The first is finding a fitness function so that the genetic model performs well on the dataset. The second problem is how to set the genetic model parameters: population size and the reproduction, crossover, and mutation probabilities. Sometimes, even with the correct parameter settings, genetic models cannot guarantee the optimal solution.

Neural Network Model. One advantage of neural networks is that they can adapt to the non-linearity in the data to capture complex relations. This is an important motivation for applying neural networks to target selection. The neural network sets a threshold, and only the individuals who score above the given threshold are given promotion materials. The complexity of the neural network depends on the degree of non-linearity of the problem to be solved. Generally, feed-forward neural networks are sufficient for target selection problems. Important parameters that determine the complexity of feed-forward neural networks are the number of hidden layers and the number of neurons in each layer. In target selection problems, a feed-forward network with one hidden layer can provide models with sufficient accuracy [7].
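One natural reading of Eq. (2), applicable to any response model that outputs response probabilities (including the threshold-based selection just described), is to mail individual i only when the expected profit p_i·r − c is positive. A minimal Python sketch, with illustrative names:

import numpy as np

def select_for_mailing(response_probs, r, c):
    """Expected profit per mailing is p_i * r - c; select where it is positive."""
    expected_profit = np.asarray(response_probs) * r - c
    return np.where(expected_profit > 0)[0]

For example, with r = 50 and c = 2, only individuals whose estimated response probability exceeds c/r = 0.04 are selected.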
If we use a feed-forward neural network with back propagation, we may meet the problem of local minima: because back propagation is a gradient-descent algorithm, chances are that it will end up in a local minimum instead of a global minimum. It is also difficult to initialize a neural network for a given problem. One way to solve this problem is to use a genetic algorithm to determine the initial weights of the neural network [1]. Neural networks are a useful technique for pattern discovery: they can model non-linear, complex relationships and are non-parametric and robust to noise. The neural network model gives us a target score for each client, so we can rank the clients by their scores and send promotion material to the high-probability clients. However, the neural network model also has some drawbacks: firstly, it generates a complex formula that is difficult to interpret; secondly, the neural network is difficult to configure.

Market Value Functions Model. The market value model proposes a linear model to solve the target selection problem of direct marketing by drawing on and extending results from information retrieval [11,12]. It is assumed that each object is represented by the values of a finite set of attributes. A market value function is a linear combination of utility functions on attribute values; it depends on two parts: utility functions and attribute weighting. Attribute weighting is based on information-theoretic measures: an attribute with a lower entropy value is more informative for prediction, while an attribute with larger entropy is less informative. The well-known Kullback-Leibler divergence measure offers one of the weighting formulas:

    ω_a^1 = Σ_{v∈V_a} Pr(v|P) log( Pr(v|P) / Pr(v) )    (3)

It measures the degree of deviation of the probability distribution Pr(·|P) from the distribution Pr(·). Furthermore, we can obtain the attribute weighting from both positive and negative instances:

    ω_a^2 = H_{P∪N}(a) − ( |P|/|P∪N| · H_P(a) + |N|/|P∪N| · H_N(a) )    (4)

In this case, we have three sub-populations, P, N and P ∪ N; H_P(a), H_N(a) and H_{P∪N}(a) denote the entropy value of attribute a in the three sub-populations, respectively. On the other hand, the estimations of the utility functions draw from probability models of information retrieval. The definitions of the utility functions are based on either only positive instances (Eq. (5)) or both positive and negative instances (Eq. (6)):

    u_a^1(v) = Pr(v|P) / Pr(v)    (5)
    u_a^2(v) = Pr(v|P) / Pr(v|N)    (6)

where P is called positive and N is called negative, v is an attribute value in the value set V_a of attribute a, Pr(v|P) is the conditional probability, and Pr(v) is the unconditional probability. A market value function is a linear combination of the attribute weighting and the utility functions, and can be denoted in the following form:

    r(x) = Σ_{a∈At} ω_a u_a(I_a(x))    (7)

where ω_a is the weight of attribute a, u_a(I_a(x)) is the utility function, and At is the finite nonempty set of attributes. We can use the linear market value function to calculate the target score of each individual. The market value function has some advantages: firstly, it can rank individuals according to their market value instead of merely classifying them; secondly, the market value function is interpretable; thirdly, the system can perform without expert knowledge.
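A minimal Python sketch of such a market value function, using the weighting of Eq. (3) and the utility of Eq. (5) and combining them as in Eq. (7). The probability estimation by frequency counts and the small smoothing floor are our added assumptions, not part of the paper; categorical attributes and a 0/1 class label are also assumed.

import numpy as np

def freq(column, values, eps=1e-9):
    """Pr(v) over a given value list, with a small floor to avoid log(0)."""
    return {v: max(np.mean(column == v), eps) for v in values}

def market_value_function(X, y):
    X, y = np.asarray(X, dtype=object), np.asarray(y)
    weights, utilities = [], []
    for a in range(X.shape[1]):
        values = sorted(set(X[:, a]))
        pv = freq(X[:, a], values)                 # Pr(v)
        pvP = freq(X[y == 1, a], values)           # Pr(v|P), positives only
        # Eq. (3): omega_a = sum_v Pr(v|P) log(Pr(v|P)/Pr(v))
        weights.append(sum(pvP[v] * np.log(pvP[v] / pv[v]) for v in values))
        # Eq. (5): u_a(v) = Pr(v|P)/Pr(v)
        utilities.append({v: pvP[v] / pv[v] for v in values})
    def r(x):                                      # Eq. (7): linear combination
        # unseen attribute values get a neutral utility of 1.0 (an assumption)
        return sum(w * u.get(v, 1.0) for w, u, v in zip(weights, utilities, x))
    return r

The returned function r assigns each individual a market value score, by which the whole list can be ranked before choosing how many to promote.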
5 Evaluation of the Learning Algorithms
Since the predictive accuracy cannot be used as a suitable evaluation criterion for the direct marketing process, as mentioned in the previous section, a new evaluation criterion needs to be developed. The main reason is that classification errors (false negatives, false positives) must be dealt with differently. So far, many evaluation methods other than predictive accuracy have been employed in direct marketing. Among them, decile analysis and the lift measure are two well-known ones. Decile analysis is a tabular display of model performance [8]. If the model has found the underlying regularities, we will see more responders in the top decile than in the bottom decile, and the cumulative response lift will be distributed in descending order in the table. From the decile table we can evaluate a learning algorithm. In the KDD-97-Cup competition, two measurements were used: one is the number of responders in the top decile; the second is the number of responders in the top four deciles. These are not a fully suitable way of measuring learning algorithms. Ling and Li proposed a solution to this problem [4]: they calculate a weighted sum of the items in the lift table as the evaluation. Assume the 10 deciles in the lift table are S_1, S_2, ..., S_10; then the lift index is defined as:

    S_lift = (1.0 × S_1 + 0.9 × S_2 + ... + 0.1 × S_10) / Σ_{i=1}^{10} S_i    (8)
The S_lift value shows how well the model solves the problem concerned. A higher S_lift value indicates that more responders are distributed in the top deciles than in the bottom deciles, and such a model is preferable to models with lower values.
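A minimal Python sketch of the lift index of Eq. (8), assuming scores are model outputs, responded is a 0/1 indicator with at least one responder, and deciles are formed by sorting individuals from best to worst score:

import numpy as np

def lift_index(scores, responded):
    order = np.argsort(scores)[::-1]               # best-scored individuals first
    deciles = np.array_split(np.asarray(responded)[order], 10)
    S = np.array([d.sum() for d in deciles])       # responders per decile S_1..S_10
    w = np.linspace(1.0, 0.1, 10)                  # weights 1.0, 0.9, ..., 0.1
    return float((w * S).sum() / S.sum())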
6 Concluding Remarks
Direct marketing is a new business model in which data mining and marketing databases are used for personalization and business intelligence. Although various individual data mining algorithms may be applied in this field, the specific issues of direct marketing require new integrated methodologies and systems to be developed for mining enormous amounts of data from multiple sources and for multi-aspect analysis. Furthermore, direct marketing must be expanded to Web-based direct marketing and to enhancing strategic customer relationship management.

Acknowledgments. This work is supported by the Natural Science Foundation of China (60173014) and the Beijing Municipal Natural Science Foundation (4022003).
References
1. David Shepard Associates: The New Direct Marketing, McGraw-Hill (1999)
2. J.J. Jonker, P.H. Franses, N. Piersma: Evaluating Direct Marketing Campaigns; Recent Findings and Future Research Topics. Erasmus Research Institute of Management (ERIM), Erasmus University Rotterdam, Discussion Paper 166 (2002)
3. W. Klosgen and J.M. Zytkow: Handbook of Data Mining and Knowledge Discovery, Oxford University Press (2002)
4. C.X. Ling and C. Li: Data Mining for Direct Marketing: Problems and Solutions, Proc. 4th International Conference on Knowledge Discovery and Data Mining (KDD'98) (1998) 73–79
5. Z. Pawlak: Rough Sets, Theoretical Aspects of Reasoning about Data, Kluwer (1991)
6. P. Van Der Putten: Data Mining in Direct Marketing Databases, W. Baets (ed.) Complexity and Management: A Collection of Essays, World Scientific (1999)
7. R. Potharst, U. Kaymak, W. Pijls: Neural Networks for Target Selection in Direct Marketing, K.A. Smith and J.N.D. Gupta (eds.) Networks in Business: Techniques and Applications, Idea Group Publishing (2001)
8. B. Ratner: Finding the Best Variables for Direct Marketing Models, Journal of Targeting, Measurement and Analysis for Marketing, Vol. 9 (2001) 270–296
9. D. Van den Poel and Z. Piasta: Purchase Prediction in Database Marketing with the ProbRough System, L. Polkowski and A. Skowron (eds.) Rough Sets and Current Trends in Computing, LNAI 1424 (1998) 593–600
10. M. Wedel and W.A. Kamakura: Market Segmentation: Conceptual and Methodological Foundations, Kluwer Academic Publishers (1999)
11. Y.Y. Yao and N. Zhong: Mining Market Value Functions for Targeted Marketing, Proc. 25th IEEE International Computer Software and Applications Conference (COMPSAC'01), IEEE Computer Society Press (2001) 517–522
12. Y.Y. Yao, N. Zhong, J. Huang, C. Ou, C. Liu: Using Market Value Functions for Targeted Marketing Data Mining, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 16, No. 8 (2002) 1117–1131
A New Incremental Maintenance Algorithm of Data Cube Hongsong Li, Houkuan Huang, and Youfang Lin Dept. of Computer Sci. and Tech., Northern Jiaotong University, Beijing, China, 100044 [email protected],[email protected],[email protected]
Abstract. A well-known challenge in data warehousing is the efficient incremental maintenance of the data cube in the presence of source data updates. In this paper, we present a new incremental maintenance algorithm developed from Mumick's algorithm. Instead of using one auxiliary delta table, we use two, to improve the efficiency of data updates. Moreover, when a materialized view has to be recomputed, we use the data of its smallest ancestral view, while Mumick uses the fact table, which is usually much larger than the view's smallest ancestor. We have implemented this algorithm and found a significant improvement in performance.
1 Introduction
The data cube maintenance problem is the problem of keeping the contents of the materialized views consistent with the contents of the base relations as the base relations are modified. Maintenance of a view may involve several steps, one of which brings the view table up-to-date; we call this step refresh. A view can be refreshed within the transaction that updates the base tables, or the refresh can be delayed. The former case is referred to as immediate view maintenance, while the latter is called deferred view maintenance. In this paper, maintenance means deferred maintenance. Algorithms that compute changes to a view in response to changes to the base relations are called incremental view maintenance algorithms [4]. In recent years there has been a significant amount of research devoted to the maintenance of views in data cubes. It typically focuses on how to update data and preserve consistency when the data cube has multiple data sources. On the other hand, papers focusing on how to make maintenance more efficient are relatively few. L. Colby et al. [2] split the maintenance work into propagate and refresh functions to minimize the batch time needed for maintenance. I.S. Mumick et al. [6] extend this and present a summary-delta algorithm to incrementally maintain pre-computed aggregates. Reference [5] evaluated the performance of three implementations of the view maintenance procedures. Reference [7] introduces a new index structure called the SB-tree to maintain materialized temporal aggregate views efficiently.
This paper makes two contributions. Firstly, we point out that if views are refreshed in a proper order, each of them can be re-computed more efficiently by visiting its parent view, which is usually much smaller than the base table(s). Secondly, we use two summary-delta tables instead of one; the two deltas, which store inserted and deleted data separately, can avoid a considerable amount of testing and re-computation.

The paper is organized as follows: the background and notation are given in Section 2. In Section 3 we present our algorithm (mainly the refresh functions for maintaining individual views) and explain why it is better. In Section 4 we conduct experiments to evaluate our algorithm, and conclusions are presented in Section 5.
2 Background and Notation
In this section we review the concepts of self-maintainable aggregate functions and the computation lattice corresponding to a data cube.

2.1 Self-Maintainable Aggregate Functions
According to J. Gray, aggregate functions can be classified into three categories: distributive, algebraic and holistic [3]. Distributive aggregate functions can be computed by partitioning their input into disjoint sets, aggregating each set individually, and then further aggregating the (partial) results from each set into the final result. COUNT(), MIN(), MAX() and SUM() are all distributive. An algebraic aggregate function can be expressed as a scalar function of distributive aggregate functions [6]. For example, Average() can be expressed as SUM()/COUNT(). Holistic aggregate functions cannot be computed by dividing the input into parts. Median() and MostFrequent() are common examples of holistic functions. There is no efficient way of computing super-aggregates of holistic functions using the standard GROUP BY. We will not consider holistic functions in this paper. A set of aggregate functions is self-maintainable if the new values of the functions can be computed solely from the old values of the aggregate functions and from the changes to the base data. Aggregate functions can be self-maintainable with respect to insertions, with respect to deletions, or both. All distributive aggregate functions are self-maintainable with respect to insertions. However, not all distributive aggregate functions are self-maintainable with respect to deletions. The COUNT(*) function can help to make certain aggregate functions self-maintainable with respect to deletions, by helping to determine when all tuples in the group (or in the full table) have been deleted, so that the grouped tuple can be deleted from the view. COUNT(): The function COUNT(*) is always self-maintainable with respect to deletions. With the help of COUNT(*), COUNT(E) (the count of non-null values of attribute E) can be made self-maintainable with respect to deletions.
SUM(): With the help of COUNT(*) and/or COUNT(E), SUM() can be made self-maintainable with respect to deletions. MIN() and MAX(): MIN() and MAX() are not self-maintainable with respect to deletions, and cannot be made self-maintainable. If COUNT(*) > 0 and COUNT(E) > 0, and a tuple having the minimum (maximum) value is deleted, we have to recompute it from a more detailed table. If the aggregate functions calculated in the view either are self-maintainable, or can be made self-maintainable, the view can be refreshed solely from the old view and from the changes to the view in response to changes to the base table. In other words, the view can be maintained incrementally.
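To make this concrete, here is a minimal Python sketch (our own illustration, not the paper's code) of one group-by row that maintains COUNT(*), COUNT(E), SUM(E) and MIN(E) incrementally; note how deleting the current minimum forces a recomputation from more detailed data, exactly as described above:

```python
class GroupAggregates:
    """One group-by row maintaining COUNT(*), COUNT(E), SUM(E), MIN(E)."""

    def __init__(self):
        self.count_all = 0   # COUNT(*)
        self.count_e = 0     # COUNT(E): non-null values only
        self.sum_e = 0       # SUM(E)
        self.min_e = None    # MIN(E)

    def insert(self, e):
        # All distributive functions are self-maintainable under insertion.
        self.count_all += 1
        if e is not None:
            self.count_e += 1
            self.sum_e += e
            self.min_e = e if self.min_e is None else min(self.min_e, e)

    def delete(self, e, detail_values):
        # COUNT and SUM are self-maintainable under deletion...
        self.count_all -= 1
        if e is not None:
            self.count_e -= 1
            self.sum_e -= e
            if self.count_e == 0:
                self.min_e = None    # group has no E values left
            elif e == self.min_e:
                # ...but MIN is not: recompute from a more detailed table.
                self.min_e = min(detail_values)

g = GroupAggregates()
for v in (5, 3, 8):
    g.insert(v)
g.delete(3, detail_values=[5, 8])  # deleting the minimum triggers recomputation
print(g.count_all, g.sum_e, g.min_e)  # 2 13 5
```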
2.2 Cube
The data cube [3] is a convenient way of thinking about multiple aggregate views, all derived from a base fact table using different sets of group-by attributes. The CUBE operator generalizes the standard GROUP BY operator to compute aggregates for each combination of GROUP BY attributes. In OLAP parlance, the grouping attributes are called dimensions, the attributes that are aggregated are called measures, and one particular GROUP BY (e.g., Product, Store) in a CUBE computation is sometimes called a cuboid, a view, or simply a group-by. If k dimensions are concerned in a hypercube, the hypercube has 2^k views in total. We can distinguish views by their differences in group-by attributes; namely, we can use the combination of dimensions to express a view (or a query) uniquely. In a CUBE, more detailed aggregates can be used to compute less detailed aggregates. This property induces a partial ordering (i.e., a lattice) on all of the group-bys of the CUBE. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no other group-by is between the parent and the child). If we remove some nodes of the lattice to represent the fact that the corresponding views are not being materialized, a partial lattice is obtained. The various dimensions represented by the group-by attributes of a fact table are often organized into dimension hierarchies. A dimension hierarchy can also be represented by a lattice, similar to a data cube lattice. The bottom element of each lattice is 'none', meaning no grouping by that dimension. If the dimensions in the cube have hierarchies, the relationship among the views of the cube is more complicated. However, it is still a lattice.
3 Maintenance Algorithm
Data warehouses are typically maintained in deferred mode, with the source changes received during the day applied to the views in a nightly batch window. The work involves updating the base tables (if any) stored at the warehouse, and maintaining all the materialized views. During that time the warehouse is unavailable to readers. The main aim of maintenance is to minimize the batch window.
A general strategy is to split the maintenance work into two phases: propagate and refresh [6]. Each materialized view has its own delta table, which stores the net change to the view due to the change to the fact table. In the propagate phase, we propagate the source changes to the delta tables. Then in the refresh phase, each materialized view is refreshed individually by applying its delta tables. Propagation can take place without locking the materialized views, so the warehouse can continue to be available for querying by clients. Materialized views are not locked until the refresh phase, during which the materialized view is updated from the delta tables. So the propagate phase can occur outside the batch window. Our algorithm is similar to that of reference [6], and the difference mainly lies in the fact that each materialized view has two delta tables in our algorithm, while it has only one in the algorithm of reference [6]. In our algorithm, the delta table storing insertion data is called the insertion delta table and the delta table storing deletion data is called the deletion delta table. In the algorithm of reference [6], the single delta table stores the union of insertion and deletion data. It should be noted that the maintenance operations on the data cube may differ from the operations on the data sources. For example, when the operation on the data sources is a deletion, the tuple is not physically deleted from the data cube, because older versions might be needed by readers. Instead, a new version of the tuple is produced.
3.1 Propagate
Since multiple cube views can be arranged into a (partial) lattice and views can be computed from more detailed views, delta tables can also be computed from more detailed delta tables. Therefore the change data on the fact table, which is usually much larger than any delta table, is not necessarily needed, so the time used to compute the delta tables can be significantly shortened. The relation among all delta tables is also a partial lattice, called the delta-lattice. Each delta table can be derived from a delta table above it in the partial lattice by a join with any annotated dimension table, followed by a simple group-by operation. Propagation of changes to multiple delta tables involves computing all the delta tables in the delta-lattice. The problem now is how to compute the delta-lattice efficiently, since there are possibly several choices of ancestor delta tables from which to compute a delta. It turns out that this problem maps directly to the problem of computing multiple views from scratch, as addressed in [1].
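As a rough illustration (our own sketch, not the paper's code), a coarser delta table can be derived from a finer one by re-aggregating it under the coarser group-by, assuming a simple dict-based representation of delta tables:

```python
from collections import defaultdict

def derive_delta(fine_delta, keep):
    """Derive a coarser delta table from a finer one by a group-by.

    fine_delta maps (dim1, dim2, ...) keys to (count, sum) pairs;
    keep lists the positions of the dimensions retained in the child delta.
    """
    coarse = defaultdict(lambda: [0, 0])
    for key, (cnt, sm) in fine_delta.items():
        child_key = tuple(key[i] for i in keep)
        coarse[child_key][0] += cnt
        coarse[child_key][1] += sm
    return dict(coarse)

# Delta over (product, store) rolled up to a delta over (product,) alone.
fine = {("p1", "s1"): (2, 30), ("p1", "s2"): (1, 10), ("p2", "s1"): (3, 45)}
print(derive_delta(fine, keep=[0]))  # {('p1',): [3, 40], ('p2',): [3, 45]}
```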
3.2 Refresh
The refresh phase can begin as soon as propagation has finished. If G is a directed acyclic graph, then a topological order for G is a sequential listing of all the vertices in G such that, for all vertices v, w in G, if there is an edge from v to w, then v precedes w in the sequential listing.
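For instance, a topological order of the view lattice can be computed with Kahn's algorithm; the following is a minimal sketch under an adjacency-list representation (names are ours, not the paper's):

```python
from collections import deque

def topological_order(children, all_views):
    """Kahn's algorithm: parents come before children in the result."""
    indegree = {v: 0 for v in all_views}
    for parent in children:
        for child in children[parent]:
            indegree[child] += 1
    queue = deque(v for v in all_views if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for child in children.get(v, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

# Tiny cube lattice: (product, store) is the parent of (product,) and (store,).
lattice = {("product", "store"): [("product",), ("store",)],
           ("product",): [()], ("store",): [()]}
views = [("product", "store"), ("product",), ("store",), ()]
print(topological_order(lattice, views))
```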
Since a lattice is a directed acyclic graph, all views are refreshed after their parent views if the views in the cube are refreshed in topological order. In our algorithm, we maintain the cube in topological order so that the parent views are "clean" and can be used in case of recomputation. In the remainder of this section, we discuss how to refresh a single materialized view. The functions to refresh the base table are rather easy and will not be discussed. For each materialized view, the refresh process is composed of two functions: one function refreshes the view with the insertion delta table, and the other refreshes it with the deletion delta table. The refresh-insertion function applies the changes represented in the insertion delta table to the view. Each tuple in the insertion delta table causes a change to a single corresponding tuple in the view. The corresponding tuple in the view is either updated, or, if the corresponding tuple is not found, the insertion delta tuple is inserted into the view. Since all distributive aggregate functions are self-maintainable with respect to insertions, and all algebraic aggregate functions can be expressed as scalar functions of distributive aggregate functions, there is no need for recomputation.

FUNCTION refresh_insertion
BEGIN
  FOR each tuple delta_t in insertion delta table DO
  BEGIN
    Let t = tuple in the view having the same values
            for its group-by attributes as delta_t;
    IF t does not exist THEN
      Insert delta_t into the view;
    ELSE
      t.count = t.count + delta_t.count;
      t.count(e) = t.count(e) + delta_t.count(e);
      t.sum = t.sum + delta_t.sum;
      t.avg = t.sum / t.count(e);
      t.max = MAX(t.max, delta_t.max);
      t.min = MIN(t.min, delta_t.min);
    END IF;
  END /*FOR*/
END
The refresh-deletion function applies the changes represented in the deletion delta table to the view. Each tuple in the deletion delta table causes a change to a single corresponding tuple in the view. The corresponding tuple in the view is either updated or deleted. Although some aggregate functions (such as SUM(), AVG()) can be (or can be made) self-maintainable with respect to deletions, other aggregate functions (such as MAX(), MIN()) are not self-maintainable. In that case, a recomputation may occur. For the MIN function, e.g., if a value equal to the minimum value was deleted, we have to execute a recomputation, unless the new value of COUNT(E) is zero, which means all the corresponding tuples in the base table were deleted and the new value should be set to null.
It should be noted that the algorithm in reference [6] cannot tell whether an extremum in its single delta table is to be inserted or deleted. Therefore, it has to execute a recomputation even when the value is actually to be inserted and the recomputation is unnecessary. If a recomputation is inevitable, the new value can be recomputed by using the parent of the view, while the base table is used in reference [6]. As we discussed before, the parent view will have been refreshed by then if we maintain the cube in topological order. Since the base table is usually much larger than the parent view, it is clear that our algorithm should perform better than the one in reference [6].

FUNCTION refresh_deletion
BEGIN
  FOR each tuple delta_t in deletion delta table DO
  BEGIN
    Let t = tuple in the view having the same values
            for its group-by attributes as delta_t;
    IF t.count = delta_t.count THEN
      Delete t from the view;
    ELSE
      t.count = t.count - delta_t.count;
      t.count(e) = t.count(e) - delta_t.count(e);
      IF t.count(e) = 0 THEN
        t.sum = t.avg = t.max = t.min = NULL;
      ELSE
        t.sum = t.sum - delta_t.sum;
        t.avg = t.sum / t.count(e);
        IF t.max = delta_t.max OR t.min = delta_t.min THEN
          re-compute t from the parent view;
        END IF;
      END IF;
    END IF;
  END; /*FOR*/
END
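For readers who prefer runnable code, here is a compact Python rendering of refresh_deletion (a sketch under our own data layout, not the authors' implementation); the view and its parent are dicts keyed by group-by values:

```python
def refresh_deletion(view, deletion_delta, parent_view, child_key):
    """Apply a deletion delta to a view; recompute MIN/MAX from the parent.

    view/parent_view map group-by keys to dicts with count, count_e, sum,
    avg, max, min; child_key maps a parent key to the view's key.
    """
    for key, d in deletion_delta.items():
        t = view[key]
        if t["count"] == d["count"]:
            del view[key]                      # the whole group disappeared
            continue
        t["count"] -= d["count"]
        t["count_e"] -= d["count_e"]
        if t["count_e"] == 0:
            t.update(sum=None, avg=None, max=None, min=None)
        else:
            t["sum"] -= d["sum"]
            t["avg"] = t["sum"] / t["count_e"]
            if t["max"] == d["max"] or t["min"] == d["min"]:
                # An extremum was deleted: recompute from the (already
                # refreshed) parent view rather than from the base table.
                rows = [p for k, p in parent_view.items() if child_key(k) == key]
                t["max"] = max(p["max"] for p in rows)
                t["min"] = min(p["min"] for p in rows)
```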
4 Experimental Results
In this section, we study the performance of our algorithm and demonstrate its effectiveness for data cube maintenance. We first introduce the test datasets. In the second and third experiments we maintain the data in both a data cube and its base table, while in the first experiment we only refresh one view in the data cube. There are six hierarchical dimensions and eight aggregations in the data cube. The aggregate functions include sum(), max(), min(), avg(), count(*), count(e), and so on. There are 1960 logical views and 86 materialized views in total. The number of rows in the Top-View is approximately half of that in the base table. In the first experiment, we compare our algorithm with the algorithm presented by Mumick in reference [6], which is considered one of the most efficient
Table 1. Comparison with Mumick's algorithm

Rows of        Rows of       Mumick's algorithm                 Our algorithm
insertion data deletion data Re-computations  Refresh (secs.)   Re-computations  Refresh (secs.)
3687           0             257              35                0                4
3657           717           209              34                31               12
3656           701           191              28                25               10
3635           712           317              44                33               10
3606           704           315              42                27               8
3651           3620          1720             197               851              133
3649           3613          1222             122               636              79
3661           3639          995              120               540              67
3619           3651          927              97                555              63
719            3650          578              69                553              67
740            3619          540              62                490              59
728            3608          491              59                449              53
717            3633          448              54                402              49
0              3608          391              42                391              37
algorithms [5]. Since the two algorithms have a similar mechanism for maintaining multiple views, we compare the number of re-computations and the time consumed in refreshing a single materialized view. In the experiment, both Mumick's algorithm and ours execute recomputation by visiting the parent view to make the result analysis easier, while the 'pure' Mumick's algorithm uses the base table. The results indicate that, in all instances, our algorithm is faster than Mumick's, especially when insertion data dominate the refreshing data. They also imply that the time used to refresh the view is greatly affected by the number of re-computations. Our algorithm performs better mainly because it can reduce the number of re-computations and need not judge whether a re-computation is necessary or not. In the second experiment (Fig. 1), we maintained the whole data cube and its base table with fixed-size refreshing data. In the course of the experiment, the sizes of the data cube and base table kept increasing. In the third experiment (Fig. 2), we maintained them with refreshing data of different sizes. The results show that the time used to maintain the cube is linear and acceptable.
5 Conclusions
In this paper, we have presented a new incremental maintenance algorithm for data cubes, which uses two types of auxiliary delta tables. We present one refresh function for each type of delta table. The approach of two delta tables avoids unnecessary re-computations and removes the need to judge the self-maintainability of insertion data. Moreover, when a view has to be recomputed, the algorithm visits its parent view instead of the base table. These two points are the main contributions of this paper.
Fig. 1. Experiment with fixed-size refreshing data (about 123,000 rows changed in the base table every time; time for maintenance in seconds vs. the length of the base table in rows)

Fig. 2. Experiment with a fixed-size data cube (time for maintenance in seconds vs. changed base data in rows)
Experimental results on several changed data sets with varying characteristics show that our algorithm is faster than the algorithm in reference [6], especially when insertion data dominate the changed data (it is often the case that changes to a warehouse involve only insertion data [6]). Experimental results also indicate that the time for maintaining the data cube is proportional to the size of the changed data and the size of the data cube itself.
References
1. S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. VLDB'96, pages 506–521, 1996.
2. L. Colby, T. Griffin, L. Libkin, I. Mumick, and H. Trickey. Algorithms for deferred view maintenance. In Proc. of SIGMOD'97, pages 469–480, 1997.
3. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In Proc. 12th ICDE, pages 152–159, New Orleans, March 1996.
4. A. Gupta and I. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Engineering Bulletin, 18(2), 1995.
5. W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J. Widom. Performance issues in incremental warehouse maintenance. Technical report (Oct. 1999), Stanford University Database Group.
6. I. Mumick, D. Quass, and B. Mumick. Maintenance of data cubes and summary tables in a warehouse. In Proc. of the ACM Int. Conf. on Management of Data, 1997.
7. J. Yang and J. Widom. Incremental computation and maintenance of temporal aggregates. In Proceedings of the 17th International Conference on Data Engineering, pages 51–60, Heidelberg, Germany, 2001.
Data Mining for Motifs in DNA Sequences

D.A. Bell 1 and J.W. Guan 1,2

1 School of Computer Science, The Queen's University of Belfast, BT7 1NN, Northern Ireland, U.K.
{da.bell,[email protected]
2 College of Computer Science and Technology, Jilin University, 130012, Changchun, P.R. China
Abstract. In the large collections of genomic information accumulated in recent years there is potentially significant knowledge for exploitation in medicine and in the pharmaceutical industry. One interesting approach to the distillation of such knowledge is to detect strings in DNA sequences which are very repetitive within a given sequence (eg for a particular patient) or across sequences (eg from different patients who have been classified in some way, eg as sharing a particular medical diagnosis). Motifs are strings that occur relatively frequently. In this paper we present basic theory and algorithms for finding such frequent and common strings. We are particularly interested in strings which are maximally frequent and, having discovered very frequent motifs, we show how to mine association rules by an existing rough sets based technique. Further work and applications are in process.
Keywords: Rough Sets, Data Mining, Knowledge Discovery in Databases, Gene Expression, Bioinformatics
Introduction

A potentially very important aspect of diagnoses and treatments for a patient is to take careful account of patterns in the DNA sequences in a group of genes. One way of contributing to the solutions of problems in this area is to gain clear insights into gene structure using data mining, the computer-based technique of discovering interesting, useful, and previously unknown patterns from massive databases (Frawley, Piatetsky-Shapiro & Matheus 1991) — such as those generated in gene expression. Large amounts of data are being accumulated in biological and genomic information systems. Making full use of this data to gain useful insights into, for example, health issues, presents a tremendous challenge and opportunity. Discovering the similarity between DNA sequences can lead to significant understanding in bioinformatics (Kiem & Phuc 2000). As observed by Kiem and Phuc (Kiem & Phuc 2000), in a group of genes (DNA sequences) motifs can be considered as phrases of a collection of documents. So text mining techniques (eg Feldman et al 1997, 1998, 1998a; Landau et al 1998) can be used for finding motifs and discovering knowledge about motifs, and ultimately about ailments and treatments.
Mining sequential patterns is an attractive and interesting issue, and there are various and extensive areas of exploration and application to which it is related. In this paper we develop theory and algorithms for discovering motifs in the special application area of DNA sequences systematically. Incidentally, some existing results for other applications can be extended to this area. For example, Srikant and Agrawal (1995-1996) have addressed and solved a general problem of finding maximal sequences from large datasets; Feldman et al (1997-1998) have investigated maximal association rules for mining for keyword co-occurrences in large document collections and proposed an integrated visual environment for text mining; etc. Here we extend a method which was developed by Kiem and Phuc for specific application to DNA sequences.
The paper is organised as follows. Section 1 introduces the concept of DNA sequences, and the definition of frequent motifs in DNA sequences is presented in Section 2. Much of this introduction follows the presentation of Kiem and Phuc (Kiem & Phuc 2000). We develop and refine their work in subsequent sections. Theory and an algorithm for discovering motifs in DNA sequences are proposed in Section 3, and Section 4 shows how associations between motifs in DNA sequences can be discovered using rough sets techniques (Pawlak 1991, Bell & Guan 1998, Guan & Bell 1998).
1 DNA Sequences
DNA sequences are formed by the nucleotides A, C, G, T. That is, each DNA sequence is a non-empty string u = S_1 S_2 ... S_N, where S_n ∈ {A, C, G, T} for n = 1, 2, ..., N. As usual, we denote the length of the DNA sequence u by |u|; here |u| = N > 0.
Theorem 1.1. The number of DNA sequences of length N is 4^N.
Consider two DNA sequences u, v. We say u contains v, and denote u ⊇ v, if v is a sub-string of u. Given a DNA sequence u, the set of non-empty sub-strings of u is denoted by L(u), which is equal to {v | v ⊆ u, |v| > 0}.
Theorem 1.2. |L(u)| ≤ (4/3)(4^|u| − 1).
Proof. According to Theorem 1.1, we know that |L(u)| ≤ 4^1 + 4^2 + 4^3 + ... + 4^|u| = (1/3)(4^(|u|+1) − 4) = (4/3)(4^|u| − 1). The proof is completed.
Let U be a set of DNA sequences, U = {u_1, u_2, ..., u_k, ..., u_K}, K = |U|. Denote L = max(|u_1|, |u_2|, ..., |u_k|, ..., |u_K|). Considering the union L(U) = ∪_{k=1}^{K} L(u_k) = ∪_{k=1}^{K} {v | v ⊆ u_k, |v| > 0}, we have the following.

2 Frequent Motifs

Theorem 2.1. Let U be a set of DNA sequences, U = {u_1, u_2, ..., u_k, ..., u_K}, K = |U|, L = max(|u_1|, |u_2|, ..., |u_k|, ..., |u_K|). Then |L(U)| ≤ (4/3)(4^L − 1).
Proof. According to Theorem 1.1, we know that |L(U)| ≤ 4^1 + 4^2 + 4^3 + ... + 4^L = (4/3)(4^L − 1). The proof is completed.
For a DNA sequence w ∈ L(U), the set of strings in U containing w is {u ∈ U | u ⊇ w}, denoted by w_U. So |w_U| ≤ |U| = K. Checking whether a u ∈ U contains w or not needs L comparisons at most. So computing the set w_U = {u ∈ U | u ⊇ w} for a given w needs at most KL = |U|L comparisons.
Given a threshold τ > 0, a DNA sequence w ∈ L(U) is called a τ-frequent motif if |w_U|/|U| ≥ τ. The time complexity for checking |w_U|/|U| ≥ τ for a given w is O(KL + 3) = O(KL).
A DNA sequence w ∈ L(U) is a 1-frequent motif when |w_U| = |U|, i.e., w ⊆ u for every u ∈ U. A 1-frequent motif is also called a fully-frequent motif.
The set of τ-frequent motifs with length l is denoted by M(U; τ; l) = {w | w ∈ L(U), |w_U|/|U| ≥ τ, |w| = l}, where l ≤ L. For example, for τ = 1/|U| the set of 1/|U|-frequent motifs with length l is M(U; 1/|U|; l) = {w | w ∈ L(U), |w_U| ≥ 1, |w| = l} = {w | w ∈ L(U), in U there is at least one u such that u ⊇ w, |w| = l}. For τ = 2/|U|, the set of 2/|U|-frequent motifs with length l is M(U; 2/|U|; l) = {w | w ∈ L(U), |w_U| ≥ 2, |w| = l} = {w | w ∈ L(U), in U there are at least two u such that u ⊇ w, |w| = l}.
For τ = 1, the set of fully-frequent motifs with length l is M(U; 1; l) = {w | w ∈ L(U), |w_U| = |U|, |w| = l} = {w | w ∈ L(U), w ⊆ u for every u ∈ U, |w| = l}.
Let M(U; τ) = ∪_{l=1}^{L} M(U; τ; l). We say that a w ∈ M(U; τ) is a maximal τ-frequent motif in M(U; τ) if there is no τ-frequent motif w' ∈ M(U; τ) other than w such that w' ⊇ w.
Example 2.1. (Kiem & Phuc 2000) Let U = {u1, u2, u3, u4}, where
u1 = ACGTAAAAGTCACACGTAGCCCCACGTACAGT,
u2 = CGCGTCGAAGTCGACCGTAAAAGTCACACAGT,
u3 = GGTCGATGCACGTAAAATCAGTCGCACACAGT,
u4 = ACGTAAAAGTAGCTACCCGTACGTCACACAGT.
Then K = 4, |u1| = |u2| = |u3| = |u4| = 32 = L, and
ACGTAAAA_U = {u1, u3, u4}, CCGTAAAA_U = {u2}, CACACAGT_U = {u2, u3, u4}, CGTAAAA_U = {u1, u2, u3, u4}, CCC_U = {u1, u4}, GTC_U = {u1, u2, u3, u4}, TCA_U = {u1, u2, u3, u4}, CACA_U = {u1, u2, u3, u4}, ACAGT_U = {u1, u2, u3, u4}, AAAA_U = {u1, u2, u3, u4}.
Then:
ACGTAAAA is a maximal 3/4-frequent motif with length 8,
CCGTAAAA is not a maximal 1/4-frequent motif with length 8 since CCGTAAAA < u2,
CACACAGT is a maximal 3/4-frequent motif with length 8,
CGTAAAA is a maximal 1-frequent motif with length 7,
CCC is a maximal 1/2-frequent motif with length 3,
GTC is a maximal 1-frequent motif with length 3,
TCA is a maximal 1-frequent motif with length 3,
CACA is not a maximal 3/4-frequent motif since CACA < CACACAGT,
ACAGT is not a maximal 3/4-frequent motif since ACAGT < CACACAGT,
AAAA is not a maximal 1-frequent motif since AAAA < CGTAAAA.

3 Theory and Algorithm for Finding Motifs

The idea to find maximal motifs is the following:
Algorithm Appending.
1. Find all singleton motifs M(frequency-threshold, length) for A, C, G, T from all DNA sequences.
2. For all motifs M(frequency-threshold, l) of length l, keep adding in singletons — check the occurrence/frequency of motifs of length l + 1; if frequent enough, add them to M(frequency-threshold, l + 1); if a motif does not occur in M(frequency-threshold, l + 1), it is also maximal.
Now, let us propose the following theory, which tells us how to find motifs and maximal τ-frequent motifs.
The first fact we need to show is the following: if w ∈ {A, C, G, T} and |w_U|/|U| ≥ τ then w ∈ M(U; τ; 1). This fact allows us to find M(U; τ; 1).
Example 3.1. Continuing Example 2.1, using this fact we find that M(U; 1/4; 1) = M(U; 3/4; 1) = M(U; 1/2; 1) = M(U; 1; 1) = {A, C, G, T}.
The second fact we need to show is the following: let w = uv be the concatenation of two strings u and v, |w| = |u| + |v|, |u|, |v| > 0. If w ∈ M(U; τ; |w|) then u ∈ M(U; τ; |u|) and v ∈ M(U; τ; |v|). Conversely, if u ∈ M(U; τ; |u|) and v ∈ M(U; τ; |v|), and |w_U|/|U| ≥ τ, then w ∈ M(U; τ; |w|).
In fact, it can be shown that: (Necessity.) If w ∈ M(U; τ; |w|) then |w_U|/|U| ≥ τ. From w = uv and u, v < w we have u_U, v_U ⊇ w_U and |u_U|, |v_U| ≥ |w_U|. Thus |u_U|/|U|, |v_U|/|U| ≥ |w_U|/|U| ≥ τ. So u ∈ M(U; τ; |u|) and v ∈ M(U; τ; |v|). (Sufficiency.) Conversely, if |w_U|/|U| ≥ τ then w ∈ M(U; τ; |w|).
This fact allows us to find w ∈ M(U; τ; |w|) from u ∈ M(U; τ; |u|) and v ∈ M(U; τ; |v|) with w = uv by checking |w_U|/|U| ≥ τ.
The third fact we need to show is the following, which is a special case of the second one:
(1. Right Concatenation by a Nucleotide.) Let w = uv, |u| = |w| − 1, |v| = 1. If w ∈ M(U; τ; |w|) then u ∈ M(U; τ; |w| − 1) and v ∈ M(U; τ; 1). Conversely, if u ∈ M(U; τ; |w| − 1) and v ∈ M(U; τ; 1), and |w_U|/|U| ≥ τ, then w ∈ M(U; τ; |w|).
(2. Left Concatenation by a Nucleotide.) Let w = vu, |u| = |w| − 1, |v| = 1. If w ∈ M(U; τ; |w|) then u ∈ M(U; τ; |w| − 1) and v ∈ M(U; τ; 1). Conversely, if u ∈ M(U; τ; |w| − 1) and v ∈ M(U; τ; 1), and |w_U|/|U| ≥ τ, then w ∈ M(U; τ; |w|).
This fact allows us to find w ∈ M(U; τ; |w|) from u ∈ M(U; τ; |u|) and v ∈ M(U; τ; |v|) with w = uv, |u| = |w| − 1 ≥ 1, |v| = 1 (Right Concatenation) or w = vu, |u| = |w| − 1 ≥ 1, |v| = 1 (Left Concatenation) by checking |w_U|/|U| ≥ τ.
Example 3.2. Continuing Example 3.1, from M(U; 3/4; 1) = {A, C, G, T} we find that
M(U; 3/4; 2) = {AA, AC, AG, CA, CC{u1, u2, u4}, CG, GC, GT, TA, TC},
M(U; 3/4; 3) = {AAA, AAG{u1, u2, u4}, ACA, ACG{u1, u3, u4}, AGT, CAC, CAG, CGT, GTA, GTC, TAA, TCA},
M(U; 3/4; 4) = {AAAA, AAAG{u1, u2, u4}, AAGT{u1, u2, u4}, ACAC, ACAG, ACGT{u1, u3, u4}, AGTC{u1, u2, u3}, CACA, CAGT, CGTA, GTAA, GTCA{u1, u2, u4}, TAAA, TCAC{u1, u2, u4}},
M(U; 3/4; 5) = {AAAAG{u1, u2, u4}, AAAGT{u1, u2, u4}, ACACA{u2, u3, u4}, ACAGT, ACGTA{u1, u3, u4}, CACAC, CACAG{u2, u3, u4}, CGTAA, GTAAA, GTCAC{u1, u2, u4}, TAAAA, TCACA{u1, u2, u4}},
M(U; 3/4; 6) = {AAAAGT{u1, u2, u4}, ACACAG{u2, u3, u4}, ACGTAA{u1, u3, u4}, CACACA{u2, u3, u4}, CACAGT{u2, u3, u4}, CGTAAA, GTAAAA, GTCACA{u1, u2, u4}, TAAAAG{u1, u2, u4}, TCACAC{u1, u2, u4}},
M(U; 3/4; 7) = {ACACAGT{u2, u3, u4}, ACGTAAA{u1, u3, u4}, CACACAG{u2, u3, u4}, CGTAAAA, GTAAAAG{u1, u2, u4}, GTCACAC{u1, u2, u4}, TAAAAGT{u1, u2, u4}},
M(U; 3/4; 8) = {ACGTAAAA{u1, u3, u4}, CACACAGT{u2, u3, u4}, CGTAAAAG{u1, u2, u4}, GTAAAAGT{u1, u2, u4}},
M(U; 3/4; 9) = {CGTAAAAGT{u1, u2, u4}},
M(U; 3/4; 10) = {}.
Theorem 3.1. Suppose that u ∈ M(U; τ; |u|). If |(uv)_U|/|U| < τ and |(vu)_U|/|U| < τ (both Right and Left Concatenations) for all v ∈ M(U; τ; 1), then u ∈ M(U; τ; |u|) is maximal.
Proof. Assume that u ∈ M(U; τ; |u|) is not maximal. Then there exists a w ∈ M(U; τ) such that w > u. Then by the necessity of the second fact we can assume that w = vu or w = uv such that w ∈ M(U; τ; |w|) and v ∈ M(U; τ; |v|) for some v, |v| = 1. And this means that |(uv)_U|/|U| ≥ τ or |(vu)_U|/|U| ≥ τ (either Right or Left Concatenation) for some v ∈ M(U; τ; 1). This is a contradiction. Thus, u ∈ M(U; τ; |u|) is maximal. The proof is completed.
This theorem allows us to find a maximal u ∈ M(U; τ; |u|) by checking |(uv)_U|/|U| < τ and |(vu)_U|/|U| < τ (both Right and Left Concatenations) for all v ∈ M(U; τ; |v|) such that |v| = 1.
Example 3.3. Continuing Example 3.2, we have MaxM(U; 3/4) = {CC{u1, u2, u4}, GC, AGTC{u1, u2, u3}, GTCACAC{u1, u2, u4}, ACGTAAAA{u1, u3, u4}, CACACAGT{u2, u3, u4}, CGTAAAAGT{u1, u2, u4}}.
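Before moving on to associations, it may help to see Algorithm Appending in executable form. The following Python sketch is our own illustration (names and data layout are hypothetical, not the authors' code): it grows τ-frequent motifs by one-nucleotide left/right concatenation and, following Theorem 3.1, marks a motif as maximal when none of its one-letter extensions stays frequent.

```python
def occurrence_set(w, U):
    """w_U as a set of indices: the sequences in U that contain w."""
    return frozenset(i for i, u in enumerate(U) if w in u)

def frequent_motifs(U, tau):
    """Algorithm Appending: return (all tau-frequent motifs, maximal ones)."""
    K = len(U)
    motifs, maximal = {}, set()
    # Step 1: singleton motifs over the alphabet A, C, G, T.
    current = {w: occurrence_set(w, U) for w in "ACGT"}
    current = {w: s for w, s in current.items() if len(s) / K >= tau}
    while current:
        motifs.update(current)
        nxt = {}
        for w in current:
            # Step 2: append a nucleotide on the right or on the left.
            for v in "ACGT":
                for cand in (w + v, v + w):
                    s = occurrence_set(cand, U)
                    if len(s) / K >= tau:
                        nxt[cand] = s
            # Theorem 3.1: w is maximal if no one-letter extension is frequent.
            if not any(c in nxt for v in "ACGT" for c in (w + v, v + w)):
                maximal.add(w)
        current = nxt
    return motifs, maximal

U = ["ACGTAAAAGTCACACGTAGCCCCACGTACAGT",
     "CGCGTCGAAGTCGACCGTAAAAGTCACACAGT",
     "GGTCGATGCACGTAAAATCAGTCGCACACAGT",
     "ACGTAAAAGTAGCTACCCGTACGTCACACAGT"]
_, maximal = frequent_motifs(U, 3/4)
print(sorted(maximal, key=len))  # includes ACGTAAAA, CACACAGT, CGTAAAAGT
```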
4 Discovering Associations between Motifs
When we have identified motifs using the simple "appending" algorithm, the question arises: can we find associations between motifs? First of all, we should point out that fully-frequent motifs for τ = 1 mean there is only one class present, and so there is no classification on the whole set of DNA sequences U. So in this case we do not need to use rough set theory to mine for association rules, since the theory is based on classification. We have a theorem for fully-frequent motifs immediately.
Theorem 4.1. All fully-frequent motifs co-occur in every DNA sequence.
Proof. For τ = 1, the set of fully-frequent motifs is M(U; 1) = {w | w ∈ L(U), w ⊆ u for every u ∈ U}. The proof is completed.
Example 4.1. From Example 2.1, it can be found that M(U; 1) = ∪_{i=1}^{7} M(U; 1; i) = {A, C, G, T}
∪ {AA, AC, AG, CA, CG, GT, TA, TC} ∪ {AAA, ACA, AGT, CAC, CAG, CGT, GTA, GTC, TAA, TCA} ∪ {AAAA, ACAC, ACAG, CACA, CAGT, CGTA, GTAA, TAAA} ∪ {ACAGT, CACAC, CGTAA, GTAAA, TAAAA} ∪ {CGTAAA, GTAAAA} ∪ {CGTAAAA}.
Theorem 4.1 asserts that all fully-frequent motifs in this set M(U; 1) co-occur in every DNA sequence in the set U = {u1, u2, u3, u4}.
On the other hand, for τ < 1 we can apply rough set theory to discover knowledge effectively and efficiently. For example, focusing on the following 3 maximal frequent motifs with τ < 1 in Example 2.1 — a 3/4-frequent motif ACGTAAAA, a 1/2-frequent motif CCC, and a 3/4-frequent motif CACACAGT — we obtain an information table as follows.

An Information Table for Motifs

U\M | ACGTAAAA  CCC  CACACAGT
u1  |    1       1      0
u2  |    0       0      1
u3  |    1       0      1
u4  |    1       1      1

Following the knowledge discovery method presented in (Bell & Guan 1998, Guan & Bell 1998), we find that (using the notation U/s for a string s to mean the equivalence classes for s)
U/ACGTAAAA = {X0, X1}, X0 = {u2}, X1 = {u1, u3, u4},
U/CCC = {Y0, Y1}, Y0 = {u2, u3}, Y1 = {u1, u4},
U/CACACAGT = {Z0, Z1}, Z0 = {u1}, Z1 = {u2, u3, u4}.
Furthermore, it can be found that X0 ∩ Y0 = {u2}, X0 ∩ Y1 = {}, X1 ∩ Y0 = {u3}, X1 ∩ Y1 = {u1, u4}.
Notice that the lower approximation from ACGTAAAA and CCC to Z1 in rough set theory is
∪_{X∩Y ⊆ Z1} X ∩ Y = ∪_{X∩Y ∈ {X0∩Y0, X0∩Y1, X1∩Y0, X1∩Y1}, X∩Y ⊆ {u2, u3, u4}} X ∩ Y = (X0 ∩ Y0) ∪ (X0 ∩ Y1) ∪ (X1 ∩ Y0) = {u2, u3}.
We call this the support set to CACACAGT on Z1 from ACGTAAAA and CCC, and denote it by S_{ACGTAAAA∧CCC}(Z1). Notice that for u2, u3 we have ACGTAAAA(u2) = CCC(u2) = 0, ACGTAAAA(u3) = 1, CCC(u3) = 0. The support degree to CACACAGT on Z1 from ACGTAAAA and CCC is defined and denoted by
spt_{ACGTAAAA∧CCC}(Z1) = |S_{ACGTAAAA∧CCC}(Z1)| / |Z1| = |{u2, u3}| / |{u2, u3, u4}| = 2/3.
It is also called the rule strength. Thus, we find an association rule as follows:
If CCC is absent then CACACAGT is present, with strength 2/3.
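The computation of the support set and the rule strength can be traced with a few lines of Python (our own sketch of the calculation above, not the authors' implementation):

```python
table = {  # information table from the example: motif occurrence per sequence
    "u1": {"ACGTAAAA": 1, "CCC": 1, "CACACAGT": 0},
    "u2": {"ACGTAAAA": 0, "CCC": 0, "CACACAGT": 1},
    "u3": {"ACGTAAAA": 1, "CCC": 0, "CACACAGT": 1},
    "u4": {"ACGTAAAA": 1, "CCC": 1, "CACACAGT": 1},
}

def eq_classes(attr):
    """Partition the rows by the value of one motif attribute."""
    classes = {}
    for u, row in table.items():
        classes.setdefault(row[attr], set()).add(u)
    return classes

X = eq_classes("ACGTAAAA")       # {0: {u2}, 1: {u1, u3, u4}}
Y = eq_classes("CCC")            # {0: {u2, u3}, 1: {u1, u4}}
Z1 = eq_classes("CACACAGT")[1]   # {u2, u3, u4}

# Lower approximation: union of the intersections X_i ∩ Y_j contained in Z1.
support_set = set()
for xi in X.values():
    for yj in Y.values():
        if (xi & yj) and (xi & yj) <= Z1:
            support_set |= xi & yj

strength = len(support_set) / len(Z1)
print(support_set, strength)     # {'u2', 'u3'} and 2/3
```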
5 Summary and Future Work
We have presented algorithms based on theorems developed here to find maximal frequent motifs in DNA strings. Associations between these motifs are then found by applying a data mining technique based on rough set analysis. Further work and applications to discover knowledge about motifs in DNA sequences are currently in progress.
References
Bell, D.A.; Guan, J.W. (1998). "Computational methods for rough classification and discovery", Journal of the American Society for Information Science, Special Topic Issue on Data Mining, Vol. 49 (1998), No. 5, 403-414.
Feldman, R.; Aumann, Y.; Amir, A.; Zilberstain, A.; Kloesgen, W.; Ben-Yehuda, Y. (1997). Maximal association rules: a new tool for mining for keyword co-occurrences in document collection, in Proceedings of the 3rd International Conference on Knowledge Discovery (KDD 1997), 167-170.
Feldman, R.; Aumann, Y.; Zilberstain, A.; Ben-Yehuda, Y. (1998). Trend graphs: visualizing the evolution of concept relationships in large document collections, in Proceedings of the 2nd European Symposium on Knowledge Discovery in Databases, PKDD'98, Nantes, France, 23-26 September 1998; Lecture Notes in Artificial Intelligence 1510: Principles of Data Mining and Knowledge Discovery, Jan M. Zytkow & Mohamed Quafafou, eds.; Springer, 38-46.
Feldman, R.; Fresko, M.; Kinar, Y.; Lindell, Y.; Liphstat, O.; Rajman, M.; Schler, Y.; Zamir, O. (1998). Text mining at the term level, in Proceedings of the 2nd European Symposium on Knowledge Discovery in Databases, PKDD'98, Nantes, France, 23-26 September 1998; Lecture Notes in Artificial Intelligence 1510: Principles of Data Mining and Knowledge Discovery, Jan M. Zytkow & Mohamed Quafafou, eds.; Springer, 65-73.
Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. (1991). Knowledge discovery in databases: an overview. In G. Piatetsky-Shapiro, W.J. Frawley (eds), Knowledge Discovery in Databases (pp. 1-27). AAAI/MIT Press.
Guan, J.W.; Bell, D.A. (1998). "Rough computational methods for information systems", Artificial Intelligence — An International Journal, Vol. 105 (1998), 77-104.
Kiem, H.; Phuc, D. (2000). "Discovering motif based association rules in a set of DNA sequences", in W. Ziarko & Y. Yao (eds), Proceedings of the Second International Conference on Rough Sets and Current Trends in Computing (RSCTC 2000), Banff, Canada, October 16-19, 2000; 348-352. ISSN 0828-3494, ISBN 0-7731-0413-5.
Landau, D.; Feldman, R.; Aumann, Y.; Fresko, M.; Lindell, Y.; Liphstat, O.; Zamir, O. (1998). TextVis: an integrated visual environment for text mining, in Proceedings of the 2nd European Symposium on Knowledge Discovery in Databases, PKDD'98, Nantes, France, 23-26 September 1998; Lecture Notes in Artificial Intelligence 1510: Principles of Data Mining and Knowledge Discovery, Jan M. Zytkow & Mohamed Quafafou, eds.; Springer, 56-64.
Pawlak, Z. (1991). Rough sets: theoretical aspects of reasoning about data. Kluwer.
Srikant, R.; Agrawal, R. (1995-1996). Mining sequential patterns: generalizations and performance improvements, in Proceedings of the Fifth International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996; IBM Research Report RJ 9994, December 1995 (expanded version).
Maximum Item First Pattern Growth for Mining Frequent Patterns

Hongjian Fan 1, Ming Fan 2, and Bingzheng Wang 2

1 Department of CSSE, The University of Melbourne, Parkville, Vic 3052, Australia
{hfan}@cs.mu.oz.au
2 Department of Computer Science, Zhengzhou University, Zhengzhou, China
{mfan,wbz}@zzu.edu.cn
Abstract. Frequent pattern mining plays an essential role in many important data mining tasks. FP-Growth, an algorithm which mines frequent patterns in the frequent pattern tree (FP-tree), is very efficient. However, it still encounters performance bottlenecks when creating conditional FP-trees recursively during the mining process. In this work, we propose a new algorithm, called Maximum-Item-First Pattern Growth (MIFPG), for mining frequent patterns. MIFPG searches the FP-tree in the depth-first, top-down manner, as opposed to the bottom-up order of FP-Growth. Its key idea is that maximum items are always considered first when the current pattern grows. In this way, no concrete realization of conditional pattern bases is needed and the major operations of mining are counting and link adjusting, which are usually inexpensive. Experiments show that, in comparison with FP-Growth, our algorithm is about three times faster and consumes less memory space; it also has good time and space scalability with the number of transactions. Keywords: data mining, frequent patterns, FP-tree, association rules
1 Introduction
Frequent pattern mining [1,10,11] plays an essential role in mining association rules [3], correlations [6], sequential patterns [4], max-patterns [5], partial periodicity [9], classification [8], emerging patterns [7,8] and many other important data mining tasks. Frequent patterns have the so-called anti-monotone Apriori property: if a pattern with k items is not frequent, any of its superpatterns with (k+1) or more items can never be frequent. The recently proposed projection-based pattern growth approach, such as FP-Growth [10], is about an order of magnitude faster than Apriori [3], which represents the level-wise, candidate generation-and-test approach. However, FP-Growth suffers from constructing conditional FP-trees recursively. FP-Growth needs to create a conditional FP-tree whenever the current pattern grows. So, even for a small database, millions of conditional FP-trees will be created when the support threshold is very low or there exist long frequent patterns. It costs CPU time and memory space to physically construct a large number of conditional FP-trees dynamically.
In this paper, we propose a new algorithm, Maximum-Item-First Pattern Growth (MIFPG), for mining frequent patterns in the FP-tree. We adopt the FP-tree as the basis because it is a compact data structure and it contains the complete information relevant to frequent pattern mining. MIFPG belongs to the category of pattern fragment growth methodologies, which avoid expensive candidate generation and test. It absorbs the advantages of pattern growth while using a mining strategy very different from FP-Growth. We use the concept of a prefix sub-forest (the set of all prefix subtrees sharing the same prefix) to perform mining recursively with such a forest. The key idea of MIFPG is that the maximum item is considered first whenever the current prefix pattern is to grow. We can easily obtain prefix sub-forests by dynamically adjusting links and counts in the FP-tree. The main advantage is that we do not need to physically construct separate memory structures of nodes and links, which saves a lot of effort in managing space. Based on the Apriori property, a "skipping" optimization strategy is proposed to reduce the search space as early as possible. Many items do not need to be considered (no node-links even need to be adjusted for them) as the prefix pattern grows, because concatenation of the prefix with those items will definitely not produce frequent patterns. Up to half of the mining time is saved by using this optimization alone. A performance study has been conducted to compare MIFPG with FP-Growth. Our study shows that MIFPG is about three times faster and consumes less memory space. Moreover, it has good time and space scalability with the number of transactions. The remainder of the paper is organized as follows. In Section 2, the FP-tree structure and the concept of prefix sub-forests are introduced. In Section 3, we present the algorithm, Maximum-Item-First Pattern Growth (MIFPG), to mine frequent patterns directly from the FP-tree. Its performance is studied and compared with FP-Growth in Section 4. We conclude our work in Section 5.
2 Background and Terminology
Let I = {x1, x2, ..., xm} be a set of items and TDB = {T1, T2, ..., Tn} be a transaction database, where Ti (i ∈ [1..n]) is a transaction which contains a set of items in I. A set of items is called an itemset or a pattern. A transaction T is said to contain itemset X if and only if X ⊆ T. The number of transactions in TDB that contain X is called the support count of X, denoted as count(X), and the support of X is denoted as support(X), which equals count(X)/|TDB|, where |TDB| is the total number of transactions in TDB. Given a transaction database TDB and a support count threshold ξ, an itemset X is a frequent pattern if and only if count(X) ≥ ξ. The problem of frequent pattern mining is to find the complete set of frequent patterns in a given transaction database with respect to the given support count threshold ξ. Without loss of generality, we assume that there is a partial order on I, denoted as ≺. There are many ways to define ≺. For example, we can sort items according to their support counts in ascending or descending order, or simply use lexicographic order.
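As a quick illustration of these definitions (our own sketch, not code from the paper), support counts can be computed directly from a transaction database:

```python
from itertools import combinations
from collections import Counter

tdb = [{"c", "e"}, {"a", "b", "c", "e"}, {"b", "c", "d"}, {"b", "c"},
       {"a", "b"}, {"b", "f"}, {"b", "e"}, {"a", "c", "e"}, {"a", "c", "d"}]

def count(itemset):
    """count(X): the number of transactions containing itemset X."""
    return sum(1 for t in tdb if itemset <= t)

xi = 2  # minimum support count threshold
# All frequent 2-itemsets, found by brute force for illustration only.
counts = Counter()
for t in tdb:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1
print({p: c for p, c in counts.items() if c >= xi})
print(count({"b", "c"}), count({"b", "c"}) / len(tdb))  # support count, support
```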
We modify the FP-tree from [10] a little to suit our needs. The differences are: (1) our FP-tree is only traversable top-down (from the root to the leaves), i.e., there is no pointer from child to parent nodes; (2) it is ordered: if a node M is the parent of a node N, and items i and j appear in M and N respectively, then i ≺ j. Consider the following transaction database: TDB = {{c, e}, {a, b, c, e}, {b, c, d}, {b, c}, {a, b}, {b, f}, {b, e}, {a, c, e}, {a, c, d}}. Suppose the minimum support count threshold is 2 and ≺ is defined as lexicographic order. f will be removed when creating the FP-tree because its support count is below the minimal count. The FP-tree based on this TDB is shown in Figure 1(a). (The definition and construction of the FP-tree can be found in [10].) Please note that nodes with the same item names are linked in sequence via node-links (the dotted lines in Figure 1), which facilitates tree traversal.
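For concreteness, a minimal top-down FP-tree-like structure with header-table node-links might look as follows in Python (a simplified sketch under our own naming, not the authors' implementation):

```python
class Node:
    def __init__(self, item, count=0):
        self.item = item
        self.count = count
        self.children = {}   # child item -> Node (top-down only; no parent link)
        self.next = None     # node-link to the next node with the same item

def insert(root, header, transaction):
    """Insert one transaction, maintaining counts and node-links."""
    node = root
    for item in sorted(transaction):   # lexicographic order as the partial order
        if item not in node.children:
            child = Node(item)
            # Prepend to the item's node-link chain in the header table.
            child.next = header.get(item)
            header[item] = child
            node.children[item] = child
        node = node.children[item]
        node.count += 1

root, header = Node(None), {}
for t in [{"c", "e"}, {"a", "b", "c", "e"}, {"b", "c", "d"}, {"b", "c"},
          {"a", "b"}, {"b"}, {"b", "e"}, {"a", "c", "e"}, {"a", "c", "d"}]:
    insert(root, header, t)   # the infrequent item f is already dropped

# Walk the node-link chain for item "b" and sum its counts (support of {b}).
n, total = header["b"], 0
while n:
    total += n.count
    n = n.next
print(total)  # 6
```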
2.1 Prefix Sub-forests
A subtree whose root is N includes the node N and its entire offspring. Such a subtree is simply denoted as N in this paper. Definition 1. Let i1 ≺ i2 ≺ ... ≺ ik (k ≥ 1) be items. {i1, i2, ..., ik} is a prefix of subtree N if each ij (1 ≤ j ≤ k − 1) appears in the path from the root to N and ik appears in N. N is a prefix subtree with prefix {i1, i2, ..., ik} if {i1, i2, ..., ik} is its prefix. Note that a subtree may have many prefixes. In fact, if {i1, i2, ..., ik} is a prefix of a subtree N, each subset of {i1, i2, ..., ik} is a prefix of subtree N as long as ik is in the subset. Specially, {ik} is a prefix of subtree N. Among prefix subtrees, we are interested in those that have the same prefix. Definition 2. The set of all prefix subtrees with the same prefix {i1, i2, ..., ik}, denoted as PSF(i1, i2, ..., ik), is called a prefix sub-forest with prefix {i1, i2, ..., ik}.
Fig. 1. Original FP-tree and building PSFs recursively: (a) the original FP-tree; (b) PSF(b); (c) PSF(bc). (Each panel shows a header table with item counts and heads of node-links, and the tree rooted at root.)
Given an item i, the header table is very useful for building PSF(i). PSF(i) can be built directly from the FP-tree by travelling i's node-link, whose head is head[i]. Those prefix subtrees whose root appears in the node-link become members of PSF(i). However, PSF(i1, i2, ..., ik) (k ≥ 2) cannot be built from the FP-tree in a similar way. A prefix subtree with prefix {i1, i2, ..., ik-1, ik} is a subtree of a tree in PSF(i1, i2, ..., ik-1), so PSF(i1, i2, ..., ik-1, ik) can be built from PSF(i1, i2, ..., ik-1). PSF(b), which is shown in Figure 1(b), can be built from the FP-tree by travelling the node-link pointed to by head[b]. Please note that the counts and node-links of c, d, e have changed, because some of them do not appear together with b ("under" b). PSF(bc), which is shown in Figure 1(c), cannot be built in the same way from the original FP-tree. But it can be built by travelling the node-link pointed to by PSF(b).head[c]. Note that the node-link of c is local to b, and only those d, e which appear together with bc are counted and linked.
3 Maximum Item First Pattern Growth (MIFPG)

3.1 Overview
We show the ideas of mining by using the trees shown in Figure 1. Let ξ = 2 be the minimum support count threshold. Given the set enumeration tree shown in Figure 2, we can explore the itemset space in a specific order, where we visit a node, then visit its right and left subtrees. Namely, the itemsets are considered in the following order (a small generator reproducing this order is sketched below):
– {e}
– {d}, {d, e}
– {c}, {c, e}, {c, d}, {c, d, e}
– {b}, {b, e}, {b, d}, {b, d, e}, {b, c}, {b, c, e}, {b, c, d}, {b, c, d, e}
– {a}, {a, e}, {a, d}, {a, d, e}, {a, c}, {a, c, e}, {a, c, d}, {a, c, d, e}, {a, b}, {a, b, e}, {a, b, d}, {a, b, d, e}, {a, b, c}, {a, b, c, e}, {a, b, c, d}, {a, b, c, d, e}
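The following recursive generator (our own sketch, not from the paper) reproduces this maximum-item-first visiting order for the items a ≺ b ≺ c ≺ d ≺ e:

```python
def mifpg_order(items, prefix=()):
    """Yield itemsets in maximum-item-first order over the given items.

    items are the candidates (in increasing ≺ order) that may extend prefix.
    """
    for i in range(len(items) - 1, -1, -1):   # maximum item first
        new_prefix = prefix + (items[i],)
        yield new_prefix
        # Only items larger than items[i] may extend the new prefix.
        yield from mifpg_order(items[i + 1:], new_prefix)

for itemset in mifpg_order(["a", "b", "c", "d", "e"]):
    print(itemset)   # ('e',), ('d',), ('d', 'e'), ('c',), ('c', 'e'), ...
```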
Fig. 2. A complete set enumeration tree over I, with items lexically ordered
For e, we get a length-1 frequent pattern (e : 4). (Since all items in the FP-tree are frequent, we will ignore length-1 frequent patterns in the following discussion.) For d, PSF(d) is built. Since e's count under d is 0, (d, e : 0) is not frequent. For c, PSF(c) is built, and the counts of d, e under c are 2 and 3, respectively. (c, e : 3) and (c, d : 2) are frequent patterns. (c, d : 2) needs to be further grown
by adding e. So PSF(cd) is built, and e's count under cd is 0. (c, d, e) is not frequent. For b, PSF(b) is built, and the counts of c, d, e under b are 3, 1 and 2, respectively. (b, e : 2) and (b, c : 3) are frequent patterns. Since (b, d : 1) is not frequent, there is no further growth of it. But (b, c : 3) needs to grow. PSF(bc) is built, and the counts of d, e under bc are 1 and 1, respectively. No more frequent patterns are output. For a, PSF(a) is built, and the counts of b, c, d, e under a are 2, 3, 1 and 2, respectively. (a, e : 2), (a, c : 3) and (a, b : 2) are frequent patterns. Since (a, d : 1) is not frequent, we do not grow it. We grow (a, c : 3) first. PSF(ac) is built, and the counts of d, e under ac are 1 and 2. (a, c, e : 2) is frequent while (a, c, d : 1) is not. We then grow (a, b : 2). PSF(ab) is built, and the counts of c, d, e under ab are 1, 0 and 1. No more frequent patterns are found. From the above discussion, we can see that if we always consider the maximum item first when the prefix pattern grows, we do not need to create copies of subtrees/nodes; we can update the node-link information of these nodes "in place" and accumulate counts for each item at the same time. For example, PSF(e), which is all the nodes of e in the FP-tree, is considered first. Then we build PSF(d). Now only those nodes of e "appearing" under d are linked by its node-link, and the count of e is accumulated as a by-product. Since PSF(e) has already been checked, it is safe to break the global node-link of e and construct a local one under d. However, if PSF(d) were built before PSF(e), the global node-link of e would be destroyed when building PSF(d), so it would have to be reconstructed when building PSF(e). To do the reconstruction, we would have to search the whole FP-tree for all occurrences of e. This wastes a lot of CPU time, because such reconstructions of node-links would have to be done recursively. One might want to keep the global node-link information of e in separate memory instead of reconstructing it. The problem is that the kept information grows very fast in size as the mining goes recursively.
3.2 The "Skipping" Optimization Strategy
Maximum item first pattern growth ensures that all operations are taken directly in the FP-tree. Seen from the set enumeration tree, if ac is not frequent, then its branch (namely, acd, ace, acde) can be pruned based on the Apriori property. We can also see that, if ae is not frequent, then when we search the ab branch, abce and abcde should not be reached. How can this be achieved? A complicated solution is not desired, because it may be less costly to visit those itemsets than to do a lot of computation to determine which itemsets should not be reached. We provide a simple solution, called the "skipping" optimization strategy. If we check PSF(a) and find that e is not frequent under a, we will regard d (not e, which is the global maximum item) as the local maximum item in the branch of a. So for a, only b, c, d are considered (e is excluded). If we check PSF(ab) and find that d, e are not frequent under ab, then c is the local maximum item in the branch of ab. A transaction database usually contains thousands of items, and much search space is pruned by "skipping" many items. Suppose there are items [1 ... 1000]. When building PSF(1), it may be that the items from 900 to 1000 are not frequent under
1, so the local frequent maximum becomes 899. When {1} grows by adding 2, if the items from 700 to 899 are not frequent under {1, 2}, the new local frequent maximum becomes 699. What is more, those supersets of {1, 2} which contain items larger than the local maximum are not enumerated at all, and their corresponding nodes in the FP-tree are not counted and linked. In this way, many irrelevant items are skipped and a lot of unnecessary node counting and linking is avoided. This achieves considerable performance gains (see Section 4).
3.3 The Algorithm MIFPG
Suppose all items are mapped into a set of consecutive positive integers [0 ... max] and ξ is the minimum support count threshold. Mining frequent patterns can be done by calling MIFPGrowth(φ, max). α and β are prefixes (itemsets). The variable last is the local frequent maximum item.

Procedure MIFPGrowth(α, last) {
  (1) if (count[last] ≥ ξ) then output α ∪ last with count[last];
  (2) Let k be the maximum item in α;
  (3) for i = last - 1 downto k + 1 do {
  (4)   if (count[i] ≥ ξ) {
  (5)     Output β = α ∪ i with count[i];
  (6)     BuildPSF(β);
  (7)     Let newlast be the local frequent maximum in PSF(β);
  (8)     MIFPGrowth(β, newlast);
  }}}

Procedure BuildPSF(β) {
  (1) Find the local frequent maximum item in PSF(β);
  (2) Adjust node-links and accumulate counts of all nodes in PSF(β), except those nodes with larger items than the local maximum;
}
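The control flow of MIFPGrowth can also be shown in runnable form. The Python sketch below is our own simplified illustration: it computes counts by scanning transactions rather than by the in-place node-link adjustment that gives the real algorithm its speed, so it demonstrates only the search order and the "skipping" of items above the local frequent maximum.

```python
def count_with(tdb, alpha, item):
    """Support count of alpha ∪ {item} in the transaction database."""
    return sum(1 for t in tdb if alpha <= t and item in t)

def mifp_growth(tdb, items, xi, alpha=frozenset(), k=-1, last=None):
    """Grow alpha with items[i], k < i <= last, maximum item first."""
    if last is None:
        last = len(items) - 1
    counts = {i: count_with(tdb, alpha, items[i]) for i in range(k + 1, last + 1)}
    # "Skipping": items above the local frequent maximum are never touched.
    frequent = [i for i, c in counts.items() if c >= xi]
    if not frequent:
        return
    local_last = max(frequent)
    print(sorted(alpha | {items[local_last]}), counts[local_last])
    for i in range(local_last - 1, k, -1):     # for i = last-1 downto k+1
        if counts[i] >= xi:
            beta = alpha | {items[i]}
            print(sorted(beta), counts[i])
            mifp_growth(tdb, items, xi, beta, k=i)  # only items > i extend beta

tdb = [frozenset(t) for t in
       [{"c","e"}, {"a","b","c","e"}, {"b","c","d"}, {"b","c"},
        {"a","b"}, {"b"}, {"b","e"}, {"a","c","e"}, {"a","c","d"}]]
mifp_growth(tdb, ["a", "b", "c", "d", "e"], xi=2)
```

On the example database of Section 2 this prints the same frequent patterns, in the same order, as the trace in Section 3.1 (e.g. (e:4), (d:2), (c:6), (c,e:3), (c,d:2), ...).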
4 Performance Study
In this section, we present a performance comparison of MIFPG with FP-Growth, one of the most efficient published algorithms for mining frequent patterns; we also compare them in two respects: time and space scalability with the number of transactions. MIFPG is written in Microsoft Visual C++ 6.0. The experiments are performed on a 450 MHz Pentium PC machine with 128 megabytes of main memory, running Microsoft Windows 98. The experimental environment is almost the same as in [10], and the same data sets are used in our experiments, T25:I10:D10K (D1) and T25:I20:D100K (D2), which are synthetic data generated using the procedure described in [5]. Before we do the performance comparison with FP-Growth, we perform experiments to understand more of MIFPG's behaviour: the effect of the "skipping" optimization strategy and of different types of tree orderings. To investigate the effect of the "skipping" optimization strategy, we conduct experiments on data
Maximum Item First Pattern Growth for Mining Frequent Patterns 50
80 70
With "skipping" Without "skipping"
60
Run time (sec.)
Run time (sec.)
521
50 40 30 20
Descending order Ascending order Lexicographic order
40 30 20 10
10
0
0 0
0.1
0.2
0.4
0.5
0.6
0.8
0
1
0.1
0.2
0.4
0.5
0.6
0.8
1
2
3
Minimum support (%)
Minimum support (%)
Fig. 3. The behaviour of MIFPG
set D2 using two versions of MIFPG. One version uses the "skipping" optimization strategy, the other does not. Figure 3 (left) shows that using the "skipping" optimization strategy can save more than half of the mining time, especially when the support threshold is very low. As mentioned before, the partial order on I (the set of all items) can be support-count-ascending order, support-count-descending order, or simply lexicographic order. To see the effect of different orderings on MIFPG's performance, we run it on data set D2 using these three orders respectively. In all the experiments, the "skipping" optimization strategy is used, and D2 is scanned twice. Figure 3 (right) shows that the three orders give almost the same performance, with lexicographic order a bit worse.
Fig. 4. Scalability with support threshold (run time of MIFPG vs. FP-Growth on D1 and D2, minimum support from 0.1% to 3%)
After studying the behaviour of our algorithm, we can see that MIFPG reaches its top performance when it uses the "skipping" optimization strategy and support-count-descending order. Now we compare it with FP-Growth. The scalability of MIFPG and FP-Growth as the support threshold goes down from 3% to 0.1% is shown in Figure 4, where the run time used here means the total execution time, i.e., the period between input and output, and the run-time data for the FP-Growth algorithm are taken from [11]. MIFPG scales better than FP-Growth. This is because, as the support threshold goes down, the number and the length of frequent patterns increase dramatically. The number of conditional FP-trees that FP-Growth must build becomes extremely large. To construct a conditional FP-tree, it needs to build new nodes, copy the information from old nodes to new ones, build links to form a tree structure, and create node-links. This needs not only a lot of memory, but also a lot of CPU time. On the other hand, to build a PSF, only node-links need to be adjusted. MIFPG is about three times faster than FP-Growth when the support threshold is 0.1%.
Fig. 5. Scalability with number of transactions (left: run time; right: memory consumption; MIFPG vs. FP-Growth on D2, 10K to 100K transactions)
To test the scalability with the number of transactions, experiments on data set D2 are used. The support threshold is set to 1%. As shown in Figure 5 (left), both MIFPG and FP-Growth have linear scalability with the number of transactions from 10K to 100K, but the run time of MIFPG grows more slowly than that of FP-Growth as the number of transactions goes up. The run time of MIFPG increases about six times when the number of transactions increases ten times, while the run time of FP-Growth increases about ten times under the same conditions. Figure 5 (right) shows that the memory consumed by MIFPG is always less than that consumed by FP-Growth. This is because (1) although the FP-tree structure used by both algorithms contains the same number of nodes for a given data set and support threshold, MIFPG's tree is only traversable top-down, so one pointer is saved in each node, and (2) FP-Growth needs additional memory to build conditional FP-trees when it mines frequent patterns, while MIFPG mines frequent patterns directly in the FP-tree structure without using any extra memory. Compared with FP-Growth, about one fourth of the memory is saved by MIFPG.
5 Conclusions
We have proposed an efficient algorithm, MIFPG, to mine frequent patterns directly in the FP-tree without generating conditional FP-trees. Our performance study shows that MIFPG is about three times faster than FP-Growth, and saves about one fourth of the memory. It also shows that MIFPG is more scalable than FP-Growth as the number of transactions goes up.
Acknowledgement. This joint work is supported in part by the Natural Science Foundation of Henan Province of China under Grant No. 0111060700.
References 1. Ramesh C. Agarwal, Charu C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3), 2001. 2. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. ACM-SIGMOD’98, pages 94–105, Seattle, WA, USA, June 1998. 3. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB’94, pages 487–499, Santiago, Chile, Sept 1994.
4. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. ICDE’95, Taipei. 5. R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. ACM-SIGMOD’98, pages 85–93, Seattle, WA, USA, June 1998. 6. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. Data Mining and Knowledge Discovery, 2, 1998. 7. G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. KDD’99, pages 43–52, San Diego, CA, USA, Aug 1999. 8. H. Fan and K. Ramamohanarao. An efficient single-scan algorithm for mining essential jumping emerging patterns for classification. In Proc. PAKDD’02, Taipei. 9. J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. ICDE’99, pages 106–115, Sydney, Australia, 1999. 10. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD’00, pages 1–12, Dallas, TX, USA, May 2000. 11. J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In Proc. ICDM’01, San Jose, CA.
Extended Random Sets for Knowledge Discovery in Information Systems Yuefeng Li School of Software Engineering and Data Communications Queensland University of Technology, Brisbane, QLD 4001, AUSTRALIA [email protected]
Abstract. In this paper, we discuss the problem of knowledge discovery in information systems. A model is presented for users to obtain either “objective” interesting rules or “subjective” judgments of meaningful descriptions based on their needs. Extended random sets are first presented to describe the relationships between condition granules and decision granules. An interpretation is then given to show what we can obtain from the extended random sets.
1 Introduction Data mining, which is also referred to as knowledge discovery in databases, is a process of nontrivial extraction of implicit, previously unknown and potentially useful information (patterns) from data in databases [Chen et al., 1996] [Fayyad et al., 1996]. To discover such potentially useful knowledge, several typical approaches have been presented. They are data classification [Yao et al., 1997] [Yao and Yao, 2002], data clustering, and mining association rules. The objective of mining association rules is to discover all rules that have support and confidence greater than the user-specified minimum support and minimum confidence [Agrawal et al., 1993]. The form of a rule is “A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn”, where Ai and Bj are sets of attribute values from the relevant attributes in a database. The meaning of association rules can be explained by probability theory such that A ⇒ B is an interesting rule iff P(B|A) – P(B) is greater than a suitable constant. The frequency of occurrence is a well-accepted criterion for mining association rules. Besides frequency, the rules should reflect real-world phenomena; that is, data mining is to find real-world patterns in the portion of the real world that the data is presented about [Lin, 2002a]. In rough set theory, patterns in data can be characterized by means of approximations, or equivalently by decision rules induced by the data [Pawlak, 2002]. The measures of uncertainty that are combined in knowledge-based systems may come from various sources [Walley, 1996]. Some may be “objective” measures, based
on relative frequencies or on well-established statistical models. The domain expert, based on “subjective” judgements, may supply other assessments of uncertainty. In this paper, we discuss the problem of knowledge discovery in information systems. We present a model for users to obtain either “objective” interesting rules or “subjective” judgments of meaningful descriptions based on their needs. To this end, we present extended random sets to describe the relationships between condition granules (premises) and decision granules (conclusions). An extended random set can be interpreted as a probability function. By using the probability function, we can easily determine the interesting rules. We also notice that some descriptions on the set of decision granules are meaningful. We present the concept of meaningful descriptions, and use another interpretation (belief functions) of extended random sets to measure uncertainties of the meaningful descriptions as well. The remainder of the paper is structured as follows. We begin by introducing some basic definitions for rough sets and information systems in Section 2. Working from these definitions, in Section 3 we discuss the relationship between condition granules and decision granules; the extended random sets are defined in this section. We show how to get the interesting rules from an extended random set in Section 4. In Section 5 we present the concept of meaningful descriptions on the set of decision granules and interpret an extended random set as a belief function; this interpretation can be used to get “subjective” judgments for the meaningful descriptions. Section 6 closes this paper and gives an outlook on further work.
2 Basic Definitions My presentation of information systems in this section is based largely on the discussion in [Pawlak, 2002] and [Yao, 2001], which represents the rough set view of reasoning from data. Let U be a universe (a non-empty finite set of objects), and A be a set of attributes. We call a pair S = (U, A) an information system if there is a function for every attribute a∈A such that a: U → Va, where Va is the set of all values of a. We call Va the domain of a. Let B be a subset of A. B determines a binary relation I(B) on U such that (x, y) ∈ I(B) if and only if a(x) = a(y) for every a∈B, where a(x) denotes the value of attribute a for element x∈U. It is easy to prove that I(B) is an equivalence relation, and the family of all equivalence classes of I(B), that is, the partition determined by B, is denoted by U/I(B) or simply by U/B. The classes in U/B are referred to as B-granules or B-elementary sets. The class which contains x is called the B-granule induced by x, and is denoted by B(x). To meet user information needs, the attributes in an information system can be divided into two groups: condition attributes and decision attributes. The information system is now represented by a triple (U, C, D), where C, the condition attributes, and D, the decision attributes, are disjoint subsets of A, and C∪D=A. We call the
triple (U, C, D) a decision table of the information system S = (U, A), and C(x) and D(x) the condition granule and the decision granule induced by x, respectively. Table 1 shows an example of a decision table. A similar decision table was originally used in [Pawlak, 2002], in which 6 facts were mentioned. In order to clearly show my motivation in this paper, I add a new fact, the last row in Table 1. This table now includes 7 facts that concern 1,000 cases of driving a car in various driving conditions, where N denotes the number of analogous cases. In this decision table, we assume that the columns labelled weather and road are condition attributes, and the columns labelled time and accident are decision attributes.
Table 1. An example of decision tables

Fact No. | Weather | Road    | Time  | Accident | N
1        | Misty   | Icy     | Day   | Yes      | 80
2        | Foggy   | Icy     | Night | Yes      | 140
3        | Misty   | Not icy | Night | Yes      | 40
4        | Sunny   | Icy     | Day   | No       | 500
5        | Foggy   | Icy     | Night | No       | 20
6        | Misty   | Not icy | Night | No       | 200
7        | Misty   | Icy     | Night | Yes      | 20

(Weather and Road are the driving condition; Time and Accident are the consequence.)
Let A = {weather, road, time, accident}, C = {weather, road}, and D = {time, accident}. Then we have
U/C = {{1,7}, {2,5}, {3,6}, {4}} = {c1, c2, c3, c4} – the set of condition granules,
U/D = {{1}, {2,3,7}, {4}, {5,6}} = {d1, d2, d3, d4} – the set of decision granules.
We could obtain decision rules from the above decision table, e.g., “if the weather is foggy and the road is icy then the accident occurred at night” in 140 cases. We can also define a language on the attributes of A. Let L be a language defined using attributes of A; an atomic formula is given by a = v, where a∈A and v∈Va. Formulas can also be formed by logical negation, conjunction and disjunction [Yao, 2001]. A formula is called a basic formula in this paper if it is an atomic formula or is formed only by conjunction. We will use the concept of basic formulas to distinguish descriptions on the decision granules.
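As a concrete illustration, the following Python sketch recomputes the granule sets U/C and U/D from the facts in Table 1; the tuple encoding of the table is an assumption made only for the example.

    from collections import defaultdict

    # Facts 1-7 from Table 1: (weather, road, time, accident, N)
    facts = {
        1: ("Misty", "Icy",     "Day",   "Yes",  80),
        2: ("Foggy", "Icy",     "Night", "Yes", 140),
        3: ("Misty", "Not icy", "Night", "Yes",  40),
        4: ("Sunny", "Icy",     "Day",   "No",  500),
        5: ("Foggy", "Icy",     "Night", "No",   20),
        6: ("Misty", "Not icy", "Night", "No",  200),
        7: ("Misty", "Icy",     "Night", "Yes",  20),
    }
    ATTR = {"weather": 0, "road": 1, "time": 2, "accident": 3}

    def partition(attrs):
        """U/B: group facts whose values agree on every attribute in B."""
        granules = defaultdict(set)
        for x, row in facts.items():
            granules[tuple(row[ATTR[a]] for a in attrs)].add(x)
        return sorted(granules.values(), key=min)

    print(partition(["weather", "road"]))   # U/C = [{1, 7}, {2, 5}, {3, 6}, {4}]
    print(partition(["time", "accident"]))  # U/D = [{1}, {2, 3, 7}, {4}, {5, 6}]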
3 Extended Random Sets Let Θ = U/C be the set of condition granules, and Ω = U/D be the set of decision granules. We can give descriptions on Θ and Ω to describe the possible decision rules. Table 2 gives such descriptions for the decision rules in Table 1. In general, facts are not knowledge. In order to find knowledge from information systems, we now consider the relationships between the premises and the conclusions of rules.
We could rewrite the decision rules in Table 2 as
c1 → { (d1, 80), (d2, 20) };  c2 → { (d2, 140), (d4, 20) };
c3 → { (d2, 40), (d4, 200) };  c4 → { (d3, 500) }.
These determine a mapping from Θ to 2^(Ω×R), where R is the set of real numbers. The mapping can be normalized as the following mapping Γ such that
Γ : Θ → 2^(Ω×[0,1])  and  ∑_{(fst, snd)∈Γ(θ)} snd = 1  for every θ∈Θ.
For example, if we use the decision table in Table 1, we can obtain the following mapping by using the above definition.

Table 2. Description of decision rules on granules
if c1 then d1 in 80 cases or d2 in 20 cases
else if c2 then d2 in 140 cases or d4 in 20 cases
else if c3 then d2 in 40 cases or d4 in 200 cases
else if c4 then d3 in 500 cases

Γ(c1) = { (d1, 80/100), (d2, 20/100) } = { (d1, 0.8), (d2, 0.2) }
Γ(c2) = { (d2, 140/160), (d4, 20/160) } = { (d2, 0.875), (d4, 0.125) }
Γ(c3) = { (d2, 40/240), (d4, 200/240) } = { (d2, 1/6), (d4, 5/6) }
Γ(c4) = { (d3, 500/500) } = { (d3, 1) }
Now we consider the support degree for each condition granule. The obvious way is to use the frequency in the table of facts, that is,
f_i = ∑_{x∈ci} N(x)    (1)
where N(x) is the number of analogous cases of fact x. For data mining, the frequency is a well-accepted criterion. But it is not the only criterion for support degree, because some condition granules with high frequencies may be meaningless. For example, when we use keywords to represent the meaning of documents, we usually consider both keyword frequency and inverse document frequency (e.g., the popular technique tf*idf in information retrieval), because some words (like “this”) have high frequencies in a document but may appear in all documents. In order to use the above idea, we assume there is a base which contains many information systems (or databases). Given a decision table (U, C, D), we can define a weight function w on Θ = U/C which satisfies w(ci) = fi × log(m/ni) for every ci ∈ Θ, where m is the total number of the information tables in the base, fi is the frequency of the condition granule ci as defined in equation (1), and ni is the
number of the information tables in the base that contain the given condition granule ci. By normalizing, we can get a probability function P on Θ such that
P(θ) = w(θ) / ∑_{ϑ∈Θ} w(ϑ)  for every θ∈Θ.
Based on the above analysis, we could use a pair (Γ, P) to represent what we can obtain from an information system. We call the pair (Γ, P) an extended random set.
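The construction of (Γ, P) can be illustrated with a short Python sketch. It uses the special case w(ci) = fi that the paper's own examples employ later, since modelling the log(m/ni) factor would require a base of several information tables; the dictionary encoding is an assumption of this sketch.

    # Decision-granule counts inside each condition granule, from Table 1.
    counts = {
        "c1": {"d1": 80,  "d2": 20},
        "c2": {"d2": 140, "d4": 20},
        "c3": {"d2": 40,  "d4": 200},
        "c4": {"d3": 500},
    }

    # Gamma: normalize each row so the snd components sum to 1.
    gamma = {c: {d: n / sum(row.values()) for d, n in row.items()}
             for c, row in counts.items()}

    # P with the plain-frequency weight w(ci) = fi.
    f = {c: sum(row.values()) for c, row in counts.items()}
    total = sum(f.values())
    P = {c: fc / total for c, fc in f.items()}

    print(gamma["c2"])  # {'d2': 0.875, 'd4': 0.125}
    print(P)            # {'c1': 0.1, 'c2': 0.16, 'c3': 0.24, 'c4': 0.5}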
4 Mining Interesting Rules According to the definitions in the previous section, we can obtain the following decision rules:
ci → fst_{i,1}, ci → fst_{i,2}, …, ci → fst_{i,|Γ(ci)|}    (2)
for a given condition granule ci, where
Γ(ci) = {(fst_{i,1}, snd_{i,1}), …, (fst_{i,|Γ(ci)|}, snd_{i,|Γ(ci)|})}.
We define the strengths of these decision rules to be P(ci) × snd_{i,1}, …, P(ci) × snd_{i,|Γ(ci)|}, and the corresponding certainty factors to be snd_{i,1}, …, snd_{i,|Γ(ci)|}. If w(ci) = fi, we can prove that
P(ci) × snd_{i,j} = |ci ∩ fst_{i,j}| / |U|,  and  snd_{i,j} = |ci ∩ fst_{i,j}| / |ci|.
These results are the same as the definitions of [Pawlak, 2002] in the case w(ci) = fi. A decision rule ci → fst_{i,j} is an interesting rule if pr(fst_{i,j} | ci) − pr(fst_{i,j}) is greater than a suitable constant. From the definition of the mapping Γ, we have pr(fst_{i,j} | ci) = snd_{i,j}. To decide the probabilities of the decision granules, we give the following function:
pr : Ω → [0,1]  such that  pr(ω) = ∑_{θ∈Θ, (ω, snd)∈Γ(θ)} P(θ) × snd    (3)
We can prove that pr is a probability function on Ω, because
∑_{ω∈Ω} pr(ω) = ∑_{ω∈Ω} ∑_{θ∈Θ, (ω, snd)∈Γ(θ)} P(θ) × snd = ∑_{θ∈Θ} ∑_{(fst, snd)∈Γ(θ)} P(θ) × snd = ∑_{θ∈Θ} P(θ) ∑_{(fst, snd)∈Γ(θ)} snd = ∑_{θ∈Θ} P(θ) × 1 = 1.
Table 3. Probability function on the set of decision granules

Decision granule | Description                      | pr
d1               | Accident occurred at night       | 0.08
d2               | Accident occurred in daytime     | 0.20
d3               | Accident not occurred in daytime | 0.50
d4               | Accident not occurred at night   | 0.22
For example, if we assume w(ci) = fi, then we can obtain an extended random set (Γ, P) from Table 1, where P(c1) = 0.1, P(c2) = 0.16, P(c3) = 0.24, P(c4) = 0.5; and Γ(c1) = { (d1, 0.8), (d2, 0.2) }, Γ(c2) = { (d2, 0.875), (d4, 0.125) }, Γ(c3) = { (d2, 1/6), (d4, 5/6) }, and Γ(c4) = { (d3, 1) }. From equation (3) we have pr(d1) = P(c1)×0.8 = 0.1×0.8 = 0.08; pr(d2) = P(c1)×0.2 + P(c2)×0.875 + P(c3)×(1/6) = 0.1×0.2 + 0.16×0.875 + 0.24×(1/6) = 0.20; pr(d3) = P(c4)×1 = 0.5×1 = 0.5; and pr(d4) = P(c2)×0.125 + P(c3)×(5/6) = 0.16×0.125 + 0.24×(5/6) = 0.22. Table 3 shows the probability function on the set of decision granules.
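Equation (3) can be evaluated mechanically; the self-contained Python sketch below hard-codes the (Γ, P) just derived and reproduces the pr column of Table 3.

    # pr(omega) per equation (3): sum over condition granules of P(theta)*snd.
    gamma = {"c1": {"d1": 0.8, "d2": 0.2},
             "c2": {"d2": 0.875, "d4": 0.125},
             "c3": {"d2": 1/6, "d4": 5/6},
             "c4": {"d3": 1.0}}
    P = {"c1": 0.1, "c2": 0.16, "c3": 0.24, "c4": 0.5}

    pr = {}
    for c, row in gamma.items():
        for d, snd in row.items():
            pr[d] = pr.get(d, 0.0) + P[c] * snd

    print({d: round(v, 2) for d, v in sorted(pr.items())})
    # {'d1': 0.08, 'd2': 0.2, 'd3': 0.5, 'd4': 0.22}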
5 Interpretation of Extended Random Sets From Table 3, we can observe a very interesting phenomenon: only some descriptions on the set of decision granules are meaningful for a given information system if we use “or” to combine decision granules. For example, “d1 or d2” means “Accident occurred”, and this conclusion can be derived from some decision rules, e.g., “if the weather is foggy and the road is icy and the time is day then the accident occurred”. We can also use a basic formula (e.g., accident = yes) on the decision attributes to represent it. The description “d1 or d3” (i.e., “accident occurred at night or accident not occurred in daytime”), however, is not the conclusion of any decision rule on the information system; that is, we cannot find a basic formula on the decision attributes to represent it. From Table 1 we know the probability of “Accident occurred” is (80+140+40+20)/1000 = 0.28, that is, pr(d1)+pr(d2). We could interpret this observation by defining another decision table (U, E, F) of S, where E = {weather, road, time}, F = {accident}, E⊃C, and F⊂D. For the new decision table, there are only 2 decision granules, i.e., “accident = yes” and “accident = no”. We can compute the probability function using equation (3); hence we have pr(“accident = yes”) = 0.28 and pr(“accident = no”) = 0.72. This result can also be derived from Table 3, that is, pr(“accident = yes”) = pr(d1) + pr(d2), and pr(“accident = no”) = pr(d3) + pr(d4). I do not formally prove this result in this paper; the proof will be given in future work. In this section, I define the meaningful descriptions. A description X on the set of decision granules of a decision table (U, C, D) is meaningful if there is a decision table (U, E, F) such that E⊃C and X∈U/F. Now we consider how to measure the uncertainty of a description on the set of decision granules. The easy way is to use ∑_{x∈X} pr(x); however, it is very sensitive to the frequencies of facts. To obtain a relatively stable measure, we consider a random set (ξ, P) [Kruse et al., 1991] [Liu and Li, 1997] which is derived from the extended random set (Γ, P):
ξ : Θ → 2^Ω  and  ξ(θ) = { fst | (fst, snd) ∈ Γ(θ) }.
The random set (ξ, P) determines a Dempster-Shafer mass function m on Ω such that
m(X) = P({θ | θ∈Θ, ξ(θ) = X})  for every X⊆Ω.    (4)
This mass function can decide a belief function and a plausibility function as well. They are defined as follows: bel_m : 2^Ω → [0,1], pl_m : 2^Ω → [0,1], and
bel_m(X) = ∑_{Y⊆X} m(Y),  pl_m(X) = ∑_{Y∩X≠∅} m(Y)  for every X⊆Ω.    (5)
We can prove that bel_m(X) ≤ ∑_{x∈X} pr(x) ≤ pl_m(X) for every X⊆Ω.
For example, from the above definitions we have
bel_m(X) = ∑_{Y⊆X} m(Y) = ∑_{Y⊆X} P({θ | θ∈Θ, ξ(θ) = Y}) = P({θ | θ∈Θ, ξ(θ) ⊆ X})
and
∑_{x∈X} pr(x) = ∑_{x∈X} ∑_{θ∈Θ, (x, snd)∈Γ(θ)} P(θ) × snd
= ∑_{θ∈Θ, ξ(θ)⊆X} ∑_{(fst, snd)∈Γ(θ)} P(θ) × snd + ∑_{θ∈Θ, ξ(θ)⊄X} ∑_{y∈ξ(θ)∩X} P(θ) × snd_y
≥ ∑_{θ∈Θ, ξ(θ)⊆X} ∑_{(fst, snd)∈Γ(θ)} P(θ) × snd = ∑_{θ∈Θ, ξ(θ)⊆X} P(θ) ∑_{(fst, snd)∈Γ(θ)} snd = ∑_{θ∈Θ, ξ(θ)⊆X} P(θ) × 1 = P({θ | θ∈Θ, ξ(θ) ⊆ X}),
where (y, snd_y) ∈ Γ(θ). Hence bel_m(X) ≤ ∑_{x∈X} pr(x). Analogously, we can prove ∑_{x∈X} pr(x) ≤ pl_m(X).
An explanation of the proposed measures of uncertainty and an interpretation of the interval [bel_m, pl_m] (see [Shafer, 1976]) is suggested here. Experts can use the interval [bel_m, pl_m] to check whether their “subjective” judgements for some meaningful descriptions are correct. For example, from equation (4) we have m({d1, d2}) = P(c1) = 0.1, m({d2, d4}) = P(c2) + P(c3) = 0.4, m({d3}) = P(c4) = 0.5, and m(X) = 0.0 for any other X⊆Ω. We can verify that there are 4 meaningful descriptions out of the 4+6+4+1 = 15 descriptions on Ω. They are “d1 or d2”, “d3 or d4”, “d2 or d3”, and “d1 or d4”. Table 4 shows the uncertainty measures for the meaningful descriptions.
Table 4. Uncertainty measures for the meaningful descriptions
Description on Ω | Subset of Ω | pr   | m   | [bel_m, pl_m]
“d1 or d2”       | {d1, d2}    | 0.28 | 0.1 | [0.1, 0.5]
“d3 or d4”       | {d3, d4}    | 0.72 | 0.0 | [0.5, 0.9]
“d2 or d3”       | {d2, d3}    | 0.70 | 0.0 | [0.5, 1.0]
“d1 or d4”       | {d1, d4}    | 0.30 | 0.0 | [0.0, 0.5]
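The quantities in Table 4 follow directly from equations (4) and (5); the Python sketch below derives ξ, m, bel_m and pl_m from the same hard-coded (Γ, P) used in the earlier sketch.

    # xi(theta) keeps only the decision granules of Gamma(theta); m aggregates
    # P over condition granules with equal xi (equation (4)).
    gamma = {"c1": {"d1": 0.8, "d2": 0.2},
             "c2": {"d2": 0.875, "d4": 0.125},
             "c3": {"d2": 1/6, "d4": 5/6},
             "c4": {"d3": 1.0}}
    P = {"c1": 0.1, "c2": 0.16, "c3": 0.24, "c4": 0.5}

    m = {}
    for c, row in gamma.items():
        X = frozenset(row)
        m[X] = m.get(X, 0.0) + P[c]

    def bel(X):                                      # equation (5)
        return sum(v for Y, v in m.items() if Y <= X)

    def pl(X):
        return sum(v for Y, v in m.items() if Y & X)

    for desc in ({"d1", "d2"}, {"d3", "d4"}, {"d2", "d3"}, {"d1", "d4"}):
        X = frozenset(desc)
        print(sorted(X), round(bel(X), 2), round(pl(X), 2))
    # ['d1', 'd2'] 0.1 0.5    ['d3', 'd4'] 0.5 0.9
    # ['d2', 'd3'] 0.5 1.0    ['d1', 'd4'] 0.0 0.5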
6 Summary As mentioned in Section 1, KDD is a promising new field, and it is a difficult task nowadays. This paper uses information systems to formalise the concept of knowledge for KDD. It also presents extended random sets to describe the interesting rules and the meaningful descriptions. The main contributions of this paper are as follows: the concept of extended random sets is presented to describe the relationships between condition granules and decision granules, which supports a new approach to finding interesting rules from a list of facts; and the concept of meaningful descriptions on the set of decision granules is presented for mining some useful concepts from an information system. Our current and ongoing efforts focus on the following subjects: discussing relationships between probabilities that are defined in different decision tables; and presenting an algorithm to find meaningful descriptions on the set of decision granules, and the corresponding premises on the set of condition granules.
References [Agrawal, 1993] R. Agrawal, T. Imielinski and A. Swami, Database mining: a performance perspective, IEEE Transactions on Knowledge and Data Engineering, 1993, 5(6):914–925. [Chen et al., 1996] M.-S. Chen, J. Han, and P. S. Yu, Data mining: an overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6):866–883. [Fayyad et al., 1996] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in knowledge discovery and data mining, Menlo Park, California: AAAI Press/The MIT Press, 1996. [Greco et al., 2002] S. Greco, Z. Pawlak, and R. Slowinski, Generalized decision algorithm, rough inference rules, and flow graphs, 3rd International Conference on Rough Sets and Current Trends in Computing, Malvern, PA, USA, 2002, 93–104. [Kruse et al., 1991] R. Kruse, E. Schwecke and J. Heinsohn, Uncertainty and vagueness in knowledge based systems (Numerical Methods), Springer-Verlag, New York, 1991. [Lin, 2002a] T. Y. Lin, The lattice structure of database and mining multiple level rules, Bulletin of International Rough Set Society, 2002, Vol. 6, No. 1/2, pp. 11–16. [Liu and Li, 1997] D. Liu and Y. Li, The interpretation of generalized evidence theory, Chinese Journal of Computers, 1997, 20(2):158–164. [Pawlak, 2002] Z. Pawlak, In pursuit of patterns in data reasoning from data – the rough set way, 3rd International Conference on Rough Sets and Current Trends in Computing, Malvern, PA, USA, 2002, 1–9. [Shafer, 1976] G. Shafer, A mathematical theory of evidence, Princeton University Press, Princeton, NJ, 1976. [Walley, 1996] P. Walley, Measures of uncertainty in expert systems, Artificial Intelligence, 1996, 83: 1–58. [Yao and Yao, 2002] J. Y. Yao and Y. Y. Yao, Induction of classification rules by granular computing, 3rd International Conference on Rough Sets and Current Trends in Computing, Malvern, PA, USA, 2002, 331–338.
[Yao, 2001] Y. Y. Yao, On modelling data mining with granular computing, Proceedings of COMPSAC 2001, 638–643. [Yao et al., 1997] Y. Y. Yao, S. K. M. Wong and T. Y. Lin, A review of rough set models, in: Rough sets and data mining, edited by T. Y. Lin and N. Cercone, Kluwer Academic Publishers, Boston, 1997, 47–75.
Research on a Union Algorithm of Multiple Concept Lattices
Zongtian Liu (1), Liansheng Li (2), and Qing Zhang (3)
(1) School of Computer, Shanghai University, 200072, [email protected]
(2) Automatic Command Staff, Armored Force Engineering Institute, Beijing
(3) Department of Computing and Information Technology, Fudan University, 200433
Abstract. Concept lattices have played an important role in data mining and data processing. The paper gives definitions of same-field contexts, consistent contexts, same-field concept lattices, and consistent concept lattices, provides definitions of the addition operation of two same-field and consistent contexts as well as the union operation of two same-field and consistent concept lattices, and proves that the two operations are isomorphic and satisfy other interesting mathematical properties, such as the commutative and associative laws, as well as having left and right identity elements. According to the definitions and properties of the union operation, a union algorithm of multiple concept lattices is derived, in which some heuristic knowledge from the order relation of the concepts is used, so the time efficiency of the algorithm can be improved. Experiments show that using the algorithm to merge two concept lattices distributed on different sites into one is evidently superior to using Godin’s algorithm to insert the objects of the formal context corresponding to the second concept lattice one by one into the first lattice. Evidently, the algorithm provided in the paper is an effective parallel algorithm to construct concept lattices. Keywords: Formal Concept Analysis, Multiple concept lattices, Union operation of concept lattices, Union algorithm of concept lattices
1 Introduction In philosophy, a concept is understood as an idea unit with two parts: extension and intension. Based on this philosophical idea, Professor Wille provided the theory of formal concepts [1] for discovering, ordering, and displaying concepts. In formal concept analysis, the extension of a concept is the set of all objects which belong to the concept, and the intension is the set of all attributes which all these objects possess. The concept lattice, as the core data structure of formal concept analysis, essentially describes the relationship between objects and attributes, and indicates the relationship of generalization and specialization between concepts. At present, there is research on the mathematical properties of concept lattices, their relationship with fuzzy sets and rough sets, and their extended models, and some important research achievements have been gained [2], [3], [4], [5], [6]. Concept lattices have important theoretical significance and broad applicability in data mining and processing. But their time and space complexity is still a
puzzle for their application. The main reason is that most of the algorithms are serial, and the data used to produce the lattice are stored in a centralized mode. Using parallel computation and distributed storage is a basic approach to solving the problem effectively. [7] introduces a parallel algorithm to produce concept lattices, but it lacks theoretical and experimental support. Because concept lattices have good mathematical properties and adapt well to distributed processing [6], the model of multiple concept lattices is a very suitable tool for parallel computation and distributed storage. The paper studies some mathematical properties of multiple concept lattices, defines the union operation of multiple concept lattices, studies the mathematical properties of the operation, and constructs a union algorithm of multiple concept lattices. Our experiments show that the algorithm is usable and effective.
2 Mathematical Model In formal concept analysis, a formal context is considered as a triple K=(U, A, I), where U is a set of objects and A is a set of attributes, while I is a binary relation between U and A, that is, I⊆U×A. Therefore oId, i.e. (o, d)∈I, means that object o possesses characteristic d. In a formal context K, two maps f and g are defined between the power set of U and the power set of A as follows:
∀O⊆U: f(O) = {d | ∀x∈O (xId)}
∀D⊆A: g(D) = {x | ∀d∈D (xId)}
A binary tuple (O, D) from P(U)×P(A) is called a formal concept, or concept for short, in a formal context K if it satisfies both conditions O=g(D) and D=f(O); it is denoted by C=(O, D), where D is referred to as the intension of C and O as the extension, denoted by intent(C) and extent(C) respectively. The set of all formal concepts in K is denoted by CS(K). There is an important relationship in CS(K): super-concept and sub-concept, or generalization and specialization, or predecessor and successor. If D2⊆D1, (O1, D1) is a sub-concept of (O2, D2), denoted by (O1, D1)≤(O2, D2). With this relationship, we get an ordered set (CS(K), ≤). It is a complete lattice, known as the concept lattice of the formal context K, denoted by L(K). A set of concept lattices is called multiple concept lattices. In CS(K), if C≥C1, …, C≥Cn, and there exists no concept C’ with C≥C’ such that C’≥C1, …, C’≥Cn, then we say that C is the smallest upper bound of C1, …, Cn. Definition 1: If the formal contexts K1=(U1, A, I1) and K2=(U2, A, I2) satisfy U1⊆U, U2⊆U, then K1 and K2 are called same-field formal contexts of U and A, while L(K1) and L(K2) are called same-field concept lattices of U and A, or same field for short. If K1 and K2 satisfy U1∩U2=φ, then they are referred to as independent on extension, or independent for short, and so are L(K1) and L(K2). If, for any o∈U1∩U2 and any d∈A, oI1d ⇔ oI2d is satisfied, then K1 and K2 are referred to as consistent, and so are L(K1) and L(K2). Evidently, independent implies consistent, but not vice versa.
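The derivation operators f and g are straightforward to implement. The following Python sketch, over a small made-up context (all object and attribute names are illustrative), mirrors the definitions above.

    # A tiny formal context K = (U, A, I).
    U = {"o1", "o2", "o3"}
    A = {"a", "b", "c"}
    I = {("o1", "a"), ("o1", "b"), ("o2", "b"), ("o2", "c"), ("o3", "b")}

    def f(O):
        """Attributes shared by every object in O."""
        return {d for d in A if all((x, d) in I for x in O)}

    def g(D):
        """Objects possessing every attribute in D."""
        return {x for x in U if all((x, d) in I for d in D)}

    def is_concept(O, D):
        return g(D) == O and f(O) == D

    print(f({"o1", "o2"}))                        # {'b'}
    print(is_concept({"o1", "o2", "o3"}, {"b"}))  # True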
3) If O3=O1≠φ in K1 and O2=φ in K2, then there is C1=(O3, D3) in L(K1) and g(D3)=φ in K2. This means that there is the smallest concept C2=(φ, A) in K2. So C3 may be produced from C1 and C2 by Definition 6-1. Therefore, C3=(O3, D3) belongs to the union L(K1)∪L(K2).
4) If O3=O2≠φ in K2 and O1=φ in K1, then the proof is similar to 3).
Basic Laws
1) If Lφ denotes the concept lattice which has only the one concept C=(φ, A), then any concept lattice L in the field A satisfies L∪Lφ = Lφ∪L = L.
2) If L1, L2, L3 are same-field and consistent multiple concept lattices, then L1∪L2 = L2∪L1 and (L1∪L2)∪L3 = L1∪(L2∪L3).
4 A Union Algorithm of Concept Lattices 4.1 Basic Idea of the Algorithm Definition 7: For a concept C=(O, D) and a concept lattice L(K), K=(U, A, I), we say that concept C and concept lattice L(K) are same field if D⊆A. Definition 8: For a concept C=(O, D) which is same field with a concept lattice L, if there exists a concept C1=(O1, D1) in L satisfying D1⊆D, then we say that C1 is an updated concept for C. Definition 9: For a concept C=(O, D) which is same field with a concept lattice L, if there exists a concept C1=(O1, D1) in L satisfying D∩D1=D’≠φ and no concept having intension D’ exists in L, then we say that C1 is a candidate of the producer of C for D’. Theorem 2: Among all candidates of the producer of C for D’, there must exist one and only one concept which is the smallest upper bound of all the candidates. This smallest upper bound is referred to as the producer of C for D’. Proof: Suppose there exist multiple such smallest upper bounds, two of them being C1=(O1, D1) and C2=(O2, D2); then D∩D1=D’ and D∩D2=D’, so D1∩D2=D3⊇D’ can be deduced. That is to say, there should exist a concept C3 in L whose intension is D3. If D3=D’, then neither C1 nor C2 is a candidate of the producer of C for D’. If D3⊃D’, then C3 is also a candidate of the producer of C for D’, but C3 is larger than C1 and larger than C2, so neither C1 nor C2 is a super-concept of C3. This result conflicts with the initial supposition.
Definition 10: To insert a concept C=(O, D) into a concept lattice L which is same field with it means that: 1) for any concept C1=(O1, D1) in L, if it is an updated concept for C, then modify it as C’1=(O1∪O, D1); 2) for any concept C1=(O1, D1) in L, if it is the producer of C for D’, then append the new concept C’=(O1∪O, D’) to L. Evidently, Definition 10 can be implemented with an algorithm similar to Godin’s algorithm [8]. The only difference is that Godin’s algorithm inserts objects, while this algorithm inserts concepts. Theorem 3: Given two same-field and consistent concept lattices L1 and L2, if all concepts in L2 are inserted into L1 one by one, the result L1’ is the union of L1 and L2. Proof: omitted. Given that L(K1) and L(K2) are two same-field and consistent concept lattices, let us try to get L(K1)∪L(K2). The basic idea of our algorithm is to use a method similar to Godin’s algorithm to insert all concepts of L(K2) into L(K1) one by one. However, the time complexity of the algorithm is related to the numbers of nodes in L(K1) and L(K2), and the number of nodes in the new L(K1) increases gradually during the algorithm. We now deduce some heuristic knowledge from the order of the concept lattice to reduce the time complexity of the algorithm. Theorem 4: Suppose all the concepts in L2 are inserted into L1 in the ascending order of the intent lengths of the concepts. If C1 is an updated or a newly produced concept in L1 for a concept C2’ in L2, and the concept C2 in L2 is inserted into L1 later than C2’, then the operation between C1 and C2 need not be considered. Proof: We know C1 is formed from C2’. If C2’ is a predecessor concept of C2, then D1⊆D2’⊂D2 and O1⊃O2’⊃O2, so that C1+C2=C1; that is to say, no operation need be executed between C1 and C2 when C2 is inserted into L1. If C2’ is not a predecessor concept of C2, suppose C1 is formed from C2’ and C1’, with C1’=(O1’, D1’) in L1; then D1=D1’∩D2’ and O1=O1’∪O2’. Let D2∩D2’=D2’’ and O2∪O2’=O2’’; then C2’’=(O2’’, D2’’) is a concept in L2, and it is a predecessor concept of C2’ and C2. If C1 and C2 were to form C, then D=D1∩D2=D1’∩D2’∩D2 and O=O1∪O2=O1’∪O2’∪O2. However, C2’’ must be inserted into L1 before C2’, and C2’’ and C1’ should form the updated or newly produced concept (O1’∪O2’∪O2, D1’∩D2’∩D2), which is equal to C; so no operation need be executed between C1 and C2 when C2 is inserted into L1. Because the concepts in L2 are inserted into L1 in the ascending order of the intent lengths of the concepts, it is not possible for successors of C2 to be inserted into L1 before C2 is inserted. According to Theorem 4, the modified Godin’s algorithm can be provided as follows.
4.2 A Union Algorithm of Two Same-Field and Independent Concept Lattices

Algorithm: A union algorithm of two same-field and independent concept lattices
Input: L(K1) and L(K2), two same-field and consistent concept lattices
Output: L(K1)∪L(K2)
BEGIN
  FOR every concept in L(K2), in the ascending order of the intent lengths of the concepts, DO
    Insert the concept into L(K1) using the modified Godin's algorithm
  ENDFOR
  The updated L(K1) is the original L(K1)∪L(K2)
END

Algorithm: modified Godin's algorithm
Input: a concept lattice L, and a concept C = (O, D)
Output: the modified lattice
BEGIN
  Classify the concepts in L by the lengths of their intents, letting B[i] := {C : ||Intent(C)|| = i};
  size := max(i)
  Initialize the buckets of updated and newly produced concepts by the lengths of their intents, i.e. B′[i] := Φ;
  FOR k := 0 TO size DO
    FOR each C in B[k] DO
      IF C has the tag of updated or newly produced THEN continue;  (*)
      IF Intent(C) ⊆ D THEN  {that is to say, concept C is an updated one}
      BEGIN
        Extent(C) := Extent(C) ∪ O
        Put C into B′[||Intent(C)||]
        Attach the tag of updated or newly produced to C  (*)
        IF Intent(C) = D THEN exit algorithm
      END
      ELSE
        int := Intent(C) ∩ D
        IF there exists no C1 ∈ B′[||int||] such that Intent(C1) = int THEN
        {that is to say, C is a producing node}
        BEGIN
          add concept Cn := (Extent(C) ∪ O, int) to L
          add Cn to B′[||int||]
          update the related edges
          Attach the tag of updated or newly produced to C  (*)
        END
      ENDIF
    ENDFOR
  ENDFOR
END

In the algorithm, the lines marked with (*) are additions to Godin's algorithm.
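For readers who prefer runnable code, the following Python sketch captures the core of the concept-insertion step of Definition 10. It represents a lattice simply as a map from intents to extents and omits edge maintenance and the tag-based skipping of Theorem 4, so it illustrates the idea rather than reproducing the paper's full algorithm.

    def insert_concept(L, extent, intent):
        """Insert concept (extent, intent) into lattice L, a sketch of the
        modified Godin step of Section 4.2.  L maps each intent (frozenset)
        to its extent (set).  Assumes L contains a concept whose intent is a
        superset of `intent` (e.g. the bottom concept with the full attribute
        set as its intent)."""
        intent = frozenset(intent)
        produced = {}
        for old in sorted(L, key=len):                 # ascending intent length
            if old <= intent:                          # updated concept
                L[old] |= extent
                if old == intent:
                    return L                           # concept already represented
            else:
                shared = old & intent
                if shared and shared not in L and shared not in produced:
                    produced[shared] = L[old] | extent # producing node
        L.update(produced)
        return L

    # Usage: a lattice over attributes {a, b, c}; insert the concept ({3}, {a, b}).
    L = {frozenset("abc"): set(), frozenset("a"): {1}, frozenset(): {1, 2}}
    insert_concept(L, {3}, "ab")
    print({tuple(sorted(k)): v for k, v in L.items()})
    # {(): {1, 2, 3}, ('a',): {1, 3}, ('a', 'b'): {3}, ('a', 'b', 'c'): set()}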
5 Analysis of the Experimental Results We have implemented the union algorithm with Visual C++ 6.0 on Windows 2000. To test the time efficiency of the algorithm in many kinds of situations, six data sets were randomly generated, according to the given numbers of objects in the contexts and the probabilities of a relation holding between the objects and the attributes. In the experiments, the number of attributes in every data set is 10. Across the 6 data sets, the probabilities of having a relation are 30%, 40% and 50%, respectively. The results are shown in Table 1.

Table 1. The results of the experiments
Database No. | Number of objects | Probability of relation | Nodes in whole lattice | Nodes in former-N/2 lattice | Time for former N/2 (Godin) | Nodes in behind-N/2 lattice | Time for behind N/2 (Godin) | Time for whole lattice (Godin) | Time to unite the two lattices
1 | 1,000 | 30% | 174 | 169 |  33,999 | 170 |  37,414 |   111,069 |  27,800
2 | 1,000 | 40% | 368 | 321 |  62,350 | 324 |  68,138 |   197,414 |  91,251
3 | 1,000 | 50% | 562 | 485 | 106,503 | 505 | 114,745 |   324,377 | 222,460
4 | 2,000 | 30% | 177 | 176 | 103,369 | 177 | 111,260 |   351,936 |  50,794
5 | 2,000 | 40% | 387 | 372 | 192,286 | 370 | 196,282 |   615,975 | 179,446
6 | 2,000 | 50% | 619 | 566 | 323,646 | 563 | 322,183 | 1,005,657 | 430,830
The experiments show: 1) The union algorithm provided in the paper is reliable, as the result obtained by the algorithm is consistent with that obtained by Godin's algorithm; 2) The time to produce the lattice from the former N/2 by Godin's algorithm, then to produce the lattice from the behind N/2 by Godin's algorithm, and then to unite the two lattices, is almost equal to the time to produce the whole lattice by Godin's algorithm. 3) For the situation where two concept lattices have been constructed in a distributed environment and need to be united, the time to unite the two lattices by
the algorithm provided in the paper is evidently shorter than that needed to add the concepts corresponding to the second lattice into the first lattice by Godin's algorithm. Moreover, with the union algorithm, the context corresponding to the second lattice need not be retained. 4) In a parallel processing environment, a concept lattice can be constructed by a parallel method; that is to say, the context is divided into many parts, the small lattices corresponding to the parts are constructed in parallel, and then the lattices are united by the algorithm provided in the paper, again and again in parallel, until the whole lattice is formed.
6 Conclusion The mathematical operations and properties of multiple concept lattices can be regarded as a theoretical foundation for building a distributed concept lattice storage model and its parallel algorithms. In this paper, we construct the union algorithm of multiple concept lattices. The algorithm will play a role in distributed or parallel environments. I heartily thank the Council of the National Natural Science Foundation of China, as this research belongs to the projects of the National Natural Science Foundation of China, No. 60275022 and No. 69985004.
References 1. Wille, R. Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht, 1982, 445–470. 2. Bělohlávek R. Similarity relations in concept lattices. Journal of Logic and Computation, 2000, 10(6): 823–845. 3. Burusco A, Fuentes R. The study of the L-fuzzy concept lattice. Mathware Soft Comput., 1994, I(3): 209–218. 4. Wang Z. Research on extended rough set model. Dissertation of Ph.D., Hefei University of Technology, Hefei, Anhui, China, 230009, 1998. 5. Liu Z. Research on Generalized Concept Lattice Model for Tolerance Approximate Space. Chinese Journal of Computers, 2000, 23(1): 66–70. 6. Xie Z. Research on Knowledge Discovery Based on Concept Lattice Model. Dissertation of Ph.D., Hefei University of Technology, Hefei, Anhui, China, 230009, 2001. 7. Njiwoua P, Mephu Nguifo E. A parallel algorithm to build concept lattice. In Proceedings of the 4th Groningen Intl. Information Technology Conference for Students, 1997, 103–107. 8. Godin R, Missaoui R, Alaoui H. Incremental concept formation algorithms based on Galois (concept) lattices. Computational Intelligence, 1995, 11(2): 246–267. 9. Bernhard Ganter, Rudolf Wille. Formal Concept Analysis. Springer, 1999.
A Theoretical Framework for Knowledge Discovery in Databases Based on Probabilistic Logic Ying Xie and Vijay V. Raghavan The Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, LA 70504-4330, USA {yxx2098, raghavan}@cacs.louisiana.edu Abstract. In order to further improve the KDD process in terms of both the degree of automation achieved and types of knowledge discovered, we argue that a formal logical foundation is needed and suggest that Bacchus’ probability logic is a good choice. By completely staying within the expressiveness of Bacchus’ probability logic language, we give formal definitions of a “pattern” as well as its determiners, which are “previously unknown” and “potentially useful”. These definitions provide a sound foundation to overcome several deficiencies of current KDD systems with respect to novelty and usefulness judgment. Furthermore, based on this logic, we propose a logic induction operator that defines a standard process through which all the potentially useful patterns embedded in the given data can be discovered. This logic induction operator provides a formal characterization of the “discovery” process itself.
1
Introduction
Knowledge Discovery in Databases (KDD) is defined to be the non-trivial extraction of implicit, previously unknown and potentially useful patterns from data [1]. This concise definition implies the basic capabilities that a KDD system should have. In reality, however, the user of current KDD systems is only provided limited automation: on the one hand, he is required to specify in advance exactly which type of knowledge is considered potentially useful and thus needs to be discovered; on the other hand, he often has to make the novelty judgment from a glut of patterns generated by KDD systems. This deficiency leaves considerable room for improving the KDD process in terms of both the degree of automation and the ability to discover a variety of knowledge types. In the following subsections, we will show in detail that in order to achieve such improvement, a formal theoretical framework of KDD is needed, and suggest that Bacchus’ probabilistic logic provides a good foundation for us to start from. 1.1
On “Previously Unknown Pattern”
One can imagine that without being aware of what is “already known”, there is no way for KDD systems to judge what is “previously unknown”. Nevertheless,
the view of a “real world” of current KDD systems is just a set of facts or data, so that any discovered pattern with parameters higher than thresholds will be deemed novel by the KDD systems, though they may not be so to human users. In other words, the lack of the ability to model the “already known patterns” burdens the user with the task of novelty judgment. In order to improve the automation of the knowledge discovery process, we need a comprehensive model of the “real world”, which requires a common formalism to represent both facts and patterns. However, the formalization of the representation alone is not enough to solve the novelty judgment problem. The inference ability among patterns is also required. For example, assume that we only have the following two “already known patterns”: 1) More than 80% of the students didn’t get A in the test; 2) Every student living on campus got A. Now suppose we obtained another pattern through some KDD process: more than 80% of the students didn’t live on campus. Does it belong to “previously unknown pattern” or not? Most of us may agree that it does not, because this pattern can be easily deduced from the other two. Therefore, in order to effectively identify “previously unknown patterns”, the deductive ability of the formalism that helps in recognizing relationships among patterns (statistical assertions) is also a necessity. 1.2
On “Potentially Useful Pattern”
In almost all the previous literature, “potential usefulness” was viewed as a user-dependent measure. This popular opinion implies that the user already knows, or currently has the ability to know, which type of pattern will be potentially useful. If we cannot say that there exists to some degree a paradox in the way the terms “previously unknown” and “potentially useful” are handled, we can at least say that current practices ignore a group of patterns that are indeed potentially useful. This group of patterns has the potential to lead to some useful actions that are not currently realized by the user. In practice, a pattern with conditional probability 30% will be thought to be uninteresting. However, consider the following situation: the conditional probability of A given B is 1% or 80% (that is, either it is too low or too high) and the conditional probability of A given B ∧ C is 30%, where A is some kind of disease, B denotes some population and B ∧ C is a subpopulation of B. Now one will feel the latter pattern is potentially useful. Conversely, a pattern with conditional probability 90% will always be thought useful. However, consider the scenario: the conditional probability of A given B ∧ C is 90%, and the conditional probability of A given B equals 91.5%. For this case, obviously, it is not B ∧ C that implies A. Therefore, on the one hand, what we can get, with current practices, is just a subset of all potentially useful patterns; on the other hand, what we really get may include those that are not potentially useful. In order to solve this problem, apparently, we need a more precise formulation of the criteria for the notion of “potential usefulness”.
One of the major purposes of KDD is to utilize the statistical knowledge extracted from given data to help one’s decision making. In decision theory, propositional probability, whose statements represent assertions about the subjective state of one’s beliefs, is always employed as the guide to action in the face of uncertainty [2]. However, the subjective state of one’s belief always comes from, or is affected by, the objective statistical state of the world, which is described by statistical probability. Therefore, direct inference, which in AI deals with the inference of propositional probability from statistical assertions, provides us a clue for defining “potential usefulness”. Roughly speaking, a statistical pattern that can affect one’s propositional assertion, when one faces some particular situation, will be potentially useful. 1.3
Bacchus’ Probabilistic Logic
Given that the scope of our discussion is limited to statistical knowledge discovery, which is one of the most active themes in KDD, we found that Bacchus’s probabilistic logic [2], an augmentation of first order logic that provides a unified formalism to represent and reason with both statistical probability and propositional probability, is a good foundation for our formal analysis of KDD. For example, its first order logic part can be applied to model facts and relational structures, while the statistical probability part of its language can be used to represent and reason with patterns. Moreover, it integrates an elegant direct inference mechanism, based on which we can give a formal definition of “potential usefulness”. Most important of all, the logic itself is sound. For detailed information about this logic, please refer to [2][3]. 1.4
Contributions of This Paper
Based upon Bacchus’s probabilistic logic, this paper proposes a theoretical framework for (statistical) knowledge discovery in databases: first, a logical definition of “previously unknown, potentially useful pattern” is given by completely remaining within the framework of Bacchus’s logic language, which provides a sound foundation for solving several deficiencies of current KDD systems with respect to novelty and usefulness judgment. Secondly, we propose a logic induction operator that defines a standard process through which all the potentially useful patterns embedded in the given data can be discovered. This logic induction operator formally characterizes the discovery process itself. Therefore, general knowledge discovery (independent of any application) is defined to be any process functionally equivalent to the process determined by this logic induction operator for a given data set. Finally, by showing that the following knowledge types: association rule, negative rule, exception rule and peculiarity rule are all special cases of our potentially useful patterns, user-guided knowledge discovery is defined to be discovery of a subset of all previously unknown and potentially useful patterns, discoverable via general knowledge discovery, in order to satisfy the user’s current needs.
2
Logical Model of a “Real World”
In this paper, a “real world” W is a tuple (I, KB0), where I denotes an information table transformed from a database, while KB0 is the background knowledge base. Therefore, logically modeling a “real world” includes building the logic semantic structure from the information table and representing the background knowledge base with logic formulas. In this section, we mainly talk about the first task. Now, given an information table I=(U, A, {Va | a ∈ A}, {Ia | a ∈ A}), where U is a set of objects of the interesting domain, A is the set of attributes, Va is called the value set of attribute a ∈ A, and Ia : U → Va is an information function, we build the logic semantic structure M = <O, V, µ> through the following steps:
• O = U;
• for each Va ∈ {Va | a ∈ A} and each va ∈ Va, va is a unary object predicate symbol, and the interpretation function V assigns to va the set of objects {o ∈ O | Ia(o) = va}, i.e. va^V = {o ∈ O | Ia(o) = va};
• two specific object predicate symbols are defined, Ø and F, such that Ø^V = φ and F^V = O;
• µ is a discrete probability function, such that for any object o ∈ O, µ(o) = 1/|O|, and for any subset A ⊆ O, µ(A) = ∑_{o∈A} µ(o) and µ(O) = 1;
• the interpretation function V assigns to every n-ary numeric predicate symbol a subset of R^n; for example, the 2-ary predicate symbol “=” will be assigned the set {(1/8, 1/8), (2/8, 2/8), . . .}; and V assigns to every n-ary numeric function symbol a function from R^n to R, which gives the proper meanings to symbols +, -, ×, min, max and so on.
Example 1: The student information table (shown in Table 1) is given by I=(U, A, {Va | a ∈ A}, {Ia | a ∈ A}), where U = {S1, S2, . . ., S8}; A = {Grad./Under., Major, GPA}; V_Grad./Under. = {Gr, Un}, V_Major = {Sc, En, Bu}, V_GPA = {Ex, Gd, Fa}; I_Grad./Under. : U → V_Grad./Under., I_Major : U → V_Major, I_GPA : U → V_GPA. Hence, the corresponding semantic structure for this information table is M = <O, V, µ>, where
• O = U = {S1, S2, . . ., S8};
• the set of unary object predicate symbols is {Gr, Un, Sc, En, Bu, Ex, Gd, Fa}; and by the interpretation function V, Gr^V = {S1, S2, S3}, Un^V = {S4, S5, S6, S7, S8}, Sc^V = {S1, S3, S5}, and so on;
• by the discrete probability function µ, for every student Si ∈ O, µ(Si) = 1/8; and for any subset A ⊆ O, µ(A) = ∑_{Si∈A} µ(Si) and µ(O) = 1;
• by the interpretation function V, numeric predicate symbols such as >, =, <, and numeric function symbols such as +, -, ×, min, max get their proper meaning.
Now, based on the semantic structure M we built, some truth assignments of formulas and interpretations of terms are shown as follows:
Table 1. Student Information Table

Student | Grad./Under. | Major | GPA
S1      | Gr           | Sc    | Ex
S2      | Gr           | En    | Ex
S3      | Gr           | Sc    | Gd
S4      | Un           | Bu    | Ex
S5      | Un           | Sc    | Gd
S6      | Un           | Bu    | Fa
S7      | Un           | En    | Fa
S8      | Un           | En    | Gd
– M |= Gr(S1), because S1 ∈ Gr^V;
– M ⊭ En(S1), because S1 ∉ En^V;
– M ⊭ Ø(Si), for any Si ∈ O;
– M |= F(Si), for any Si ∈ O;
– M |= F = ¬Ø;
– [Gr]x = µ{Si : M |= Gr(Si)} = µ{S1, S2, S3} = 3/8;
– M |= [Gr]x = 3/8.
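These evaluations can be checked mechanically. The Python sketch below hard-codes the predicate extensions from Table 1 (an encoding assumed only for the example) and evaluates µ and the term [Gr]x.

    # Extensions V(p) of the unary object predicate symbols from Table 1.
    O = {f"S{i}" for i in range(1, 9)}
    V = {
        "Gr": {"S1", "S2", "S3"}, "Un": {"S4", "S5", "S6", "S7", "S8"},
        "Sc": {"S1", "S3", "S5"}, "En": {"S2", "S7", "S8"}, "Bu": {"S4", "S6"},
        "Ex": {"S1", "S2", "S4"}, "Gd": {"S3", "S5", "S8"}, "Fa": {"S6", "S7"},
    }

    def mu(A):                      # discrete uniform probability function
        return len(A) / len(O)

    def satisfies(pred, o):         # M |= pred(o)
        return o in V[pred]

    print(satisfies("Gr", "S1"))    # True
    print(mu(V["Gr"]))              # 0.375, i.e. [Gr]x = 3/8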
3
Previously Unknown and Potentially Useful Pattern
Based on the logical representation of a “real world”, in this section a logical definition of “previously unknown, potentially useful pattern” is provided by completely remaining within the framework of Bacchus’s logic language. In terms of statistical knowledge, a pattern can be viewed as the quantitative description of concepts and the relationships between concepts. Therefore, a logic definition of pattern will be based on the logic definition of concept. Definition 1: Concept. If P is an object predicate symbol, then P is a concept (it is also called an atomic concept); both Ø and F are called special concepts; if P and Q are concepts, then P ∧ Q is a concept (it is also called a compositive concept if neither P nor Q is a special concept). Definition 2: Pattern. If P is a concept, then [P]x is a p-term; if P, Q are concepts and Q ≠ Ø, then [P|Q]x is a p-term; if p is a p-term and r ∈ R, then the formula p = r is a pattern. Definition 3: Validity, Implicitness of Pattern. A pattern P is valid by M iff M |= P. If the formula [P|Q]x = r is a pattern, it is called an implicit pattern by M iff it is valid by M and P ≠ Q and Q ≠ F. Any other valid pattern is called an explicit pattern. Intuitively, an implicit pattern cannot be obtained by just querying the database using only one standard SQL statement. As mentioned in Section 1,
if a pattern can be easily deduced from the previously known ones, it should not be called a previously unknown pattern. The fact that the previously known knowledge base can be represented formally using Bacchus’ logic formulas not only makes it possible to give a formal definition of what a previously unknown pattern is, but also provides a mechanism by which the KDD system can judge the novelty of a pattern with a greater degree of precision. Definition 4: Previously Unknown Pattern. Assume Pr is an efficient enough deductive program based on Bacchus’s probability logic. For any implicit pattern [P|Q]x = r, if no formula like [P|Q]x = r1 ∧ (r−e < r1) ∧ (r1 < r+e), where r1 can be any number that belongs to R, can be deduced from KB0 by Pr within c steps, this implicit pattern is called a previously unknown pattern, which is denoted as KB0 ⊬Pr(c) [P|Q]x = r. Parameter c is called the complexity controller, which is utilized to balance the use of deduction and induction processes. Parameter e is called the error controller, which is utilized to identify the same type of knowledge. Now, it is time for us to formalize the notion of usefulness of patterns. Intuitively, our idea of “potential usefulness”, which is based on the direct inference mechanism of Bacchus’s probability logic, can be described as follows: if a narrower reference class provides much more knowledge about a property than any of the wider reference classes, the statistical information of this narrower reference class will be potentially useful in deciding the probability of any individual, which belongs to these reference classes, having that property. Formally, let C = {P1, P2, . . . , Pn} denote all the atomic concepts defined on O. The ∧-closure C* is defined to be the minimum set containing all the concepts in C as well as the special concepts F and Ø that is closed under ∧. Now we define the binary relation ⊑ on C*, such that for any concept Qi ∈ C*, we have Ø ⊑ Qi; and for any pair of concepts Qi, Qj ∈ C*, where Qi ≠ Ø and Qj ≠ Ø, we have:
Qi ⊑ Qj ⇔ [Qj | Qi]x = 1    (1)
The tuple <C*, ⊑> is a complete lattice that we call the concept lattice. Example 2: Let’s continue with the logic representation built in Example 1. The set of all the atomic concepts is C = {Gr, Un, Sc, En, Bu, Ex, Gd, Fa}, and we have the concept lattice <C*, ⊑> shown in Fig. 1, each node of which represents a concept. Now, we define inside-attributes for each concept as follows: For the concept F, every atomic concept is called an inside-attribute of F; for the concept Ø, inside-attribute is not defined; for any other concept ∧_{i=1}^{k} Qi, where k ≥ 1, each Qi is called a defining-attribute of this concept, and every other atomic concept P with P ∧ (∧_{i=1}^{k} Qi) ≠ Ø is called an inside-attribute of this concept. Definition 5: Potentially Useful Pattern. Let Q be any concept except Ø on the concept lattice, and let the concepts P1, P2, . . ., Pj be all the parent concepts of Q. Let A = ∧_k (a_k), where each a_k is an inside-attribute of Q (of course, a_k is then also an inside-attribute of all of Q’s parent concepts). Now a valid pattern [A|Q]x = r
547
F Gr
Sc
...
Fa
En
G r ^S c
S c^F a
...
E n ^F a
U n ^E n
G r ^S c^E x
G r ^S c^F a
...
U n ^E n ^F a
U n ^E n ^G d
O
Fig. 1. Concept Lattice
is called a positive pattern by M if M |= r − max_{1≤i≤j} [A|Pi]x > s; and it is called a negative pattern by M if M |= min_{1≤i≤j} [A|Pi]x − r > s, where M |= (0 < s) ∧ (s ≤ 1). Both positive and negative patterns are potentially useful patterns. Parameter s is called the significance controller.
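Definition 5 amounts to a simple numeric test. The following Python sketch applies it; the probability values and the choice s = 0.1 are illustrative assumptions, not taken from the paper.

    def classify(parent_probs, r, s=0.1):
        """Definition 5 as a predicate: `parent_probs` lists [A|Pi]x for all
        parent concepts Pi of Q, r is [A|Q]x, s is the significance controller."""
        if r - max(parent_probs) > s:
            return "positive pattern"
        if min(parent_probs) - r > s:
            return "negative pattern"
        return "not potentially useful"

    print(classify([0.01, 0.05], 0.30))  # positive: rare in every parent class
    print(classify([0.915], 0.90))       # not potentially useful (the 90%/91.5%
                                         # scenario from Section 1.2)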
4
Logic Induction Operator =|
In order to formalize the knowledge discovery process using Bacchus’ logic language, we define a logic induction operator =| that allows us to formally characterize a discovery process, based on the concept lattice, through which all the potentially useful patterns can be found. Definition 6: Logic Induction Operator =|. Given the logic semantic structure M, assume that <C*, ⊑> is the concept lattice decided by M. The logic induction operator =| determines the following inductive process to get a set of potentially useful patterns, denoted as PUP: traverse the concept lattice from F to Ø; for each concept node P (except F and Ø), do the following two steps: 1) check all of its inside-attributes Ai: if the pattern [Ai|P]x is a potentially useful pattern, put it in the set PUP; 2) Loop: for any pair of potentially useful patterns decided by P, i.e. [ϕ|P]x = r1 and [ψ|P]x = r2, if [ϕ ∧ ψ|P]x = min(r1, r2) is a potentially useful pattern, put it into the set PUP, until no new potentially useful pattern is discovered. This process is denoted as M =| PUP. Theorem 1: Completeness of the Logic Induction Operator =|. Given M, a pattern P is a potentially useful pattern iff M =| PUP and P ∈ PUP. Definition 7: General Knowledge Discovery. Given an information table I and KB0, a general knowledge discovery is any efficient inductive process that extracts a set of patterns P, such that P ∈ P iff M =| PUP, P ∈ PUP and KB0 ⊬Pr(c) P, where M is the logic semantic structure built from I. Users can guide the knowledge discovery process by providing more constraints to get a subset of PUP, in order to satisfy their current needs. Several existing statistical knowledge discovery types can be viewed as user-guided knowledge discovery.
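One possible rendering of the =| traversal as code is sketched below; the three callables are assumed to be supplied by the caller, since building the full lattice machinery is beyond this illustration.

    from itertools import combinations

    def induce_pup(concepts, inside_attrs, is_useful):
        """Sketch of the =| traversal of Definition 6.  `concepts` lists the
        lattice nodes from F down to the empty concept, `inside_attrs(P)`
        gives P's inside-attributes, and `is_useful(A, P)` applies the
        Definition 5 test (comparing [A|P]x against all parents of P)."""
        pup = set()
        for P in concepts:                              # top-down traversal
            # Step 1: single inside-attributes forming useful patterns.
            local = {(a,) for a in inside_attrs(P) if is_useful((a,), P)}
            # Step 2: close the useful patterns of P under conjunction.
            grown = True
            while grown:
                grown = False
                for A1, A2 in combinations(sorted(local), 2):
                    A = tuple(sorted(set(A1) | set(A2)))
                    if A not in local and is_useful(A, P):
                        local.add(A)
                        grown = True
            pup |= {(A, P) for A in local}
        return pup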
Definition 8: User-guided Knowledge Discovery. Given an information table I and KB0, a user-guided knowledge discovery is any efficient inductive process that extracts a set of patterns P such that P ⊂ PUP and, for any P ∈ P, KB0 ⊬Pr(c) P, where M is the logic semantic structure built from I and M =| PUP. The following existing statistical knowledge types can be viewed as special subsets of PUP:
– Association rule P ⇒ Q [9]: [Q|P]x = r is a positive pattern; extra constraint: r > threshold;
– Negative rule P ⇒ ¬Q [10]: [Q|P]x = r is a negative pattern; extra constraint: r < threshold;
– Exception rule P ⇒ Q, P ∧ P′ ⇒ ¬Q [11]: [Q|P]x = r1 is a positive pattern and [Q|P ∧ P′]x = r2 is a negative pattern; extra constraints: r1 > threshold1 and r2 < threshold2;
– Peculiarity rule P ⇒ Q [12]: [Q|P]x = r is a positive pattern; extra constraints: r > threshold1 and [P]x < threshold2.
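As a small illustration of Definition 8, an association rule is just a positive PUP entry passed through one extra threshold filter (the tuple layout and threshold below are hypothetical):

def association_rules(PUP, threshold=0.8):
    # PUP entries as (kind, P, Q, r) with kind in {"positive", "negative"}
    return [(P, Q, r) for kind, P, Q, r in PUP
            if kind == "positive" and r > threshold]

print(association_rules([("positive", "Gr", "Sc", 0.9),
                         ("negative", "Un", "Ex", 0.1)]))  # -> [('Gr', 'Sc', 0.9)]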
5 Conclusion
In order to further improve the KDD process in terms of both the degree of automation achieved and the types of knowledge discovered, we propose a theoretical framework for (statistical) knowledge discovery in databases based on Bacchus' probability logic. Within this framework, a formal definition of "previously unknown, potentially useful pattern" is proposed, which provides a sound solution to several deficiencies of current KDD systems with respect to novelty and usefulness judgment. Furthermore, a logic induction operator that describes a standard knowledge discovery process is proposed, through which all potentially useful patterns can be discovered. This logic induction operator provides a formal characterization of the "discovery" process itself.
References
1. Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.: Knowledge Discovery in Databases: An Overview. In: Knowledge Discovery in Databases. AAAI Press/MIT Press, Cambridge, MA (1991) 1–30
2. Bacchus, F.: Representing and Reasoning with Probabilistic Knowledge. MIT Press, Cambridge, MA (1990)
3. Bacchus, F.: Lp, a Logic for Representing and Reasoning with Statistical Knowledge. Computational Intelligence 6 (1990) 209–231
4. Yao, Y.Y.: On Modeling Data Mining with Granular Computing. Proc. of the 25th Annual International Computer Software and Applications Conference (COMPSAC 2001) 638–643
5. Lin, T.Y., Louie, E.: Modeling the Real World for Data Mining: Granular Computing Approach. Proc. of IFSA/NAFIPS (2001)
6. Pawlak, Z.: Rough Sets, Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers (1991)
An Improved Branch & Bound Algorithm in Feature Selection

Zhenxiao Wang, Jie Yang, and Guozheng Li

Institute of Pattern Recognition & Image Processing, Shanghai Jiaotong University, Shanghai, China, 200030
[email protected]

Abstract. The Branch & Bound (B&B) algorithm is a globally optimal feature selection method. The high computational complexity of this algorithm is a well-known problem. The B&B algorithm constructs a search tree and then searches for the optimal feature subset in the tree. Previous work on the B&B algorithm focused on how to simplify the search tree in order to reduce the search complexity, and several improvements already exist. A detailed analysis of the basic B&B algorithm and the existing improvements is given under a common framework in which all the algorithms are compared. Based on this analysis, an improved B&B algorithm, BBPP+, is proposed. Experimental comparison shows that BBPP+ performs best.

Keywords: Branch & Bound, Feature Selection, Minimum Solution Tree, Global Optimum, Machine Learning, Data Mining, Pattern Recognition
1 Introduction

In the fields of data mining, machine learning and statistical pattern recognition, feature selection is often an essential part of the induction algorithm. Given a data set with labeled samples, the task of supervised learning is to design a classifier which assigns an unseen sample to its correct class. Taking feature selection as a preprocessing procedure prior to classifier design provides many benefits: speeding up the training process, improving the data quality and thereby the efficiency of the induction algorithm, avoiding the risk of overfitting, and increasing the comprehensibility of the learned results. Feature selection is a traditional problem with a long history of research in pattern recognition, and many efficient algorithms have been proposed. M. Dash and H. Liu made a good survey of most of the existing algorithms based on the framework proposed in their paper [3]. P. Pudil also did a lot of excellent work in summarizing feature selection algorithms [5], revising them, and putting forward new ones. This short paper focuses on a globally optimal feature subset selection algorithm, namely the Branch & Bound (B&B) algorithm. First introduced by P.M. Narendra and K. Fukunaga [4] in 1977, the B&B algorithm has been studied by many other researchers, and some modifications have been made to improve its performance. B. Yu and B. Yuan suggested a more compact search tree, the minimum solution tree [1], which largely simplifies the search space and thereby reduces the computational complexity. P. Somol and P. Pudil's FBB (Fast Branch & Bound) [6] and BBPP (Branch & Bound Algorithm with Partial Prediction) [7] algorithms are two recent advances of the B&B.
The paper is organized as follows. In section 2, we introduce the basics of the B&B algorithm and give the description of the algorithm. Previous work on the B&B algorithm is briefly introduced in section 3. Then in section 4 we describe how the BBPP algorithm could be modified to further improve the performance, and the detailed description of the revised BBPP, called BBPP+, is given. Finally we present the experimental results, in which all the different algorithms are compared on their computational complexities.
2 Basics of the B&B Algorithm

The B&B algorithm is the only feature subset selection method that guarantees the global optimum without exhaustively exploring the entire search space. The feature selection problem which the B&B deals with is defined as follows: among all the original D features, select d (d < D, given beforehand) features that form an optimal feature subset under a given evaluation criterion. It is required that the adopted evaluation criterion fulfills the monotonicity property, defined as:
J(X_s) ≥ J(X_t)  for any  X_s ⊇ X_t
Here X represents a feature subset, and J(X) is the adopted evaluation function. The monotonicity property is the reason why many feature subset evaluations can be avoided without losing the global optimum. The B&B algorithm is thus a complete, but not exhaustive, feature selection method.

2.1 Description of the Classical B&B Algorithm

The B&B algorithm is basically a search algorithm. The search space is structured as a tree, called a search tree, which is constructed top-down dynamically while the B&B algorithm runs. The search process begins at the root, the complete feature set, and continues by eliminating one feature at a time to produce successors. When leaf nodes are reached, the bound (the currently known best evaluation value) is updated. Whenever a node's evaluation value is found to be less than or equal to the bound, the sub-tree rooted at this node is pruned, because by the monotonicity property no better target subset can exist in that sub-tree; this saves some computations. Taking D = 5, d = 2 as an example, the topology of a search tree is built as in Fig. 1.
Fig. 1. A search tree with D=5, d=2
The tree construction process is illustrated by the dashed arrows. As to the construction of the search tree, [4] and [1] give detailed explanations. Here we just point out several important facts: the total number of nodes in the search tree is C_{D+1}^{d+1}, the number of leaf nodes is C_D^d, and the number of one-degree nodes (those that have only one child in the search tree) is C_{D-1}^{d+1}.

The following is the description of the B&B algorithm. Some of the symbols and functions used are: D (the size of the complete feature set), d (the target number of features to select), X (the current node to be expanded), X→q (the number of successors of node X), XD (the complete feature set, which includes all the features), X* (the current optimal feature subset), bound (the evaluation value of the current optimal subset X*), J(X) (the adopted evaluation function), and Num_features(X) (the size of subset X). The procedure of the B&B algorithm is as follows.

1. Initialization: let X = XD, X→q = d+1, bound = 0;
2. ExpandNode(X);
3. Put out X* as the optimal feature subset.

ExpandNode(X) is a recursive procedure. It begins at the root node and grows the search tree top-down. ExpandNode(X) consists of the following steps:

a) If X is a leaf node, turn to step e). Else, if J(X) > bound then continue to step b), and if J(X) ≤ bound then turn to step f).
b) Let n = Num_features(X), and each time remove one feature of subset X, so that n child nodes are generated. Name them X1, X2, ..., Xn.
c) Evaluate all the child nodes to get J(Xi), i = 1, 2, ..., n. Sort them in ascending order: J(X_{j1}) ≤ J(X_{j2}) ≤ ... ≤ J(X_{jn}).
d) Let p = X→q. Take the first p nodes from the above sequence and make them the successors of node X. For each of the p successors X_{ji} (i = p, p−1, ..., 1; pay attention to the order here) let X_{ji}→q = p − i + 1 and then perform ExpandNode(X_{ji}). After all p successors have been expanded in the given order, turn to step f).
e) If X is better than X* (J(X) > bound) then let bound = J(X), X* = X.
f) Return.
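The listing above leaves implicit the bookkeeping of which features a subtree may still remove; one standard way to realize it is to pass down an explicit candidate set, here called avail, so that every d-subset is generated exactly once. A compact Python sketch under that assumption (toy monotone criterion; not the paper's code):

def branch_and_bound(features, d, J):
    best = {"X": None, "J": 0.0}

    def expand(X, avail, q):
        if len(X) == d:                          # leaf node: update the bound
            if J(X) > best["J"]:
                best["X"], best["J"] = X, J(X)
            return
        if J(X) <= best["J"]:                    # step a): prune the subtree
            return
        ranked = sorted(avail, key=lambda f: J(X - {f}))   # steps b), c)
        succ = ranked[:q]                        # step d): keep q successors
        for i in range(q, 0, -1):                # rightmost expanded first
            expand(X - {succ[i - 1]}, avail - set(succ[:i]), q - i + 1)

    X0 = frozenset(features)
    expand(X0, set(X0), d + 1)
    return best["X"], best["J"]

# Toy criterion: sum of fixed per-feature weights (monotone by construction)
w = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3, 4: 0.05}
print(branch_and_bound(w, 2, lambda X: sum(w[f] for f in X)))
# -> (frozenset({1, 3}), ~0.7), the global optimum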
2.2 The Imperfection of the Classical B&B

For the feature selection problem defined as selecting d features from the feature set of size D, an exhaustive search method must compute the evaluation function for all the C_D^d possible target subsets. The search space of the B&B algorithm is a search tree as shown in Fig. 1, which has C_{D+1}^{d+1} feature subset nodes. Only in very rare cases does the B&B have to evaluate all the nodes in the search tree, but the principle of the B&B algorithm does not guarantee a lower number of evaluation computations than that of exhaustive search. Furthermore, as suggested in [6], in certain situations
weak performance may result from the fact that, nearer to the root: a) evaluation computation is usually more time-consuming because of larger subsets, and b) cut-offs are less frequent because of higher evaluation function values. Most of the search time is spent near the root, while what ultimately matters is the evaluation values of the leaf nodes. So the classical B&B leaves much room for improvement in how to guarantee a considerable number of cut-offs and how to direct the search procedure more rapidly toward the target subsets.
3 Previous Work on Improving the B&B

Several researchers have put forward modifications to the classical B&B algorithm, including B. Yu and B. Yuan's minimum solution tree [1], and P. Somol and P. Pudil's FBB (Fast Branch & Bound) [6] and BBPP (Branch & Bound Algorithm with Partial Prediction) [7], the latter two both based on the idea of evaluation prediction.

3.1 Minimum Solution Tree
There are C_{D+1}^{d+1} nodes in the classical B&B search tree, among which C_{D-1}^{d+1} nodes are one-degree. As illustrated in Fig. 2, the hollow nodes are one-degree nodes.
Fig. 2. Search Tree with 1-degree Nodes
Fig. 3. Search Tree without 1-degree Nodes
Their paper [1] explained that the rightmost one-degree nodes, C_{D-1}^{d+1} in total, can be "short traversed" to become leaf nodes. The "short traversed" search tree, called the minimum solution tree, is shown in Fig. 3. Experiments show that the "short traversed" nodes account for a considerable proportion of the nodes that are not pruned, though a low proportion of all the C_{D+1}^{d+1} nodes. This fact indicates that a large proportion of computations will be saved by searching the minimum solution tree instead of the original search tree.

3.2 FBB & BBPP Algorithms

These two algorithms are both based on the same mechanism: approximating values of the evaluation function by predictions. For every single feature fi, a quantity A_{fi} is used to estimate the feature's contribution to the evaluation value. A_{fi} is used to predict the evaluation value of a feature subset after the feature fi is removed from its parent, and is
dynamically updated. A_{fi} can be interpreted as the average decrease in evaluation value caused by the removal of feature fi. The equation to compute A_{fi} is:

A_{fi} = (A_{fi} · S_{fi} + J(X) − J(X − {fi})) / (S_{fi} + 1)    (1)

and let

S_{fi} = S_{fi} + 1    (2)

where S_{fi} is a counter which records the number of times A_{fi} has been updated. The rule of prediction is:

J_p(X − {fi}) = J(X) − γ · A_{fi}    (3)
For detailed descriptions of the two algorithms, the reader is referred to papers [6], [7]. High performance is reported in those papers, and is further demonstrated by our comparison experiment.
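The bookkeeping of equations (1)-(3) is small enough to sketch directly; gamma is the optimism multiplier of [6], [7], and all names are illustrative:

class Predictor:
    def __init__(self, features, gamma=1.0):
        self.A = {f: 0.0 for f in features}   # averaged J-decrease per feature
        self.S = {f: 0 for f in features}     # update counters
        self.gamma = gamma

    def update(self, f, J_X, J_X_minus_f):    # equations (1) and (2)
        self.A[f] = (self.A[f] * self.S[f] + (J_X - J_X_minus_f)) / (self.S[f] + 1)
        self.S[f] += 1

    def predict(self, f, J_X):                # equation (3)
        return J_X - self.gamma * self.A[f]

p = Predictor(range(5))
p.update(2, J_X=1.0, J_X_minus_f=0.75)        # one observed true decrease
print(p.predict(2, J_X=0.5))                  # -> 0.25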
4 An Improvement to the BBPP

After a thorough analysis of all the versions of the B&B algorithm mentioned above, we found that there is still some room for improvement. Especially for the BBPP algorithm, we believe that higher efficiency can be achieved by revising two steps; this assumption is confirmed by our experiment. The search tree we adopt here is the minimum solution tree [1]. The two points of modification are:
1) After selecting the children with the smallest predicted evaluation values to be the successors of a tree node, the BBPP algorithm computes the true evaluation values of all the successors. In fact, the rightmost successor can be saved from evaluation computation. In a minimum solution tree, the rightmost node is "short traversed" to a leaf node, so we can directly compute the evaluation function of the leaf node without evaluating the rightmost successor. If we do so, each time a node is expanded we omit one child-node evaluation.
2) In the BBPP algorithm, when choosing successors for a tree node, all possible child nodes are ordered by their predicted evaluation values, but the chosen successors are not reordered after their true evaluation values are obtained. Based on the heuristic that selecting the successors with the smallest evaluation values and putting them in ascending order from left to right in the search tree leads to more opportunities for cut-offs, and thereby lower overall computational complexity, we can expect better performance of the search after reordering the successors according to their true evaluation values.

After making the two modifications, we reorganize the BBPP algorithm and call it BBPP+. The description of BBPP+ is as follows (the symbols and functions used here are as in Section 2.1).

1. Initialization: let X = XD, X→q = d+1, bound = 0;
2. ExpandNode(X);
3. Put out X* as the optimal feature subset.

ExpandNode(X) is a recursive procedure. It begins at the root node and grows the search tree top-down. ExpandNode(X) consists of the following steps:

a) If X is a leaf node, turn to step e). Else, if J(X) > bound then continue to step b), and if J(X) ≤ bound then turn to step f).
b) Let n = Num_features(X), and each time remove one feature of subset X, so that n child nodes are generated. Name them X1, X2, ..., Xn.
c) Predict the evaluation values J(Xi), i = 1, 2, ..., n, according to equation (3). Sort them in ascending order: J(X_{j1}) ≤ J(X_{j2}) ≤ ... ≤ J(X_{jn}).
d) Let p = X→q. Take the first p nodes from the above sequence as the successors of node X. Short traverse the rightmost node X_{jp} to its leaf node Xt, compute the evaluation function J(Xt), and if J(Xt) > bound, let bound = J(Xt), X* = Xt (update the bound and X*). Compute the true evaluation values of the remaining p−1 successors J(X_{ji}) (i = p−1, p−2, ..., 1), sort them in ascending order as J(X_{k1}) ≤ J(X_{k2}) ≤ ... ≤ J(X_{k(p−1)}), and update the values of A_{fi} and S_{fi} using equations (1) and (2). For each of the p−1 successors X_{ki} (i = p−1, p−2, ..., 1) let X_{ki}→q = p − i + 1 and then perform ExpandNode(X_{ki}). After all p−1 successors have been expanded in the given order, turn to step f).
e) If X is better than X* (J(X) > bound) then let bound = J(X), X* = X.
f) Return.

Each time a node is expanded, the BBPP+ algorithm saves one computation of the evaluation function, so the amount of reduction in computations is equal to the number of actually expanded non-leaf nodes in the minimum solution tree. If there is no cut-off during the search, the amount of reduction is the total number of nodes in the tree minus the number of leaf nodes: (C_{D+1}^{d+1} − C_{D-1}^{d+1}) − C_D^d. This means that when the number of evaluation computations in the BBPP+ equals the very size of the tree, C_{D+1}^{d+1} − C_{D-1}^{d+1}, the BBPP must compute 2(C_{D+1}^{d+1} − C_{D-1}^{d+1}) − C_D^d evaluation functions. We must
point out that the numbers above are valid only when there is no cut-off (which is rarely the case in practice), and the true values in practice are smaller than these numbers. What we are emphasizing here is the possibility of the BBPP+ performing fewer evaluation computations than the BBPP. The BBPP+ also depends heavily on the prediction scheme, so it is prone to fail when the approximation of the evaluation value deviates too much from the true value. The prediction mechanism we adopt is exactly the one used in [6], [7]. For a different evaluation criterion, a different prediction method may be more appropriate, so a deeper understanding of the evaluation criterion will help make prediction-based B&B algorithms more efficient.
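Putting the two modifications together, a runnable sketch of BBPP+ might look as follows (the prediction statistics of equations (1)-(3) are inlined; true evaluation values are recomputed rather than cached for brevity, and the avail bookkeeping is the same assumption as in the classical sketch):

def bbpp_plus(features, d, J, gamma=1.0):
    A = {f: 0.0 for f in features}           # averaged J-decrease, eq. (1)
    S = {f: 0 for f in features}             # update counters, eq. (2)
    best = {"X": None, "J": 0.0}

    def learn(f, jX, jchild):
        A[f] = (A[f] * S[f] + (jX - jchild)) / (S[f] + 1)
        S[f] += 1

    def expand(X, avail, q):
        jX = J(X)
        if jX <= best["J"]:
            return                            # prune the subtree
        ranked = sorted(avail, key=lambda f: jX - gamma * A[f])  # eq. (3)
        succ = ranked[:q]                     # successors chosen by prediction
        # Modification 1: short-traverse the rightmost successor straight
        # to its leaf and evaluate only the leaf.
        leaf = X - (avail - set(succ[:-1]))
        if J(leaf) > best["J"]:
            best["X"], best["J"] = leaf, J(leaf)
        # Modification 2: true-evaluate the remaining q-1 successors,
        # update the prediction statistics, and re-sort by true values.
        rest = succ[:-1]
        for f in rest:
            learn(f, jX, J(X - {f}))
        rest.sort(key=lambda f: J(X - {f}))
        for i in range(len(rest), 0, -1):     # rightmost expanded first
            child = X - {rest[i - 1]}
            if len(child) == d:               # the children are leaves here
                if J(child) > best["J"]:
                    best["X"], best["J"] = child, J(child)
            else:
                expand(child, avail - set(rest[:i]), q - i + 1)

    X0 = frozenset(features)
    expand(X0, set(X0), d + 1)
    return best["X"], best["J"]

w = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3, 4: 0.05}
print(bbpp_plus(w, 2, lambda X: sum(w[f] for f in X)))  # -> (frozenset({1, 3}), ~0.7)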
5 Experimental Results

We implemented all the versions of the B&B algorithm involved: the CBB (classical B&B), the MBB (minimum solution tree B&B), the F_CBB (Fast B&B based on the classical search tree), the F_MBB (Fast B&B based on the minimum solution tree), the BBPP (Branch & Bound Algorithm with Partial Prediction), and the BBPP+ (the modified BBPP). The dataset we use is the same as in [6], the Wisconsin Diagnostic Breast Cancer data obtained via the well-known UCI repository [8]. The dataset is 30-dimensional with 2 classes, containing 357 samples labeled "benign" and 212 labeled "malignant". As the evaluation criterion we adopt the Bhattacharyya distance [2]. The experiments were conducted on a PC with a Pentium CPU and 256 MB of RAM. The BBPP and BBPP+ algorithms are implemented with the minimum solution tree as the search tree. Comparisons are drawn on the number of evaluation computations and the total running time. The different performance of the B&B algorithms is illustrated in Fig. 4.
Fig. 4. Experiment results: (a) number of evaluation computations; (b) total running time
Both the number of evaluation computations and the running time in seconds for the CBB are too large to fit into the graphs. The reported performance of the minimum solution tree [1] is apparent in the drops between the CBB curves and the MBB curves, and between the F_CBB curves and the F_MBB curves. The Fast B&B algorithm also shows its strength, as we can see from the two pairs of comparisons: from the CBB to the F_CBB, and from the MBB to the F_MBB. The BBPP further increases the efficiency, and the BBPP+ is the best of all. We can see an inconsistency between graph (a) and graph (b): the F_CBB has fewer evaluation computations than the MBB in graph (a), while in graph (b) it is more time consuming for some values of d. This inconsistency is caused by the overhead of the prediction: prediction itself needs a certain amount of computation. Though the F_CBB evaluates fewer subsets than the MBB, the computation cost of prediction makes it spend more time.
6 Conclusion

Based on a detailed analysis of the B&B algorithm and its improved versions, we further modify the BBPP (Branch & Bound Algorithm with Partial Prediction) algorithm [7]. The modified version is called the BBPP+ algorithm. After a thorough exploration of the BBPP algorithm we indicate the two points where there is still room for improvement, and give a detailed description of the modified algorithm. Programs were designed to compare the performance of all the algorithms. The experimental results reproduce the reported advantages of the previous work on the B&B algorithm, and show the BBPP+ algorithm's best performance among all. More refined prediction mechanisms will be very useful in further raising the performance.
References
[1] B. Yu, B. Yuan, A more efficient branch and bound algorithm for feature selection. Pattern Recognition, 26:883–889, 1993.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition. Academic Press, Inc., 1990.
[3] M. Dash, H. Liu, Feature Selection for Classification. Intelligent Data Analysis – An International Journal, Elsevier, Vol. 1, No. 3, 1997.
[4] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 26:917–922, September 1977.
[5] P. Pudil, J. Novovicová, P. Somol, Feature selection toolbox software package. Pattern Recognition Letters, No. 4, February 2002, pp. 487–492.
[6] P. Somol, P. Pudil, F.J. Ferri, J. Kittler, Fast Branch & Bound Algorithm for Feature Selection. Invited paper, Proc. of the 4th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, 2000, pp. 646–651.
[7] P. Somol, P. Pudil, J. Grim, Branch & Bound Algorithm with Partial Prediction for Use with Recursive and Non-Recursive Criterion Forms. Int. Conf. on Advances in Pattern Recognition, Rio de Janeiro, 2001.
[8] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
Classification of Caenorhabditis Elegans Behavioural Phenotypes Using an Improved Binarization Method

Won Nah and Joong-Hwan Baek

School of Electronics, Telecommunication & Computer Engineering, Hankuk Aviation University, Koyang City, South Korea
{nahwon, jhbaek}@mail.hangkong.ac.kr
Abstract. Because it is a simple model organism, Caenorhabditis (C.) elegans is often used in genetic analysis in neuroscience. The classification and analysis of C. elegans behaviour was previously performed subjectively, so the classification results were unreliable and often imprecise. For this reason, automated video capture and analysis systems have appeared. In this paper, we propose an improved binarization method that uses a hole detection algorithm. With our method we can preserve holes while removing noise, so that the accuracy of the extracted features is improved. In order to improve the classification success rate, we add new feature sets to the features of previous work. We also add 3 more mutant types of worms to the previous 6 types and analyze their behavioural characteristics.
1 Introduction

Understanding the relationship between genes and the behaviour of C. elegans is a fundamental problem in neuroscience. An important application of quantitative image analysis to C. elegans neurobiology is to investigate molecular mechanisms of drug response. In [1], uncoordinated ('Unc') mutants are classified by a human observer into a number of descriptive categories, including 'kinky', 'coiled', 'shrinking', 'loopy', 'slow', and 'sluggish' animals. An experienced observer is able to subjectively distinguish worm types, but requirements for objective classification are increasing. For this reason, automated classification systems using machine vision have appeared. In the previous works [2],[3], classification was automated using the patterns from reliable egg-laying event timing data. In the previous work [4], a closing morphological operation was used to remove noise. However, this method causes problems in binarization: holes are filled up when the worm is coiled tightly around a small hole, so it is difficult to get reliable features from the binary image obtained using the closing operation. In this paper, we propose a new binarization method, which recognizes the occurrence of holes and preserves them until the binarization is finished. Contrary to the previous work, we perform the thresholding and median filtering first, then detect the holes using the worm's thickness. Finally, we remove unwanted holes, leaving the actual holes. In [4], the wild type and its five mutant types (goa-1, nic-1, unc-36, unc-38, and egl-19) were classified using 94 features.
In this paper, in order to improve the classification success rate, we add new feature sets, such as the worm length to MER (Minimum Enclosing Rectangle) width ratio, centroid movement, and omega shape information, to the previous features. We also add 3 more mutant types of worms: unc-2, unc-29, and tph-1. C. elegans locomotion was tracked with a Zeiss Stemi2000-C stereomicroscope mounted with a Cohu high-performance CCD video camera. A computer-controlled tracker (Parker Automation, SMC-1N) was used to keep the worms in the center of the optical field of the stereomicroscope during observation.
2 Improved Binarization Method

Binarization and preprocessing are the most important factors for successful classification and analysis of C. elegans. In this paper, we propose a new binarization method using hole detection. The method is performed in the following 3 steps: thresholding, median filtering, and hole detection.

2.1 Thresholding and Median Filtering

First, we find the background intensity level before deciding the threshold. The background of the snapped image has a nearly constant intensity value, because the light intensity was constant. It is highly unlikely that the four corner points of the image are parts of the worm's body, so we use the maximum value of the four corner points as the background intensity value. To decide the threshold, a 5×5 moving window is scanned over the image, and the mean and standard deviation of the 25 pixels inside the window are computed at every pixel position. A pixel is estimated, experimentally, to be on the worm's body when its intensity value is below about 70% of the background intensity; note that the background intensity level is higher than that of the worm's body. Also, pixels on the worm's body tend to have a larger variance in their intensity values than pixels on the background, so when the standard deviation of the pixels within the window is over 30% of the mean, we also consider the pixel a part of the worm's body. A median filter has a superior effect in removing impulse noise [5]. Median filtering can preserve small holes while removing the impulse noise caused by reflections on the worm's body. In this paper, we use a 9×9 window for median filtering to remove impulse noise in the binary worm image. In a binary image, the median filter is easily realized by comparing the number of '1' pixels in the window with half the window size.
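A sketch of the two tests and the binary median filter described above follows; how the 70% intensity test and the 30% variation test are combined is our reading of the text, so treat the exact rule as an assumption:

def binarize(img):                       # img: 2-D list of grey levels
    h, w = len(img), len(img[0])
    bg = max(img[0][0], img[0][-1], img[-1][0], img[-1][-1])  # corner max
    out = [[0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            win = [img[y + dy][x + dx] for dy in range(-2, 3)
                                        for dx in range(-2, 3)]
            mean = sum(win) / 25
            std = (sum((v - mean) ** 2 for v in win) / 25) ** 0.5
            # body pixels are darker and locally more varied than background
            if img[y][x] < 0.7 * bg or std > 0.3 * mean:
                out[y][x] = 1
    return out

def median_filter(binimg, k=9):          # 9x9 binary median = majority vote
    h, w, r = len(binimg), len(binimg[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            ones = sum(binimg[y + dy][x + dx]
                       for dy in range(-r, r + 1) for dx in range(-r, r + 1))
            out[y][x] = 1 if 2 * ones > k * k else 0
    return out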
2.2 Hole Detection and Noise Removal

Even after applying the median filter to the binary worm image, some noise occasionally remains on the worm's body. To remove the remaining noise, we propose a method that can distinguish between holes and noise. First, we invert the binary image; that is, we convert black pixels to white and white to
black. Then we label the noise or hole regions (but not the worm's body or the background), and the coordinates of the centroid of each region are computed. A hole is created when the worm loops or touches itself, while noise is located on the worm's body. Therefore, we can determine whether a region is a hole or noise by measuring the total thickness of the body enclosing the region. In order to measure the thickness, we define 16 vectors. We traverse and count the number of pixels from the centroid of a region in each vector direction until we reach the background. Then we compute the thickness by multiplying the vector magnitude by the number of pixels traversed. The total thickness is the sum of the thicknesses in two opposite directions. Among the 8 total thicknesses, if the minimum is less than 25, the region is considered noise, because the thickness of the worms used in this work is never larger than 25. If a region is determined to be noise, we fill it with body pixels; otherwise, we preserve the region as a hole. These procedures are repeated for all of the labeled regions. After detecting holes, we remove the remaining noise using a closing morphological operation. Even after the closing operation is performed, objects other than the worm body and holes can still exist; a worm's crawling tracks or eggs can cause such unwanted objects. In order to remove them, we perform labeling [6] and take only the largest object as the worm's body, removing the other labeled objects. After removing the isolated objects, we restore the hole region onto the worm's body. Some examples of binarization are shown in Fig. 1. Note that the hole shown in (d) looks rectangular, which is caused by the closing operation with a 3×3 structuring element.
Fig. 1. (a) C. elegans gray image. (b) After binarization with median filtering. (c) Testing each labeled area except the worm body using the 16 vectors. (d) After hole detection and noise removal.
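The hole-versus-noise decision can be sketched as 16 ray walks from a region's centroid; the labeling convention and rounding scheme below are illustrative, while the limit of 25 is the body-thickness bound quoted in the text:

import math

def is_noise(labels, cy, cx, limit=25):
    # labels: 0 = background, 1 = worm body, 2 = candidate region pixels
    H, W = len(labels), len(labels[0])
    def ray(dy, dx):
        n, y, x = 0, float(cy), float(cx)
        while True:
            iy, ix = int(round(y)), int(round(x))
            if not (0 <= iy < H and 0 <= ix < W) or labels[iy][ix] == 0:
                return n * math.hypot(dy, dx)  # pixels traversed x step length
            y, x, n = y + dy, x + dx, n + 1
    lens = [ray(math.sin(k * math.pi / 8), math.cos(k * math.pi / 8))
            for k in range(16)]                # the 16 directions
    totals = [lens[k] + lens[k + 8] for k in range(8)]   # opposite-ray pairs
    return min(totals) < limit                 # too thin to be worm body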
3 Feature Extraction

After binarization and preprocessing, such as skeletonization, feature extraction is performed. In [4], 3 kinds of features are extracted: large-scale movement, body size, and body posture. Features for the large-scale movement are the global moving distance and the reversal frequencies during several intervals. Features related to body size are
the worm's area, length, thickness, and fatness. Features related to body posture are the eccentricity, the height and width of the MER, the amplitude, and the angle change rate. Then combined features, such as the minimum, maximum, and average of each feature, are computed. A total of 94 features are used in [4]. In order to improve the classification success rate, we add a new feature set to the existing features. We compute the ratio of the worm length to the MER width (LNWDR). This measurement provides information on the worm's straightness or waviness. We also use the moving distance of the normalized centroid (CNTMV). The coordinates of the worm's centroid C = (C_w, C_h) are normalized with the MER width and height. The normalization and moving distance are computed with the following equations:

C_n = (C_w / width, C_h / height),   CNTMV = ||C_{n,t} − C_{n,t−1}||    (1)
We also use the number of frames with omega shape (NUMOMEGA) and the number of times the worm changes from non-omega shape to omega shape (OMEGACHG). These measurements give information about the regularity of the worm's skeleton wave. The omega shape can be detected from the amplitude ratio Ar: if Ar is 0.0, we decide that the worm's posture is omega shaped. The amplitude ratio is evaluated by the following equation:

Ar = min(A, B) / max(A, B)    (2)
Upper and lower amplitudes, and an example of an omega shaped worm are shown in Fig. 2.
Fig. 2. Skeleton amplitudes and an omega-shaped worm: (a) A: lower amplitude, B: upper amplitude; (b) omega shape
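Equation (2) and the omega test translate directly into code; the input convention below (signed skeleton offsets from the MER axis) is an assumption made for illustration:

def amplitude_ratio(offsets):
    """offsets: signed distances of skeleton points from the axis."""
    upper = max((d for d in offsets if d > 0), default=0.0)   # B
    lower = max((-d for d in offsets if d < 0), default=0.0)  # A
    hi, lo = max(upper, lower), min(upper, lower)
    return lo / hi if hi > 0 else 0.0

def is_omega(offsets, tol=1e-9):
    return amplitude_ratio(offsets) <= tol   # whole wave on one side of axis

print(is_omega([0.0, 2.1, 4.0, 3.2, 0.5]))   # bends to one side only -> True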
We also subdivide the interval for measuring the reversals. The number of reversals is measured during 10, 20, 30, 40, 50, and 60 sec. Including the combined features of the new feature set, a total of 23 new features are added to the 94 previous features. The total number of features extracted in this paper is now 117.
4 Classification Using the CART

The CART (Classification and Regression Tree) method builds a binary classification tree from a learning sample [7]. The root node of the tree contains all the training cases, with the worm types mixed together. The goal of the CART is to
successively subdivide the training set using binary splits in such a way that the data associated with the terminal nodes of the tree do not have a mix of worm types; rather, each node should be as pure as possible. In order to measure the impurity of a data set, the Gini index of diversity is applied. A simple rule is to assign the most popular class to each terminal node. To estimate the classification rate, the CART uses 10-fold cross-validation: it splits the learning sample into ten equal parts, creates the tree from nine tenths of the learning sample, and uses the remaining tenth to estimate the error rate of the selected sub-trees. This procedure is repeated until all the subsets have been tested. The 23 new feature variables used for classification with the CART are shown in Table 1 (the 94 previous feature variables are not listed; please refer to [4]).

Table 1. Additional feature variables used in the CART analysis (total of 117 variables)
CART Variable Name | Description
(94 previous features; see [4])
RV20MIN, RV20MAX, RV20AVG | Min, max, average number of reversals in 10 sec
RV40MIN, RV40MAX, RV40AVG | Min, max, average number of reversals in 20 sec
RV60MIN, RV60MAX, RV60AVG | Min, max, average number of reversals in 30 sec
RV100MIN, RV100MAX, RV100AVG | Min, max, average number of reversals in 50 sec
RV120MIN, RV120MAX, RV120AVG | Min, max, average number of reversals in 60 sec
LNWDRMIN, LNWDRMAX, LNWDRAVG | Min, max, average of worm length to MER width ratio
CNTMVMIN, CNTMVMAX, CNTMVAVG | Min, max, average of normalized centroid movement
NUMOMEGA | Number of frames the worm is in omega shape
OMEGACHG | Number of times the worm changes from non-omega shape to omega shape
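For reference, the Gini index of diversity used to score node impurity has the standard form sketched below (this is the textbook formulation, not the authors' implementation):

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(left, right):         # size-weighted impurity of a split
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

node = ["wild"] * 95 + ["egl-19"] * 5
l, r = node[:95], node[95:]               # a perfectly separating split
print(round(gini(node), 3), round(split_impurity(l, r), 3))  # -> 0.095 0.0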
5 Experimental Results

C. elegans locomotion is tracked with a stereomicroscope mounted with a CCD camera. A computer-controlled tracker is used to keep the worms in the center of the optical field of the stereomicroscope during observation. To record the locomotion of a worm, an image frame is snapped every 0.5 seconds for 5 minutes, so the video clip of each worm consists of 600 frames. All of the software for binarization and feature extraction is coded in C++ and runs on a PC with a 1.7 GHz CPU. In this experiment, we use 9 different worm types (wild, goa-1, nic-1, unc-36, unc-38, egl-19, unc-29, unc-2, tph-1). Each worm type has 100 worms, except tph-1, which has 60 worms, so a total of 860 worms are used in this experiment. Primary features are
extracted from each frame after binarization and preprocessing. Then 117 features per worm are computed from the primary features of the 600 frames. Finally, the 860×117 variable set is fed into the CART for analysis and classification. The CART creates a maximal tree and then prunes it back to obtain an optimal one; in our case, the optimal tree has 42 terminal nodes. To reduce the complexity of the tree, we set the node complexity parameter to 0.007. The resulting classification tree, shown in Fig. 3, has only 13 terminal nodes; its root node, containing all N=860 worms, splits on LNMFRMAX <= 1384.475. The total number of worms in each node is indicated below the node, terminal nodes are highlighted, and each terminal node is assigned the worm type that is in the majority there.

Fig. 3. Classification tree when the node complexity parameter is 0.007 (diagram: internal nodes split on variables such as LNMFRMAX, MVHLFAVG, ANCHRMAX, TOTRV, CNTLRMIN, RV80MAX, CNTMVAVG, LPSUM, AREAMAX, LNMFRAVG and MVHLFMAX; the majority types at the 13 terminal nodes cover wild, goa-1, nic-1, unc-36, unc-38, egl-19, unc-2, unc-29 and tph-1)
Cross-validated relative cost is an error measure based on a test sample; the term 'cost' simply means 'misclassification rate' in our application. The cross-validated relative cost versus the number of terminal nodes is shown in Fig. 4. A tree with 42 terminal nodes is optimal, because the cross-validated relative cost becomes worse again as the tree grows further. Table 2 shows the cross-validation classification probability table. The success rates are listed along the diagonal, while the off-diagonal entries represent the misclassification error rates. From this we can see that the wild, goa-1, nic-1 and egl-19 types have relatively high classification success rates compared to the unc (uncoordinated mutant) types. This is due to the fact that unc-36, unc-38, unc-2, unc-29, and tph-1 have similar behavioural characteristics.
Fig. 4. The cross-validated relative cost (y-axis: cost, 0.0 to 1.2) versus the number of terminal nodes (x-axis: 0 to 60)
Table 2. Cross-validation classification probability table (rows: actual type; columns: predicted type)

Actual | Wild | Goa-1 | Nic-1 | Unc-36 | Unc-38 | Egl-19 | Unc-2 | Unc-29 | Tph-1
Wild   | 0.950 | 0.000 | 0.000 | 0.000 | 0.000 | 0.020 | 0.030 | 0.000 | 0.000
Goa-1  | 0.010 | 0.930 | 0.000 | 0.020 | 0.010 | 0.000 | 0.020 | 0.000 | 0.010
Nic-1  | 0.000 | 0.000 | 0.880 | 0.000 | 0.060 | 0.000 | 0.000 | 0.040 | 0.020
Unc-36 | 0.000 | 0.000 | 0.000 | 0.760 | 0.050 | 0.020 | 0.000 | 0.120 | 0.050
Unc-38 | 0.000 | 0.000 | 0.000 | 0.040 | 0.730 | 0.000 | 0.050 | 0.070 | 0.110
Egl-19 | 0.000 | 0.000 | 0.000 | 0.020 | 0.000 | 0.930 | 0.000 | 0.020 | 0.030
Unc-2  | 0.033 | 0.067 | 0.017 | 0.017 | 0.083 | 0.000 | 0.717 | 0.000 | 0.067
Unc-29 | 0.000 | 0.000 | 0.000 | 0.160 | 0.020 | 0.010 | 0.010 | 0.760 | 0.040
Tph-1  | 0.000 | 0.010 | 0.020 | 0.100 | 0.210 | 0.010 | 0.040 | 0.060 | 0.550
6 Conclusions Computer vision methods offer a number of clear advantages over real-time observation for the characterization of behavioral phenotypes. First, these approaches provide a specific, quantitative definition of a particular mutant phenotype, facilitating quantitative comparisons between different mutant strains. Second, a computerized imaging system has the potential to be much more reliable at detecting abnormalities that are subtle or manifested over long time scales. Finally, a
computerized system makes it possible to comprehensively assay multiple aspects of behavior simultaneously, yielding a complex phenotypic signature that can be highly diagnostic of a specific molecular defect. Using our proposed binarization and hole detection methods, we can obtain more precise features than in previous work. From the CART results, we can see that the wild, goa-1, nic-1 and egl-19 types have relatively high classification success rates, but the unc types and tph-1 have low classification probabilities, because the unc types have very similar behaviours. In further work, more discriminative features for the unc types and tph-1 should be extracted.

Acknowledgements. The authors are grateful to Prof. Pamela Cosman and Prof. William Schafer of the University of California, San Diego, for their invaluable advice.
References
1. Hodgkin, J.: Male phenotypes and mating efficiency in Caenorhabditis elegans. Genetics (1983) 43–64
2. Waggoner, L., et al.: Control of behavioral states by serotonin in Caenorhabditis elegans. Neuron (1998) 203–214
3. Zhou, G.T., Schafer, W.R., Schafer, R.W.: A three-state biological point process model and its parameter estimation. IEEE Trans. on Signal Processing (1998) 2698–2707
4. Baek, J.H., et al.: Using machine vision to analyze and classify Caenorhabditis elegans behavioral phenotypes quantitatively. Journal of Neuroscience Methods, Vol. 118 (2002) 9–21
5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall Inc., New Jersey (2002)
6. Jain, R., Kasturi, R., Schunck, B.G.: Machine Vision. McGraw-Hill Inc., New York (1995)
7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall Inc., New York (1984)
Consensus versus Conflicts – Methodology and Applications Ngoc Thanh Nguyen and Janusz Sobecki Department of Information Systems, Wroclaw University of Technology, Poland. {thanh,sobecki}@pwr.wroc.pl
Abstract. This paper is dedicated to applications of consensus methods in solving conflicts as defined by Pawlak [12,13]. The most characteristic feature of the consensus methods surveyed in this paper is that the structures of the opinions of conflict participants are multi-valued and multi-attribute, and that the basis for consensus determination consists of distance functions between tuples. In this work an overview of consensus methods and their applications in the scope of conflict solving is presented.
1 Introduction
Consensus theory arose in social science and has its roots in choice theory. The fundamental problem of choice theory can be formulated as follows: for a given set X (of alternatives, objects) that is a subset of a universe U, the choice relies on selecting a subset of X. This choice can be made on the basis of some criteria; if it is performed in a deterministic way, then repeating it for the same set X should give the same result, so one can say that a choice function exists. The foundational problem of consensus theory differs from that of choice theory in the assumption that the choice result (i.e. the consensus) need not be a subset of the set presented for choice, although naturally it should be a subset of the universe U. Furthermore, a consensus of a set X need not have the same structure as the elements of X, as is assumed in choice theory. At the beginning of consensus research, authors dealt with simple structures of the elements of the universe U, such as linear or partial orders. Later, more complex structures such as n-trees, partitions, hierarchies, etc., were also investigated. Most often, homogeneous structures of the universe's elements are assumed. In this paper an overview of applications of consensus methods to solving conflicts in distributed environments is presented. Section 2 gives a history of consensus theory. In Section 3 the notions referring to conflicts are presented. The proposed methods for solving conflicts using consensus are described in Section 4. Some conclusions are included in Section 5.
2 Consensus Methods – An Overview
In general, consensus theory (similarly to data mining) deals with problems of data analysis in order to extract valuable information. However, consensus
methods differ from data mining methods as regards their aims. The task of data mining is to discover patterns in existing data, which means that a data mining method is used when one is interested in searching for some "cause-effect" relations. On the other hand, consensus methods enable one to determine a version of data which, first, should best represent a set of previously given data and, second, should be a good compromise acceptable to the parties that are in conflict as the authors of the original data. According to Barthelemy et al. [2,3], the problems considered by consensus theory can be classified into the following two classes:
– problems in which a certain, hidden structure is searched for,
– problems in which inconsistent data related to the same subject are unified.
The first class consists of problems of searching for the structure of a complex or internally organized object. This object can be a set of elements, and the searched structure to be determined can be a distance function between these elements. The data used to uncover this structure usually come from experiments or observations, and reflect the structure, though not necessarily in a precise and correct way. The second class consists of problems that appear when the same subject is represented (or voted on) in different ways by experts (or agents, or sites of a distributed system). In such a case a method is desired that makes it possible to deduce from the set of given alternatives the one alternative to be used in further processing. For example, this class contains the well-known alternatives ranking problem [1,15] and the committee election problem [9]. Figure 1 below represents the structure of the general approach used to solve consensus problems from the second class.
Profile X of objects of type T1 → Algorithm for determining consensus → Object o of type T2 (consensus)

Fig. 1. The scheme for consensus determining
As can be noticed, a consensus determined for a particular profile X need not be of the same type as the objects of this profile. An illustrative example is the alternatives ranking problem, in which the objects belonging to a profile are linear orders on a set of alternatives, while the related consensus is a single alternative. Moreover, consensus problems from the second class can be solved with three different approaches, called the axiomatic, constructive, and optimization approaches. Within the axiomatic approach, axioms are used to specify the conditions which should be fulfilled by consensus functions. For the alternatives ranking problem, 7 rational requirements for consensus choice have been defined in [1], called Unrestricted domain (UD), Pareto (P), Independence of irrelevant alternatives (IIA), No dictator (ND), Neutrality (N), Consistency (C) and
the Condorcet criterion (Cc). It turned out, however, that no consensus choice function satisfies all of the above requirements simultaneously. Namely, Arrow [1] proved that if each ranking belonging to a profile consists of more than 2 elements, then every consensus choice function fulfilling the conditions UD, P and IIA fails to satisfy the condition ND. Later, Young and Levenglick [15] showed that the unique choice function satisfying the conditions N, C, Cc, UD and P is the Kemeny median [1], which requires the consensus to be as near as possible to the profile's elements. In work [7] the authors used axioms to determine consensus functions for such structures of objects as semilattices and weak hierarchies; in the majority of cases the most popular function is simply a median. Within the constructive approach, consensus problems are solved at the levels of the microstructure and the macrostructure of a universe U of objects. The microstructure of U is understood as the structure of its elements; such a structure may be, for example, a linear order of a set (in the alternatives ranking problem) or a set of candidates (in the committee election problem). The macrostructure of U is understood as the structure of U as a whole; it may be, for example, a preference relation or a distance (or similarity) function between objects from U. As stated above, in a consensus problem the objects of a profile should have the same structure, but their consensus may have a different structure. In the representation choice problem [8] it is assumed that the structure of each representative is the same as the structure of the previously given profile elements. Most often the macrostructure of the set U is given by a binary relation on U or by a distance function defined on the basis of the microstructure of its particular elements. The following microstructures have been studied in detail: linear orders [1,6]; semilattices [2]; n-trees [5]; ordered partitions and coverings [9]; non-ordered partitions [3]; and weak hierarchies [7]. A large number of works were dedicated to developing heuristics based on the Kemeny median for determining the consensus of ranking collections, which is an NP-complete problem. Within the optimization approach, defining consensus choice functions is often based on optimality rules, which can be classified as global optimality rules, Condorcet's optimality rules, or maximal similarity rules.
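For a tiny profile, the Kemeny median can be computed by brute force, which also makes the complexity remark above tangible; the sketch assumes linear orders given as tuples:

from itertools import combinations, permutations

def tau(r1, r2):   # number of candidate pairs ordered differently
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum((pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
               for a, b in combinations(r1, 2))

def kemeny_median(profile):
    # exhaustive search over all rankings: feasible only for tiny sets
    return min(permutations(profile[0]),
               key=lambda r: sum(tau(r, v) for v in profile))

profile = [("a", "b", "c"), ("a", "b", "c"), ("b", "c", "a")]
print(kemeny_median(profile))     # -> ('a', 'b', 'c')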
3 Conflicts in Distributed Environments
A distributed system, in general, is understood as a set of autonomous units (computers, agents, etc.) connected with one another by a network, which together create an integral environment. The units of a distributed system are called its sites. Information is processed independently in the sites of the system; that is [4], each site together with its available resources (programs, data) is autonomous in this process. For example, a branch of a bank stores and processes information about the account balances and transactions of its clients. It often happens that several sites of a distributed system process common objects, meaning that these sites store and process data about the same objects or events. As reasons for these "duplicates" one can mention:
– The need for data verification: the credibility of the data generated by a single site about some subject may not be high, especially if these data are incomplete and uncertain. Then several sites are used to generate more credible data about this subject;
– The need for data replication: for greater safety of data and quicker access to them, data referring to some subjects are often stored and processed simultaneously in several sites of the system.

Under the above assumption one often has to deal with data conflicts, which rely on the phenomenon that, referring to the same subject, several sites store inconsistent (or even contradictory) information. As an example one can mention a meteorological system whose sites are placed in different regions of a country. The regions occupied by these sites often overlap, so it is possible that they generate different weather forecasts referring to the same region. In this way a conflict arises, and its result is that, in working out a weather forecast for the whole country, the proper forecast for a given region is not known. In conclusion, the reason for conflicts in distributed systems follows from the independence and autonomy of their sites. A conflict takes place if, referring to the same subject, several sites generate inconsistent data. Two questions arise: 1. What is the inconsistency of data? and 2. How can this kind of conflict be solved? Before answering the first question, consider the following example. Let the sites of a distributed meteorological system be placed in different regions of a country. In each site there is an agent whose task is to monitor the weather phenomena in its region and to make forecasts, for example for the next day, on the basis of its observations. We assume that these forecasts refer to the degrees and timestamps of occurrences of rain, snow, sunshine, temperature and wind. Let us consider two sets of data, Z1 and Z2, given by agents a1 and a2 respectively, consisting of data referring to the timestamps of rain in region r on the next day. Assume for example that Z1 = {[3a.m.–5a.m.], [3p.m.–7p.m.]} and Z2 = {[8a.m.–12a.m.]}. Notice that these sets are different; thus we can speak of an inconsistency of the data included in these sets. However, the inconsistency exists only if we assume that the knowledge of the agents is exhaustive, in the sense that, for example, according to the opinion of agent a2 it will be raining the next day from 8a.m. to 12a.m. and will not be raining for the rest of the day. If, however, the rest of the day is treated as the agent's ignorance (that is, it does not know whether it will be raining or not), then one can say that the data given by these agents are not inconsistent, but complement one another. From the above example it follows that the existence of data conflicts depends on the interpretation of the data. Generally, one can distinguish the following 3 components of a conflict:
– Conflict body: specifies the direct participants of the conflict.
– Conflict subject: specifies whom (or what) the conflict refers to and its topic.
– Conflict content: specifies the opinions of the participants on the conflict topic.
Referring to conflict structures, one can distinguish 2 kinds of conflicts of data semantics:
1. One-value conflicts: only one elementary value is needed to represent the opinion of a participant on a given subject.
2. Multi-value conflicts: a participant needs more than one elementary value to represent his opinion.
The simplest conflict takes place when two bodies have different opinions on the same subject. In works [12,13] Pawlak specifies the following elements of a conflict: a set of agents, a set of issues, and a set of opinions of these agents on these issues. The agents and the issues are related to one another in some social or political context. Each agent has three possibilities for presenting his opinion on each issue: (+) yes, (−) no, and (0) neutral. We then say that a conflict takes place if there are at least two agents whose opinions on some issue differ. In Pawlak's approach the body of a conflict is a set of agents, the subject is a set of contentious issues, and the content is a collection of tuples representing the participants' opinions. Information system tools [14] seem very well suited to representing conflicts. One should pay attention to the difference between the semantics of Pawlak's neutral opinion and the semantics of an agent's uncertainty defined in this model: neutrality, which most often appears in voting, need not mean uncertainty, while the latter notion represents the fact that an agent is not competent to give an opinion on a given subject.
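Pawlak's representation is easy to sketch as an agents-by-issues table of opinions in {+1, -1, 0}; the conflict function phi below (0 for equal opinions, 1 for opposite signs, 0.5 when exactly one side is neutral) follows Pawlak's formulation, though the exact handling of the neutral cases varies in the literature, so treat the coefficients as illustrative:

opinions = {                    # agent -> opinion per issue
    "a1": [+1, -1,  0],
    "a2": [+1, +1, -1],
    "a3": [-1, +1, -1],
}

def phi(u, v):
    if u == v:
        return 0.0              # agreement
    if u * v == -1:
        return 1.0              # opposite opinions
    return 0.5                  # one of the two opinions is neutral

def conflict_degree(ag1, ag2):  # averaged disagreement over all issues
    a, b = opinions[ag1], opinions[ag2]
    return sum(phi(u, v) for u, v in zip(a, b)) / len(a)

print(conflict_degree("a1", "a3"))   # -> (1 + 1 + 0.5) / 3 = 0.833...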
4 The Roles of Consensus in Solving Conflicts
Consensus methods were used for conflict solving as early as ancient Greece. Consensus has usually been understood as general agreement in situations where some bodies have not agreed on some matter. In early periods of (political or social) democracy, consensus was a useful tool for solving conflicts ("(...) consensus — that is general agreement in matters of opinions or testimony — has been a powerful tool for political or social change (...)" [5]), and it was most often determined by the majority rule. We now analyze what functions a consensus should fulfill in solving conflicts in distributed environments. Before the analysis we should consider what is represented by the conflict content (i.e. the opinions of the conflict participants). We assume that the opinions included in the conflict content represent the unknown solution of some problem. The following two cases may take place:
1. The solution is independent of the opinions of the conflict participants.
2. The solution is dependent on the opinions of the conflict participants.
In the first case, independence means that the solution of the problem exists but is not known to the conflict participants. The reasons for this phenomenon may follow from many aspects, among others from the ignorance of the conflict participants or from a random characteristic of the solution which makes it impossible to calculate in a deterministic way. Thus the content of the solution is independent of the conflict content, and the conflict
participations for some interest have to “guess” it. In this case their solutions have to reflect the proper solution but it is not known if in a valid and complete way. As an example of this kind of conflicts we can consider different forecasts given by meteorological stations referring to the same region for a period of time. The problem is then relied on determining the proper forecast which is unambiguous and really known only when the time comes, and is independent from given forecasts. In the second case this is the opinions of conflict participants, which decide about the solution. As the example, let us consider votes at an election. The result of the election is determined only on the basis of these votes. In general this case has a social or political character and the diversity between opinions of the participants most often follow from differences of choice criteria or their hierarchy. In both cases the natural solution of data conflict relies on determining a version of data on the basis of given versions. This final version should satisfy the following conditions: – It should best reflect the given versions, and – It should be a good compromise which could be acceptable by the conflict participants. The first condition is rather more suitable for the first case presented above because the versions given by the conflict participations reflect the “hidden” and independent solution but it is not known if in what degree. Thus in advance each of them is treated as partially valid and partially invalid (what its part is valid and what its part is invalid — it is not known). The degree in which an opinion is treated as valid is the same for each opinion2 . This degree may not be equal to 100%. The reason for which all the opinions should be taken into account is that it is not known how large is the degree. It is known only to be greater than 0 and smaller than 100%. In this way the consensus should best reflect these opinions. In other words, it should best represent them. The second condition refers to the second case in which the problem solution is dependent on the opinions of conflict participants. Thus consensus not only should best represent the opinions but also should reflect them in the same degree (with the assumption that each of them is treated in the same way). It should be “acceptable compromise” what means that any of opinions should neither be “harmed” nor “favored”. Consider the following example: From a set of candidates (denoted by symbols A, B, C, . . . ) 3 voters have to choose a committee (as a subset of the candidates’ set). In this aim each of voter votes on such a committee which in his opinion is the best one. Assume that the votes are the following: {A, B, C}, {A, B, C} and {D}. Let the distance between 2 sets of candidates is equal to the cardinality of their symmetrical difference. If the consensus choice is made only by the first condition then committee {A, B, C} should be determined because the sum of distances between this one and the all the votes is minimal. However one can note that it prefers the first 2 votes while 2
2 However, some weights may be introduced to distinguish the credibility of the participants and, as a consequence, the credibility of their opinions.
totally ignoring the third (the distances from this committee to the votes are 0, 0 and 4, respectively). If instead we take committee {A, B, C, D} as the consensus, the distances are 1, 1 and 3. In this case the consensus is neither too far from the votes nor does it "harm" any of them. It has been proved that these two conditions in general may not be satisfied simultaneously [10]. It has also been shown that the choice according to the criterion of minimizing the sum of squared distances between the consensus and the profile's elements gives a consensus more uniform than the consensus chosen by minimizing the sum of distances. Therefore, the criterion of the minimal sum of squared distances is also very important (a small computational sketch of this committee example is given at the end of this section). However, this criterion often generates computationally complex (NP-hard) problems, which demand working out heuristic algorithms. In [8,9,10,11] a methodology for consensus choice and its applications in solving conflicts in distributed systems is presented. It can be partitioned into 2 parts. In the first part, general consensus methods which may effectively serve for solving multi-value conflicts are worked out. To this aim, a consensus system which enables describing multi-value and multi-attribute conflicts is defined and analyzed (it is assumed that the attributes of this system are multi-valued). Next, the structures of the tuples representing the contents of conflicts are defined, together with distance functions between these tuples. Finally, the consensus and the postulates for its choice are defined and analyzed, and for the defined structures algorithms for consensus determination are worked out. Besides, the problems connected with susceptibility to consensus and the possibility of consensus modification are also investigated. The second part concerns varied applications of consensus methods in solving different kinds of conflicts which often occur in distributed systems. The following conflict solutions are presented: reconciling inconsistent temporal data; solving conflicts of the states of agents' knowledge about the same real world; determining the representation of expert information; creating a uniform version of a faulty situation in a distributed system; resolving the consistency of replicated data; and determining the optimal interface for user interaction in universal access systems. An additional element of this work is the description of the multiagent system AGWI aiding information retrieval and reconciliation on the Web, implemented on the IBM Aglets platform. The consensus choice methodology was also applied to adaptive user interface construction for multimodal web-based information systems [11]. The user population of such systems is highly diverse, which results in different information needs and interaction preferences. To meet these requirements, many web-based systems apply interface personalization mechanisms. The personalization process is usually associated with collecting user and usage data that can be used to model the user. Quite a few users, however, are reluctant to personalize their interfaces; in these cases adaptive methods can be applied.
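The committee example above can be made concrete. The following is a small, hypothetical Python sketch (not from the cited works) that enumerates all committees and compares the two choice criteria; note that in this tiny profile the squared-distance criterion has several tied minimizers, {A, B, C, D} among them.

```python
from itertools import combinations

votes = [{"A", "B", "C"}, {"A", "B", "C"}, {"D"}]
candidates = "ABCD"

def dist(x, y):
    return len(x ^ y)  # cardinality of the symmetric difference

# Enumerate every possible committee (every subset of the candidate set).
committees = [set(c) for r in range(len(candidates) + 1)
              for c in combinations(candidates, r)]

sum_cost = lambda c: sum(dist(c, v) for v in votes)      # first criterion
sq_cost = lambda c: sum(dist(c, v) ** 2 for v in votes)  # second criterion

print(min(committees, key=sum_cost))  # {'A', 'B', 'C'}: cost 0 + 0 + 4 = 4
best = min(map(sq_cost, committees))
# The minimizers include {'A', 'B', 'C', 'D'} with cost 1 + 1 + 9 = 11
print([c for c in committees if sq_cost(c) == best])
```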
5 Conclusions
This paper has surveyed results worked out by researchers (among others the authors of this work) on applications of consensus methods to solving conflicts in distributed environments. It is known that relational structures for conflict representation are useful and effective, but in many practical cases they are not sufficient. Therefore, there is a need to investigate other structures, for example logical or object-oriented ones. Besides, a solid foundation for conflict theory also needs to be worked out.
References
1. Arrow K.J.: Social Choice and Individual Values. Wiley, New York (1963)
2. Barthelemy J.P., Janowitz M.F.: A Formal Theory of Consensus. SIAM J. Discrete Math. 4 (1991) 305–322
3. Barthelemy J.P., Leclerc B.: The Median Procedure for Partitions. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 19 (1995) 3–33
4. Coulouris G., Dollimore J., Kindberg T.: Distributed Systems, Concepts and Design. Addison-Wesley (1996)
5. Day W.H.E.: Consensus Methods as Tools for Data Analysis. In: Bock, H.H. (ed.): Classification and Related Methods for Data Analysis. North-Holland (1988) 312–324
6. Fishburn P.C.: Condorcet Social Choice Functions. SIAM J. Appl. Math. 33 (1977) 469–489
7. McMorris F.R., Powers R.C.: The Median Function on Weak Hierarchies. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 37 (1997) 265–269
8. Nguyen N.T.: Using Distance Functions to Solve Representation Choice Problems. Fundamenta Informaticae 48(4) (2001) 295–314
9. Nguyen N.T.: Consensus Choice Methods and their Application to Solving Conflicts in Distributed Systems. Wroclaw University of Technology Press (2002) (in Polish)
10. Nguyen N.T.: Consensus System for Solving Conflicts in Distributed Systems. Journal of Information Sciences 147 (2002) 91–122
11. Nguyen N.T., Sobecki J.: Using Consensus Methods to Construct Adaptive Interfaces in Multimodal Web-based Systems. To appear in: Journal of Universal Access in the Information Society 2(4) (2003), Springer-Verlag
12. Pawlak Z.: An Inquiry into Anatomy of Conflicts. Journal of Information Sciences 108 (1998) 65–78
13. Pawlak Z.: Anatomy of Conflicts. Bull. EATCS 50 (1993) 234–246
14. Skowron A., Rauszer C.: The Discernibility Matrices and Functions in Information Systems. In: Słowiński R. (ed.): Intelligent Decision Support, Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers (1992) 331–362
15. Young H.P., Levenglick A.: A Consistent Extension of Condorcet's Election Principle. SIAM J. Appl. Math. 35 (1978) 285–300
Interpolation Techniques for Geo-spatial Association Rule Mining

Dan Li, Jitender Deogun, and Sherri Harms

Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln NE 68588-0115
Abstract. Association rule mining has become an important component of information processing systems due to a significant increase in its applications. In this paper, our main objective is to find which interpolation approaches are best suited for discovering geo-spatial association rules for unsampled points. We investigate and integrate two interpolation approaches into our geo-spatial association rule mining algorithm, which we call the pre-interpolation and post-interpolation approaches.
1 Introduction
As part of an NSF supported Digital Government Research project, we are developing a Geo-spatial Decision Support System (GDSS), with an initial focus on drought risk management. The system focuses on the information needs of the user and provides users with critical ongoing information for drought management. One of our project objectives is to discover human-interpretable patterns and rules associated with ocean parameters, atmospheric indices and climatic data. These rules will reflect the influence of ocean parameters (e.g. SOI, NAO, PDO) upon climatic and drought indices (e.g. SPI, PDSI). Association rule mining, one of the most important Knowledge Discovery in Databases (KDD) techniques in data mining, helps identify such relationships. The term association rule has been used to describe a specific form of rules like X ⇒ Y, where X is called the antecedent of the rule and Y its consequent [1]. Usually, cost and time considerations do not allow data to be sampled at all points in a region; therefore, spatial interpolation has been widely used in geographical information systems (GIS). Interpolation aims to find the function that will best represent the entire area; such a function predicts data values at unsampled points given a set of spatial data at sample points.
2 Preliminaries
A typical association rule has the form X ⇒ Y, where X is the antecedent episode, Y is the consequent episode, and X ∩ Y = ∅. Here, an episode is a
This research was supported in part by NSF Digital Government Grant No. EIA0091530 and NSF EPSCOR, Grant No. EPS-0091900.
collection of events in a particular order occurring close enough together in time. The time interval within which the events occur is called a sliding window. The frequency of an episode is the number of windows in which the episode occurs. Support and confidence are two widely used metrics for measuring the interestingness of association rules. The support of a rule X ⇒ Y is denoted by sup(X ⇒ Y); it indicates the percentage of episodes in the dataset that contain both X and Y. Support is thus simply a measure of the rule's statistical significance. We use conf(X ⇒ Y) to denote the probability that an episode contains Y given that it contains X, defined as conf(X ⇒ Y) = sup(X ⇒ Y)/sup(X). Since the number of association rules is usually very large, several algorithms for discovering representative association rules have been reported recently [3], [7]. A set of representative association rules is a minimal set of rules from which all association rules can be derived. In an earlier paper, we presented the REAR algorithm [2], which allows the user to constrain the search to user-specified target episodes and to find these episodes quickly, without the distraction of non-interesting episodes. We integrate two basic interpolation methods, IDW and Kriging, into our REAR algorithm. This makes it possible to discover relationships between environmental and climatic variables for unsampled sites that are not covered by sample sites. IDW interpolation assumes that each input point has a local influence that diminishes with distance: points closer to the unsampled point are assigned greater weight than those that are further away. Shepard's method gives the simplest form of IDW interpolation. The equation used in this method is:

F(x, y) = Σ_{i=1}^{n} w_i f_i(x_i, y_i)    (1)
where n is the number of sampled points in the data set, w_i are the weight functions assigned to each sample point, and f_i(x_i, y_i) are the data values at each sample point with coordinates (x_i, y_i); (x, y) denotes the coordinates of the interpolated (unsampled) point. In (1), the weight function is calculated as

w_i = h_i^{−p} / Σ_{j=1}^{n} h_j^{−p}    (2)

where p is the power parameter and h_i is the Euclidean distance from the sample point to the interpolated point. It may be noted that p controls the significance of the surrounding points upon the interpolated value: a higher power results in less influence from distant points. The weight functions are normalized so that the sum of the weights of all sample points equals 1, and the closer a sample point is to the interpolated point, the greater its weight. In comparison to the IDW method, Kriging interpolation uses a different kind of weight function, which depends not only on the distance but also on the geographic orientation of the sample point with respect to the interpolated node. A detailed description of the Kriging method by Oliver et al. can be found in [5].
3 Interpolation Techniques
We investigate two different approaches (pre-interpolation and post-interpolation) for integrating spatial interpolation techniques into our association rule mining algorithm. In the pre-interpolation approach, we apply appropriate interpolation methods to obtain datasets for query points; then, working on the interpolated data, association rules are discovered for these points. In the post-interpolation approach, we first discover association rules for sample points and then propose an interpolation algorithm to discover association rules for query points.

3.1 Pre-interpolation Method
In the pre-interpolation method, the datasets of the sample sites are the operands of the interpolation algorithm. The basic steps in this approach are: (1) Data interpolation: applying appropriate spatial interpolation algorithms to get the data values for unsampled sites. (2) Data processing: normalizing and clustering the data and transforming it into a format useful for data mining. (3) Data mining: using a rule discovery algorithm to generate association rules, with user-specified minimum frequency, window width, and minimum confidence value as input parameters. In the data interpolation step, we first apply the basic Shepard IDW and Kriging interpolation methods to obtain new datasets. For IDW, since sample points that are far away from the query point contribute little to the interpolated value, we only choose sample points that are "close enough" to the interpolation point in terms of Euclidean distance. In the IDW method, the weight functions depend merely on the Euclidean distances between the interpolation point and a collection of sample points. In a geo-spatial system, however, we need to consider geographic or other properties that may influence the interpolated data value. For instance, in our GDSS project, we need to take climatic region information into consideration when we interpolate weather-related datasets, because stations within the same climatic region are likely to share a similar climatic pattern [2]. The division into climatic regions is mainly determined by terrain and hypsographic properties. To integrate such information into the interpolation algorithm, we propose a new version of Shepard's IDW method obtained by modifying the weight functions: higher weights are assigned to sample points that are in the same climatic region as the interpolation point, and lower weights to other points. We propose to change the distance functions as

h_i′ = k × h_i, if in the same climatic region; h_i, otherwise    (3)

where k (0 < k ≤ 1) is a constant that can be determined empirically. It controls the significance of climatic regions upon the interpolated value: the smaller the value of k, the more weight the sample point contributes to the data value. Alternatively, we can modify the weight functions by changing the value of the power parameter in equation (2). The inclusion of climatic region information renders the interpolated data values more accurate, as shown in the experimental results.
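A hypothetical sketch of the modified distance (3); the flag telling whether a station shares the interpolation point's climatic region would come from the GDSS metadata (an assumption here).

```python
def modified_distance(h, same_region, k=0.5):
    """Eq. (3): shrink the distance by k (0 < k <= 1) for stations in the
    same climatic region, which raises their weight in eq. (2)."""
    return k * h if same_region else h
```

Feeding these modified distances into the weight formula (2) yields the region-aware variant of Shepard's method.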
After data is generated for the unsampled sites by the interpolation tools, it is discretized and clustered into seven meaningful categories: extremely dry, severely dry, moderately dry, near normal, moderately wet, severely wet, and extremely wet. This step relies on domain-expert involvement for proper normalization and transformation. For each index, we assign seven consecutive integers corresponding to the seven categories. After the data processing step, the data is well prepared for knowledge discovery.

3.2 Post-interpolation Method
In the post-interpolation method, we first discover association rules for the sample sites and then interpolate these rules to find association rules for unsampled sites. The guiding principle for this approach is that the relationships between environmental and climatic indices for sites that are spatially close may be quite similar. The basic steps are as follows: (1) Data processing: (same as in the pre-interpolation method). (2) Data mining: using rule discovery algorithms to generate association rules for sample sites. (3) Rule interpolation: developing rule interpolation algorithms to interpolate the rules from sample sites in order to obtain association rules for unsampled sites. Rule clustering is an important operation in rule interpolation. If a rule, or several similar rules, occurs at the surrounding sites, then with high probability this rule or a similar one occurs at the interpolated point. We say rules are similar if they indicate associations between the same indices. In a geo-spatial decision support system it is reasonable to make such an assumption, because the effect of some indices on other indices can be similar within a certain geographical area; thus the sites within this area may share some similar association rules. Based on this guiding principle, we group antecedent and consequent episodes according to the different indices that appear in the rule sets. Table 1 shows the result after clustering1. Since the overall goal of our system is to manage drought risk, drought-monitoring experts are particularly interested in drought-relevant rules. However, by nature, droughts occur infrequently, which means a huge number of rules are irrelevant to drought. A constraint-based data mining method is applied in our system to provide user-specified target episodes quickly and without the distraction of other non-interesting rules. As mentioned in Section 3.1, the data is clustered into seven categories. To explore drought-relevant episodes, we set the constraints to the first three drought categories. The first three rows of Table 1 show the classified data values of the three drought-relevant categories for the different indices; the last row gives the values after clustering. The purpose of clustering is to keep the rules that occur at two or more sample sites and reflect relationships between the same indices, though they may be in different drought-intensity categories. The rule interpolation algorithm, shown in Figure 1, is divided into five cases. The algorithm shows how to interpolate two association rules which occur at two
1 Here SPIx denotes the x-month SPI; that is, the SPI value is calculated for a selection of time scales, covering the last x months.
Table 1. Rule clustering.
                  SPI3  SPI6  SPI9  PDSI  SOI  NAO  PDO
Extremely dry       1     8    15    22    35   36   43
Severely dry        2     9    16    23    34   37   44
Moderately dry      3    10    17    24    33   38   45
After clustering   50    51    52    53    54   55   56
different sample sites2 . In the first case, two input rules have the same antecedent episodes, and there are some common events in two consequent episodes. In this case, the output of interpolated antecedent and consequent episodes is taken directly from these two rules, and the confidence of the new rule is calculated by IDW data interpolation method. The guiding principle in this case lies in the “downward-closed” property of support and confidence, which means every subepisode is at least as frequent as its superepisode [4]. In the second case, two input rules have the same antecedent, but different consequent episodes. However, they have common indices in their consequent episodes when clustering operation is considered. In this case, we first get these common indices, then get the corresponding data value for each index by IDW interpolation. The ROUND function is used to transform a data value to an integer so that it can be used by knowledge discovery algorithm.
Input: Two association rules r1: A1 ⇒ C1 (conf1), r2: A2 ⇒ C2 (conf2).
Output: An interpolated rule r: A ⇒ C (conf).
Case 1: A1 = A2 = A, C1 ∩ C2 = C ≠ ∅; then output A ⇒ C, conf = IDW(conf1, conf2).
Case 2: A1 = A2 = A, C1 ∩ C2 = ∅, but clustering(C1) ∩ clustering(C2) = CG ≠ ∅; then let C1′ and C2′ be the subsets of C1 and C2 such that clustering(C1′) = clustering(C2′) = CG. Let C = ROUND(IDW(C1′, C2′)); output A ⇒ C, conf = IDW(conf1, conf2).
Case 3: A1 ≠ A2, but |A1| = |A2|, clustering(A1) = clustering(A2), and C1 ∩ C2 = C ≠ ∅; then let A = ROUND(IDW(A1, A2)); output A ⇒ C, conf = IDW(conf1, conf2).
Case 4: A1 ≠ A2, but |A1| = |A2|, clustering(A1) = clustering(A2), and C1 ∩ C2 = ∅, but clustering(C1) ∩ clustering(C2) = CG ≠ ∅; then let A = ROUND(IDW(A1, A2)), and let C1′ and C2′ be the subsets of C1 and C2 such that clustering(C1′) = clustering(C2′) = CG. Let C = ROUND(IDW(C1′, C2′)). Output A ⇒ C, conf = IDW(conf1, conf2).
Case 5: Otherwise, do nothing.
Fig. 1. Rule interpolation algorithm.
2 It can be extended to the interpolation of three or more rules.
In the third case, the two antecedent episodes have different drought-intensity events, but the numbers of events in the two episodes are equal. Additionally, we can find a one-to-one match between the two antecedent event sets by using the clustering operation, i.e., clustering(A1) = clustering(A2). The situation for the two consequent episodes is the same as in the first case. To get the interpolated antecedent episode, we apply the IDW and ROUND functions to calculate the value of each index. The fourth case is a combination of the second and third cases.
Example 1. Consider two association rules r1: 34, 44 ⇒ 10, 15 (conf1 = 0.8); r2: 33, 43 ⇒ 8 (conf2 = 0.6). Suppose the Euclidean distances from the query point to these two sample sites are 3 and 5, respectively. Since we have A1 ≠ A2, |A1| = |A2|, and clustering(A1) = clustering(A2) = {54, 56}; C1 ∩ C2 = ∅, but clustering(C1) ∩ clustering(C2) = {51} ≠ ∅, this matches the fourth case of the rule interpolation algorithm. Thus, the interpolated antecedent episode is A = {ROUND(IDW(34, 33)), ROUND(IDW(44, 43))} = {34, 44}, and the interpolated consequent episode is C = {ROUND(IDW(10, 8))} = {10}. Here, the power parameter p equals 2. The confidence of this rule is calculated by the IDW method based on the distances from the two sample points to the interpolation point. Finally, we have rule r: 34, 44 ⇒ 10 (conf = 0.79). It is clear from this example that the clustering function gives the ability to interpolate two rules that reflect associations between the same input parameters.
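The following is a hypothetical Python sketch of Case 4 applied to Example 1 (the events are paired after sorting, which is an assumption; the figure leaves the pairing implicit). With plain Shepard weights h^{−p}, the sketch reproduces the interpolated antecedent {34, 44}; the rounded consequent and the confidence come out slightly below the values quoted above, which appear to rest on a somewhat different weighting convention.

```python
def idw2(v1, v2, h1, h2, p=2):
    """Two-point IDW with weights proportional to h**(-p), eq. (2)."""
    w1, w2 = h1 ** -p, h2 ** -p
    return (w1 * v1 + w2 * v2) / (w1 + w2)

def interpolate_case4(A1, A2, C1p, C2p, conf1, conf2, h1, h2):
    """Case 4 of Fig. 1: antecedents are interpolated event by event;
    C1p and C2p are the consequents already restricted to the common
    cluster group CG."""
    A = [round(idw2(a, b, h1, h2)) for a, b in zip(sorted(A1), sorted(A2))]
    C = [round(idw2(a, b, h1, h2)) for a, b in zip(sorted(C1p), sorted(C2p))]
    return A, C, idw2(conf1, conf2, h1, h2)

# Example 1: r1 at distance 3, r2 at distance 5 from the query point
print(interpolate_case4({34, 44}, {33, 43}, {10}, {8}, 0.8, 0.6, 3, 5))
# -> ([34, 44], [9], 0.747...)
```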
4 Quality Metrics
To evaluate the performance of our interpolation methods, we borrow two quality metrics widely used in information retrieval: precision and recall [6]. They are defined as follows:
Definition 1. Precision is the percentage of actual rules correctly discovered among all rules discovered by a certain interpolation algorithm, i.e.,

precision = #actual rules discovered / #discovered rules

Definition 2. Recall is the percentage of actual rules discovered by an interpolation method relative to the number of actual rules discovered from the sample datasets, i.e.,

recall = #actual rules discovered / #actual rules

From the definitions we can see that the value of recall shows what percentage of the actual rules has been successfully discovered, and the value of precision shows how many of the discovered rules are useful. We choose testing datasets from the datasets for sample points and apply the REAR rule mining algorithm to get the actual rule sets. The pre-interpolation and post-interpolation methods are used to find all discovered rule sets. We also define exact match and fuzzy
match when we compare new rule sets with the actual rule sets. The consideration of fuzzy match gives the ability to discover more similarity between generated rules and actual rules at the cost of accuracy.
Definition 3. Two rules r1: A1 ⇒ C1 and r2: A2 ⇒ C2 are called an exact match if |A1| = |A2|, |C1| = |C2|, and for every event x, x ∈ A1 iff x ∈ A2, and for every event y, y ∈ C1 iff y ∈ C2. Two rules are called a fuzzy match if they are an exact match after the clustering operation.
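A minimal, hypothetical sketch of the two match notions and the resulting metrics, with rules represented as (antecedent, consequent) pairs and `clustering` taken to be the event-to-cluster mapping of Table 1:

```python
def exact_match(r1, r2):
    """Definition 3: identical antecedent and consequent event sets."""
    (A1, C1), (A2, C2) = r1, r2
    return set(A1) == set(A2) and set(C1) == set(C2)

def fuzzy_match(r1, r2, clustering):
    """Exact match after mapping every event to its cluster value."""
    (A1, C1), (A2, C2) = r1, r2
    cl = lambda s: {clustering[e] for e in s}
    return cl(A1) == cl(A2) and cl(C1) == cl(C2)

def precision_recall(discovered, actual, match):
    """Definitions 1 and 2 applied to whole rule sets."""
    hits = sum(any(match(d, a) for a in actual) for d in discovered)
    return hits / len(discovered), hits / len(actual)
```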
5 Experimental Results
The results shown in this section are based on experiments with datasets for Nebraska, USA, from 1951 to 2000. There are more than three hundred weather stations in Nebraska. For our experiments, we chose 98 of them as our data sources. Among these 98 weather stations, we randomly selected 80 stations as our sample points; the other 18 stations are treated as unsampled sites. We first compare three pre-interpolation methods: basic Shepard IDW, modified Shepard IDW with the consideration of climatic region information, and Kriging3. The interpolated data values are evaluated by their mean squared error (MSE) against the actual datasets; we calculate the MSE for each station. Figure 2 shows the MSE results for the 18 interpolated stations. As can be seen, the consideration of climatic region information provides a better interpolation result than the original IDW method, and overall, Kriging gives a lower mean error than the two IDW approaches.
Fig. 2. Comparison of three data-based interpolation methods: MSE of original IDW, modified IDW, and Kriging for the 18 interpolated stations.
Next, we compare the pre-interpolation and post-interpolation methods. The REAR algorithm is applied to generate association rules, and precision and recall are calculated based on exact match and fuzzy match. We generate a total of 272 actual rules when we run the rule mining algorithm for the 18 test stations on the actual datasets. The numbers of rules generated by the different interpolation algorithms are shown in the second column of Table 2. We can see that the post-interpolation method gives higher precision while the pre-interpolation method (especially Kriging) provides higher recall. This trend is the same for both exact and fuzzy match rules. The reason is that rules are generated by the intersection of two or more rule sets, and the intersection operation dramatically limits the number of output rules. This illustrates a trade-off in the post-interpolation approach.
3 The experiments are based on the following input parameters: for the IDW method, we select the 4 nearest neighbors for each interpolation point, with p = 2; for modified IDW, k = 1/2; for Kriging, sill = 0.953, nugget = 0.049, and we choose the general exponential variogram model and ordinary Kriging.
Table 2. Comparison of pre-interpolation and post-interpolation methods.

Method              # Rules  # Exact  Precision  Recall   # Fuzzy  Precision  Recall
Original IDW          348      64      32.91%    32.08%     105     50.1%     52.6%
Modified IDW          238      76      31.76%    38.02%     127     54.16%    59.25%
Kriging               703     157      41.38%    51.42%     196     50.82%    70.79%
Post-interpolation     84      29      40.69%    16.09%      49     63.25%    24.87%

6 Conclusion
Knowledge discovery is one of the most active areas in the information processing field. In this paper, we focus on enhancing data mining techniques in the context of a Geo-spatial Decision Support System (GDSS) [2] by integrating spatial interpolation techniques into the KDD process. For spatio-temporal data mining in our GDSS, datasets are collected from a variety of sources; however, it is impossible to collect data for all sites of interest. To find association rules for unsampled sites, we integrate pre-interpolation and post-interpolation methods into our data mining algorithms. The IDW and Kriging methods are applied to interpolate the sample datasets. To obtain more accuracy, we modify the weight function in Shepard's IDW to accommodate geographic features. Precision and recall are used as two quality metrics from a data mining point of view. Based on our experiments, we find that the post-interpolation method provides higher precision while the pre-interpolation approach gives higher recall. Among the three pre-interpolation methods, Kriging outperforms the others. The performance of post-interpolation is weakened by the intersection operation. As future work, we intend to develop a hybrid method that takes advantage of both the pre-interpolation and post-interpolation approaches.
References
1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, pages 478–499, Santiago, Chile, 1994.
2. S. Harms, D. Li, J. Deogun, and T. Tadesse. Efficient rule discovery in a geo-spatial decision support system. In Proceedings of the 2002 National Conference on Digital Government Research, pages 235–241, Los Angeles, California, USA, May 2002.
3. M. Kryszkiewicz. Representative association rules. In Lecture Notes in Artificial Intelligence, volume 1394, pages 198–209. Proceedings of the Practical Applications of Knowledge Discovery and Data Mining (PAKDD 98), Springer-Verlag, 1998.
4. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the 1st Inter. Conf. on KDD, pages 210–215, Canada, 1995.
5. M. Oliver and R. Webster. Kriging: a method of interpolation for geographical information systems. Int. J. Geographical Information Systems, 4(3):313–332, 1990.
6. V.V. Raghavan, G.S. Jung, and P. Bollman. A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7(3):205–229, July 1989.
7. J. Saquer and J. S. Deogun. Using closed itemsets for discovering representative association rules. In Proceedings of the Twelfth International Symposium on Methodologies for Intelligent Systems (ISMIS 2000), Charlotte, NC, October 11–14, 2000.
Imprecise Causality in Mined Rules

Lawrence J. Mazlack

University of Cincinnati, Cincinnati, OH 45221
[email protected]

Abstract. Causality occupies a central position in human reasoning. It plays an essential role in commonsense decision-making. Data mining hopes to extract unsuspected information from very large databases. The results are inherently soft or fuzzy, as the data is generally both incomplete and inexact. The best known data mining methods build rules. Association rules indicate the associative strength of data attributes. In many ways, the interest in association rules is that they seem to suggest causal, or at least predictive, relationships. Whether any association rules can be said to express a causal relationship needs to be examined. In part, the utility of mined association rules depends on whether the rule is causal or coincidental. This paper explores some of the factors that impact causality in mined rules.
1 Introduction

Data has been gathered in databases to support concerns such as billing, inventory control, and record keeping. The rate and amount of data gathered have become staggering. Data mining's goal is to bridge the gap between data generation and data understanding through intelligent analysis and human-oriented visualization. Potentially, the low-level transactional data can be a source of new and potentially useful strategic information about customers, products, trends and organizations. Data mining attempts to discover previously unknown relationships of value. Data mining results are inherently soft or fuzzy, as the data is generally both incomplete and inexact; precise results cannot reasonably be expected. Data mining is secondary analysis: the data were not collected to answer the questions now posed. Data is examined to discover patterns beyond those that were hypothesized before the data was collected. For example, perhaps we are examining long distance telephone call records. The records were collected for billing. Secondary analysis tries to recognize calling patterns involving things such as call length, time of day, calling plan, from where to where, etc. There are several different kinds of data mining results. The most common are conditional rules or association rules. For example: • Conditional rule: IF Age < 20 THEN Income < $10,000 with {belief = 0.8}
• Association rule: Customers who buy beer and sausage also tend to buy mustard with {confidence = 0.8} in {support = 0.15}
Of the two, conditional rules are the older; they can be extracted from an inductively learned graph. At first glance, decision structures of this kind seem to imply a causal or cause-effect relationship. For example, from above: a customer's purchase of both sausage and beer SEEMS to cause the customer to also buy mustard. In point of fact, association rules are not designed to describe causal relationships. All that can be said is that they describe the strength of co-occurrences. Sometimes, the relationship might be
causal; for example, if someone eats salty peanuts and then drinks beer, there could well be a causal relationship. On the other hand, if a dog barks in Kabul and a cat meows in Beijing, there is most likely no causal relationship. Causality occupies a position of centrality in commonsense human reasoning. In particular, it plays an essential role in human decision-making by providing a basis for choosing the action that is likely to lead to a desired result. Considerable effort has been spent examining the question of causation. Hume [1] defined necessary and sufficient conditions: "We may define a cause to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second." In general, this reflects our commonsense understanding of causation. This paper focuses on discovering causality in rules discovered by data mining.
2 Causality, General

Centuries ago, in their quest to unravel the future, mystics aspired to decipher the cries of birds, the patterns of the stars and the garbled utterances of oracles. Kings and generals would offer precious rewards for the information soothsayers furnished. Today, though predictive methods are different from those of the ancient world, the advanced knowledge that these techniques attempt to provide is ever more valued. From weather reports to stock market prediction, and from medical prognoses to social forecasting, superior insights about the shape of things to come have become increasingly prized commodities [2]. There are newly discovered limitations. Gödel's theorem, Heisenberg's uncertainty principle, chaos theory and other mathematical insights have shown how our knowledge of the world might never be complete, because observers are integral parts of the observed. Furthermore, as studies of general relativity topologies have demonstrated, advanced information about the future could lead to paradoxical situations. Democritus, the Greek philosopher, once said: "Everything existing in the universe is the fruit of chance and necessity." This seems self-evident: both randomness and causation are in the world. The example that Democritus gave was the poppy. Whether the poppy seed lands on fertile soil or on a barren rock is chance. If it takes root, however, it will grow into a poppy, not a geranium or a Siberian Husky. The history of science is, in a sense, the study of causation. There is a hidden agenda that almost says randomness is forbidden. When Western scientists encounter randomness, they assume it is illusory; that, deep down, causation rules -- and thus our capacity to predict. This belief was eroded in the early 20th century, when quantum physics restored the luxury of randomness in individual particles. Still, the bias remains that causation is fundamental, and some physicists swear that someday even quantum theory will give way to an underlying determinism. Many users of data mining make the implicit assumption that discovered results represent causal knowledge, or at the least are statistically predictive. This bias is critical to the way mined results are evaluated. We need to recognize the degree to which causality exists.
3 Association Rules

Association rules represent positive associations between attributes. Commonly, the rules are developed from categorical data expressed as a 0/1 Boolean matrix. The format of an association rule is: When <event1> occurs, <event2> will also occur with <confidence> in <support> of cases.
For example:
Customers who buy beer and sausage also tend to buy mustard with {confidence = 0.8} in {support = 0.15} of cases
This can also be understood as:
80% of customers who buy beer and sausage also buy mustard; 15% of all customers buy beer and sausage
The problem of finding association rules was introduced by Agrawal [3]. Discovering association rules often follows methods based on the Apriori approach [4]. A number of authors have considered ways of efficiently developing association rules [5] [6] [7] [8] [9] [10]. A general data mining problem is how to recognize interesting results. Many rules can be generated from a large collection of data. With association rules, the heuristic use of minimum support and confidence thresholds is a way of pruning rules. Most often, these thresholds are user established. User specification is not an entirely satisfactory way of setting thresholds; for one thing, users might be naive. Support is the ratio of the records having positive values for the attributes of (X∪Y) to the number of all records. Confidence is the ratio of the records having positive values for all attributes of (X∪Y) to the number of records having positive values for X. Support is a symmetric metric: the support of event1 ⇒ event2 has the same value as the support of event2 ⇒ event1. Confidence, in contrast, is not symmetric: the chance that event2 will occur given that event1 occurs generally differs from the chance that event1 will occur given that event2 occurs. Support is a less formal and more heuristic metric than statistical dependence from probability theory. It is a heuristic, as the strict application of probability theory requires it to be known whether event1 and event2 are independent variables. Association rules may or may not describe independent events.
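For concreteness, here is a hypothetical Python sketch of both metrics over a 0/1 matrix (the toy baskets are invented); swapping the column sets makes the asymmetry of confidence easy to check:

```python
import numpy as np

def support_confidence(data, x_cols, y_cols):
    """Support and confidence of X => Y over a 0/1 matrix:
    one row per record, one column per attribute."""
    X = data[:, x_cols].all(axis=1)
    XY = X & data[:, y_cols].all(axis=1)
    return XY.mean(), XY.sum() / X.sum()   # support, confidence

# Toy baskets: columns = [beer, sausage, mustard]
baskets = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(support_confidence(baskets, [0, 1], [2]))  # (0.4, 0.666...)
print(support_confidence(baskets, [2], [0, 1]))  # (0.4, 0.5): same support,
                                                 # different confidence
```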
4 Causality Variations

In many ways, the interest in association rules is that they offer the promise (or illusion) of causal, or at least predictive, relationships. Causality is a central concept in many branches of science and philosophy. In a way, the term "causality" is like "truth" -- a word with many meanings and facets. Some definitions are extremely precise and brittle. Some involve a reasoning style that can best be supported by fuzzy logic. The classic approach to determining whether there is a causal connection is to perform randomized, controlled experiments. Randomized experiments may remove reasons for uncertainty about whether or not a relationship is causal. Data mining is a secondary analysis of previously collected data. This eliminates the possibility of experimentally varying the data to search for causal relationships. Large existing data sets are typically the subject of data mining; even if some experimentation is possible, the amount of experimentation will be small in contrast to the amount of existing data to be mined. A largely unexplored aspect of mined association rules is how to determine when one event actually causes another to happen. When association rules are built, we determine covariability (co-variation, correlation, co-dependence, statistical association). Given that A and B are variables, there appears to be a deterministic or statistical covariability between A and B. Is this covariability a causal relation? More generally, when is a relation a causal relation? Differentiating between covariability and causality presents a difficult problem, especially in the context of data mining.
4.1 Types of Causality

There are at least three ways that things may be said to be related:
• Coincidental: Two things happen to describe the same object and have no determinative relationship between them.
• Functional: There is a generative relationship.
• Causal: One thing causes another thing to happen.
There are at least three causal types:
• Chaining: In this case, there is a temporal chain of events, A1, A2, ..., An. To what degree, if any, does Ai (i = 1, ..., n−1) cause An? A special case of this is a backup mechanism or a preempted alternative: suppose we have a chain of causal dependence, A1 causing A2; suppose that if A1 does not occur, A2 still occurs, now caused by the alternative cause B1 (which only occurs if A1 does not).
• Conjunction (Confluence): In this case, we have a confluence of events, A1, ..., An, and a resultant event, B. To what degree, if any, did or does Ai cause B? A special case of this is redundant causation: say that either A1 or A2 can cause B, and both A1 and A2 occur simultaneously. What can be said to have caused B?
• Network
Recognizing true causality is a difficult thing. Several examples may help in understanding the difficulty:
• Example 1: Simultaneous Plant Death: My rose bushes and my neighbor's rose bushes both die. Did the death of one cause the other to die? (Probably not, although the deaths are associated.)
• Example 2: Drought: There has been a drought. My roses and my neighbor's rose bushes both die. Did the drought cause both rose bushes to die? (Most likely.)
• Example 3: Traffic: My friend calls me up on the telephone and asks me to drive over and visit her. While driving over, I ignore a stop sign and drive through an intersection. I am hit by another driver. I die. Who caused my death? -- Me? -- The other driver? -- My friend? -- The intersection designer? -- Fate?
• Example 4: Umbrellas: A store owner doubles her advertising for umbrellas. Her sales increase by 20%. What caused the increase? -- Advertising? -- Weather? -- Fashion? -- Chance?
• Example 5: Poison (chance increase without causation): Fred and Ted both want Jack dead. Fred poisons Jack's soup, and Ted poisons his coffee; each act increases Jack's chance of dying. Jack eats the soup but (feeling rather unwell) leaves the coffee, and dies later. Ted's act raised the chance of Jack's death but was not a cause of it.
Zadeh [11] has suggested that it is difficult even to precisely define what is meant by causality, or at least to know what is meant by it. He frames this in the context of being able to unambiguously define causality within the structure of classical logic. Zadeh [12] states that a satisfactory definition must have the following elements:
• The definition is general, in the sense that it is not restricted to a narrow class of systems or phenomena.
• The definition is precise and unambiguous, in the sense that it can be used as a basis for logical reasoning and/or computation.
• The definition is operational, in the sense that, given two events A and B, the definition can be employed to answer the questions: Did or does or will A cause B, or vice versa? If there is a causal connection between A and B, what is its strength?
• The definition is consonant with our intuitive perception of causality and does not lead to counterintuitive conclusions.
Perhaps this may be true in the context of a restricted formal language. However, the problem remains that we have a commonsense understanding of causality and need to deal with it in terms of the quality of our decision making. Hisdal [13], in response to Zadeh, is somewhat less pessimistic. She states that it is often possible to say with certainty that an event E1 is NOT the cause of an event E2. For example, let E1, E2 be two events occurring at times t1, t2 respectively. If t1 > t2, then E1 is not the cause of E2. Nikravesh [14] suggests in response to Zadeh that (a) it might not be necessary for there to be a precise definition; and (b) there are inherent limits in translating words that are understood in natural language into precise statements.

4.2 Classical Dependence

Statistical Independence: Statistical dependence is often confused with causality. Such reasoning is not correct. Two events E1, E2 may be statistically dependent because both have a common cause E0, but this does not mean that E1 is the cause of E2. For example, lack of rain (E0) may cause my rose bush to die (E1) as well as that of my neighbor (E2). This does not mean my rose's dying caused my neighbor's rose's death, or conversely. However, the two events E1, E2 are statistically dependent. The general definition of statistical dependence is: Let A, B be two random variables that can take on values in the domains {a1, a2, ..., ai} and {b1, b2, ..., bj} respectively. Then A is said to be statistically independent of B iff prob(ai|bj) = prob(ai) for all bj and for all ai. The formula prob(ai, bj) = prob(ai) prob(bj) describes the joint probability of ai AND bj when A and B are independent random variables. The law of compound probabilities then follows: prob(ai, bj) = prob(ai) prob(bj|ai). In the absence of causality, this is a symmetric measure; namely, prob(ai, bj) = prob(bj, ai).
Causality vs. Statistical Dependence: A causal relationship between two events E1 and E2 will always give rise to a certain degree of statistical dependence between them. The converse is not true: a statistical dependence between two events may, but need not, indicate a causal relationship between them. We can tell there is a positive correlation if prob(ai, bj) > prob(ai) prob(bj).
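As a small illustration (with invented 0/1 observations), a hypothetical sketch of this check:

```python
import numpy as np

def dependence_check(a, b):
    """Compare the empirical joint probability prob(a, b) with the product
    prob(a) * prob(b); a positive difference indicates positive correlation,
    which may -- but need not -- reflect a causal relationship."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    joint = (a & b).mean()
    return joint, a.mean() * b.mean()

# E1 = my roses died, E2 = my neighbor's roses died (both driven by drought)
e1 = [1, 0, 1, 1, 0, 1, 0, 0]
e2 = [1, 0, 1, 0, 0, 1, 0, 1]
print(dependence_check(e1, e2))  # joint 0.375 > 0.25: dependent, not causal
```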
However, all this tells us is that there is an interesting relationship; it does not tell us whether there is a causal relationship. Following this reasoning, it is reasonable to suggest that association rules developed as the result of link analysis might be considered causal, if only because a time sequence is involved. In some applications, such as communication fault analysis (Hatonen [15]), causality is assumed. In other potential applications, such as market basket analysis1, the strength of time-sequence causality is less apparent. For example, if someone buys milk on day1 and dish soap on day2, is there a causal relationship? Perhaps some strength-of-implication function could be developed. Some forms of experimental marketing might be appropriate; however, how widely they might be applied is unclear. For example, a food store could carry milk (E1,m=1) one month and not carry dish soap. The second month the store could carry dish soap (E2,m=2) and not milk. The third month, it could carry both milk and dish soap (E1,m=3), (E2,m=3). That would determine both the independent and joint probabilities (setting aside seasonality issues). Then, if prob(E1,m=3, E2,m=3) > prob(E1,m=1) prob(E2,m=2), there would be some evidence of a causal relationship.

4.3 Acyclic Graphs

Other authors have suggested that it is sometimes possible to recognize causal relations by developing acyclic graphs. Pearl [16] and Spirtes [17] make the claim that it is possible to infer causal relationships between two variables from associations found in observational (non-experimental) data without substantial domain knowledge. Spirtes et al. claim that directed acyclic graphs can be used if (a) the sample size is large and (b) the distribution of random values is faithful to the causal graph. Robins [18] argues that their argument is incorrect. Their dispute deals with sample size as well as the crisp statement that "there exist no unmeasured confounders." Without going deeply into their debate, it would appear that part of the problem is with this statement's crispness. Also, some of the discussion is based on Bayesian analysis; possibly fuzzy functions would help. Only experimentation can tell. Lastly, Scheines [19] claims that only in some situations will it be possible to determine causality. Developing directed acyclic graphs is computationally expensive: the amount of work increases geometrically with the number of attributes. Full Bayesian networks are based on directed acyclic graphs and are similarly complex, in that their complexity is at least exponential in the number of attributes [20]. To reduce the amount of work, the data can be sampled and sets of directed acyclic graphs constructed for each sample. The task then becomes finding the "best" graph set [21]. Work is progressing in these areas, but it has yet to result in a robust solution. Using acyclic graphs is not uncommon when mining web data, and considering acyclic graphs in already formed graphs is also not unusual. The question is whether they can be profitably used in mining directly from collected data. It would seem reasonable to test experimentally whether acyclic graphs can be usefully applied to causality in large data sets. Perhaps possible
1 Time sequence link analysis can be applied to market basket analysis when the customers can be recognized; for example, through the use of customer "loyalty" cards in supermarkets or "cookies" in e-commerce.
experiments might also include different levels of granularity. Possibly, granularity might be increased by either concept hierarchies or clustering.

4.4 Probabilistic Causation

Probabilistic causation designates a group of philosophical theories that aim to characterize the relationship between cause and effect using the tools of probability theory. A primary motivation for the development of such theories is the desire for a theory of causation that does not presuppose physical determinism. The success of quantum mechanics, and to a lesser extent of other theories employing probability, has brought some to question determinism. Thus, many philosophers have been interested in developing a theory of causation that does not presuppose determinism. The central idea behind probabilistic theories of causation is that causes raise the probability of their effects. One notable feature has been a commitment to indeterminism, or rather a commitment to the view that an adequate analysis of causation must apply equally to deterministic and indeterministic worlds. Mellor [22] argues that indeterministic causation is consistent with the connotations of causation. Hausman [23], on the other hand, defends the view that in indeterministic settings there is, strictly speaking, no indeterministic causation, but rather deterministic causation of probabilities. Following Suppes [24] and Lewis [25], one approach has been to replace the thought that causes are sufficient for, or determine, their effects with the thought that a cause need only raise the probability of its effect. This shift of attention has raised the thorny issue of which kind of analysis of probability, if any, is up to the job of underpinning an account of indeterministic causation. It can be argued that the philosophy of probabilistic causation provides the intellectual motivation for using directed acyclic graphs.
5 Epilogue

Data mining holds the promise of extracting new information from very large databases. Association rules are a linearly complex, user-understandable way of doing data mining. The results are particularly useful when it is important that both the rules themselves and the methodology behind the rules' development be understandable. Causality is a central concept in many branches of science and philosophy. In a way, the term "causality" is like "truth" -- a word with many meanings and facets. Some of the definitions are extremely precise; some of them involve a style of reasoning that can best be supported by fuzzy logic. The more imprecise definitions generally arise from oral statements. There are several open questions across all types of association rules. A fundamental question is determining whether or not an association is causal. Currently, association rules only measure the relative co-occurrence of values. It would be of substantial value if the degree of causality of an association could be known. A deep question is when anything can be said to cause anything else; and if it does, what is the nature of the causality? There is a strong motivation to attempt causality discovery in association rules. In many ways, the interest in association rules is that they offer the promise (or illusion) of causal, or at least predictive, relationships. The research concern is how best to approach the recognition of causality or non-causality in mined rules resulting from secondary analysis.
References
1. D. Hume [1748] An Enquiry Concerning Human Understanding.
2. P. Halpern [2001] The Pursuit Of Destiny, Perseus, Cambridge, Massachusetts.
3. R. Agrawal, T. Imielinski, A. Swami [1993] "Mining Association Rules Between Sets Of Items In Large Databases," Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD93), P. Buneman, S. Jajodia (eds.), Washington, DC, May 1993, 207–216.
4. R. Agrawal, R. Srikant [1994] "Fast Algorithms for Mining Association Rules," VLDB Proceedings, 1994, 487–499.
5. J. Han, Y. Fu [1995] "Mining Optimized Association Rules For Numeric Attributes," VLDB Proceedings, Zurich, 1995.
6. H. Mannila, H. Toivonen, A.I. Verkamo [1994] "Efficient Algorithms For Discovering Association Rules," AAAI Workshop on Knowledge Discovery and Databases, U.M. Fayyad, R. Uthurusamy (eds.), Seattle, Washington, July 1994, 181–192.
7. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, A.I. Verkamo [1994] "Finding Interesting Rules from Large Sets Of Discovered Association Rules," CIKM 1994.
8. A. Savasere, E. Omiecinski, S. Navathe [1995] "An Efficient Algorithm For Mining Association Rules In Large Databases," VLDB Proceedings, Zurich, 144–155.
9. H. Toivonen [1996] "Sampling Large Databases For Association Rules," Proceedings Of The 22nd VLDB Conference.
10. S. Brin, R. Rastogi, K. Shim [1999] "Mining Optimized Gain Rules For Numeric Attributes," ACM SIGKDD Proceedings, San Diego, 1999, 135–144.
11. L.A. Zadeh [2000] "Abstract Of A Lecture Presented At The Rolf Nevanlinna Colloquium, University of Helsinki," reported to: Fuzzy Distribution List, [email protected], August 24, 2000 (online only).
12. L.A. Zadeh [2001] "Causality Is Undefinable," Abstract Of A Lecture Presented At The BISC Seminar, University of California, Berkeley, reported to: Fuzzy Distribution List, [email protected], January 16, 2001 (online only).
13. Hisdal [2000] "BISC: Re: Definability, Causality, Statistical Dependence, Physics vs Math," BISC Distribution List, [email protected], reported to: Fuzzy Distribution List, [email protected], August 29, 2000 (online only).
14. M. Nikravesh [2000] "BISC: Causality," BISC Distribution List, [email protected], reported to: Fuzzy Distribution List, [email protected], December 8, 2000 (online only).
15. K. Hatonen, M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen [1996] "Knowledge Discovery From Telecommunication Network Alarm Databases," Conference On Data Engineering (ICDE'96) Proceedings, New Orleans, 115–122.
16. J. Pearl, T. Verma [1991] "A Theory Of Inferred Causation," Principles Of Knowledge Representation And Reasoning: Proceedings Of The Second International Conference, J. Allen, R. Fikes, E. Sandewall (eds.), Morgan Kaufmann, 441–452.
17. P. Spirtes, C. Glymour, R. Scheines [1993] Causation, Prediction, and Search, Springer Verlag, New York.
18. R. Robins, L. Wasserman [1999] "On The Impossibility Of Inferring Causation From Association Without Background Knowledge," in Computation, Causation, and Discovery, C. Glymour, G.F. Cooper (eds.), AAAI Press/MIT Press, Menlo Park, 305–321.
19. R. Scheines, P. Spirtes, C. Glymour, C. Meek [1994] Tetrad II: Tools For Causal Modeling, Lawrence Erlbaum, Hillsdale, NJ.
20. D. Heckerman, C. Meek, G. Cooper [1997] A Bayesian Approach To Causal Discovery, Microsoft Technical Report MSR-TR-97-05.
21. P. Spirtes [2001] "An Anytime Algorithm for Causal Inference," AI and Statistics 2001 Conference (AISTATS 2001).
22. D.H. Mellor [1995] The Facts of Causation, Routledge, London.
23. D. Hausman [1998] Causal Asymmetries, Cambridge University Press, Cambridge.
24. P. Suppes [1970] A Probabilistic Theory of Causality, North-Holland Publishing Company, Amsterdam.
25. D. Lewis [1986] "Causation" and "Postscripts to Causation," in Philosophical Papers, Volume II, Oxford University Press, Oxford, 172–213.
Sphere-Structured Support Vector Machines for Multi-class Pattern Recognition

Meilin Zhu, Yue Wang, Shifu Chen, and Xiangdong Liu

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, 210093
[email protected], [email protected], [email protected], [email protected]
Abstract. Support vector machines (SVMs) are learning algorithms derived from statistical learning theory. The SVM approach was originally developed for binary classification problems. For solving multi-class classification problems, there are methods such as one-against-rest, one-against-one, and all-together. However, the computing time of all these methods is too long for large-scale problems. In this paper, SVM architectures for multi-class problems are discussed; in particular, we provide a new algorithm, called sphere-structured SVMs, for solving the multi-class problem. We present the algorithm in detail and analyze its characteristics. Not only is the number of convex quadratic programming problems in sphere-structured SVMs small, but the number of variables in each program is also minimal, so the computing time of classification is reduced. Moreover, the characteristics of sphere-structured SVMs make it easy to expand the data. Keywords: support vector machines, multi-class pattern recognition, sphere-structured, kernel function, quadratic programming
1 Introduction

Support Vector Machines (SVMs) have recently been proposed by Vapnik and his coworkers [1] as a very effective method for general-purpose pattern recognition. Intuitively, given a set of points belonging to two classes, an SVM finds the hyperplane that separates the largest possible fraction of points of the same class on the same side, while maximizing the distance from either class to the hyperplane. Successful applications of SVM algorithms have been reported in various fields, for instance in the context of optical pattern and object recognition, text categorization, time-series prediction, gene expression profile analysis, DNA and protein analysis, and many more. The basic theory of SVMs is described for two-class classification; a multi-class pattern recognition system can be obtained by generalizing the two-class SVM. The following three strategies can be applied to build k-class classifiers: one-against-rest classifiers, which classify between each class and all the remaining ones; one-against-one classifiers [2], which classify between each pair of classes; and all-together classifiers [3], which directly consider all the data in one optimization formulation.
All the methods above have advantages and limitations, and all of them use the SVM method with optimal separating hyperplanes. That is, in a multi-class classification problem the SVM separates the data space with multiple hyperplanes, and each individual class is enclosed in some region surrounded by several hyperplanes. In this paper we therefore propose to separate the data of each individual class with hyperspheres. Accordingly, the data space is constituted by several hyperspheres, which resembles a set of soap bubbles in 3-dimensional space. This algorithm has advantages in both complexity and scale.
2 The Method of Sphere-Structured SVM

Mathematically we can abstract the k-class problem as follows: given the elements of the sets A^m, m = 1, …, k, in the n-dimensional real space R^n, construct a discriminant function that separates these points into distinct regions. Each region A^m contains l_m points x_i^m, i = 1, …, l_m, belonging to all or almost all of the same set. For every set A^m, m = 1, …, k, we try to find the sphere (a^m, R^m), described by its center a^m and the square of its radius R^m, with minimum R^m and containing all (or most of) the points x_i^m, i = 1, …, l_m. Because this description may become very sensitive to the most outlying objects in the target data set, we allow for some data points outside the spheres. Analogous to [1], we introduce slack variables ξ_i^m and obtain the constraints:
‖x_i^m − a^m‖² ≤ R^m + ξ_i^m,  i = 1, …, l_m    (1)

ξ_i^m ≥ 0,  i = 1, …, l_m    (2)

We minimize the square of the radius R^m of the sphere together with the size of the slack variables:

F(R^m, a^m, ξ_i^m) = R^m + C^m ∑_i ξ_i^m    (3)
for a given constant C^m, which gives the trade-off between the two error terms: the volume of the sphere and the number of target objects rejected. We can then transform the original problem into a comparatively simple dual problem [4]:

max_{α_i^m} L = ∑_i α_i^m (x_i^m · x_i^m) − ∑_{i,j} α_i^m α_j^m (x_i^m · x_j^m)    (4)

s.t.  ∑_i α_i^m = 1,  0 ≤ α_i^m ≤ C^m    (5)

Given a test point x, we compute ‖x − a^m‖², m = 1, …, k; d is the number of spheres which satisfy ‖x − a^m‖² ≤ R^m. There are three cases, as follows:
Case 1: d = 0. This means that x is not in any sphere; then we find the nearest sphere by:
(x − a^p) · (x − a^p) − R^p = min{(x − a^m) · (x − a^m) − R^m | m = 1, …, k}    (6)
So x belongs to the p-th class.
Case 2: d = 1. This means that the point x belongs to exactly one sphere, so x belongs to the class represented by this sphere.
Case 3: d > 1. This means that x is located in the margin of several spheres. We use planes to divide the intersecting parts of the spheres. We give the algorithm as follows:
Step 1: For x, let I be the set of serial numbers of the spheres containing the point x.
Step 2: For any i, j ∈ I, x is in both the i-th and the j-th sphere. Through the intersection of the two spheres, a plane is drawn perpendicular to the line which connects the two spheres' centers a_i and a_j. We compute the projection of a_i x onto a_i a_j, and then decide on which side of the plane x should lie.
Step 3: We compare the two projections; the equivalent expression is:
2(x · a_j) − 2(x · a_i) + (a_i · a_i) − (a_j · a_j) − R_i + R_j    (7)

If the value above is greater than zero, then x belongs to the j-th class; otherwise, x belongs to the i-th class. Suppose x belongs to the i-th class.
Step 4: Let I = I \ {j} and go to Step 2, until |I| = 1; the last element of I is the class to which x should belong.
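To make the three-case decision rule concrete, the following minimal NumPy sketch implements it. The function and variable names are ours; `centers` and `radii2` are assumed to hold the centers a^m and squared radii R^m obtained by solving the per-class problems (4)-(5).

```python
import numpy as np

def classify(x, centers, radii2):
    """Assign x to one of k classes given the trained spheres (a^m, R^m).

    centers: (k, n) array of sphere centers; radii2: (k,) squared radii.
    A sketch of the paper's three-case rule, not the authors' code.
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # ||x - a^m||^2 for every m
    inside = [m for m in range(len(centers)) if d2[m] <= radii2[m]]

    if not inside:                  # Case 1: pick the nearest sphere via (6)
        return int(np.argmin(d2 - radii2))
    if len(inside) == 1:            # Case 2: x lies in exactly one sphere
        return inside[0]

    # Case 3: pairwise elimination using the separating-plane test (7)
    I = list(inside)
    while len(I) > 1:
        i, j = I[0], I[1]
        ai, aj = centers[i], centers[j]
        v = (2 * x.dot(aj) - 2 * x.dot(ai)
             + ai.dot(ai) - aj.dot(aj) - radii2[i] + radii2[j])
        I.remove(i if v > 0 else j)   # v > 0: x belongs to class j, drop i
    return I[0]
```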
3 The Application of Kernel Function

Normally, data is not spherically distributed, even when the most outlying objects are ignored. To make the method more flexible, the object vectors x can be transformed into a higher-dimensional feature space. As explained above, the inner products in the above equations can be substituted by a kernel function K(x, y) whenever this kernel satisfies Mercer's theorem. The dual problem is now given by:
max_{α_i^m} L = ∑_i α_i^m k(x_i^m, x_i^m) − ∑_{i,j} α_i^m α_j^m k(x_i^m, x_j^m)    (8)
Taking different kernel functions K results in other types of feature spaces and thus in differently shaped domain descriptions. This can make the description more flexible and more accurate than the very rigid spherical shape.
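As an illustration, the kernelized dual (8) for one class can be solved with a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP, which is our choice of solver rather than the paper's; it also assumes at least one unbounded support vector (0 < α_i < C) when recovering R² as the kernel distance from a support vector to the center. All names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sphere(X, C=1.0, sigma=1.0):
    """Solve the kernelized one-class dual (8) for the points X of one class.

    Returns the multipliers alpha (the sphere center in feature space is
    sum_i alpha_i * phi(x_i)) and the squared radius R^2.
    """
    n = len(X)
    # Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))

    neg_L = lambda a: -(a @ np.diag(K) - a @ K @ a)        # minimize -L
    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}  # sum alpha = 1
    res = minimize(neg_L, np.full(n, 1.0 / n), bounds=[(0.0, C)] * n,
                   constraints=cons, method="SLSQP")
    alpha = res.x

    # Kernel distance from an unbounded support vector to the center gives R^2.
    sv = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = K[sv, sv] - 2 * alpha @ K[:, sv] + alpha @ K @ alpha
    return alpha, R2
```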
4 Characteristic Analysis

Compared with all kinds of multi-class SVM algorithms, the sphere-structured classification method has the following merits:
(1) Large data scale: Because each quadratic program in the sphere-structured classification algorithm only involves the samples of a single class, the capacity for managing data is improved greatly. As is well known, there exist some disadvantages when the SVM handles large-scale recognition problems; the sphere-structured classification method compensates for exactly these disadvantages. From the point of view of data scale, the sphere-structured classification method excels over other multi-class SVM methods.
(2) Low complexity: Other multi-class SVM methods need more than k quadratic programs, increasing the computation time. Since the sphere-structured classification method needs only k quadratic programs with simple constraints, it can easily be generalized and improved, or transformed into other optimization problems.
(3) Easy expandability: In all kinds of multi-class SVM, when a new class is added (for example, a new face image in a face database), the original classification system is destroyed: it is necessary to compare the new class with the previous classes and compute several quadratic programs once more to find the new support vectors. However, the sphere-structured classification method does not disturb the previous classes, so the previous computation remains effective. All that is needed is to compute one quadratic program based on the new class itself, plus some other simple computations. From this point of view, the sphere-structured classification method can be manipulated more easily; it repeats no work and has a stronger ability to expand.
5 Experiments and Conclusion

In this section we present computational results comparing one-against-rest, one-against-one, all-together and sphere-structured SVM. Several experiments on real-world datasets, such as Iris, Wine and Glass from the UCI repository, are reported. To enable comparison, C = ∞ was chosen for each algorithm, and the Gaussian kernel K(x, y) = exp(−‖x − y‖²/(2σ²)) was used. It is not necessary for the same value of σ to be used for all methods; better solutions may result from different choices of σ. 10-fold cross validation was used to estimate generalization on future data. The error rates and computing times are summarized in Table 1. From the table we can see that the sphere-structured SVM generalizes well and its computing time (unit: seconds) is the shortest among all the methods.
Table 1. Comparison of multi-class classification methods (error rate; time in seconds)

Name    One-against-rest    One-against-one    All-together       Sphere SVM
        error     time      error     time     error     time     error    time
Iris    0.013     21.25     0.013     7.25     0.013     11.33    0.013    1.40
Wine    0.006     31.46     0.006     10.51    0.022     15.54    0.011    1.96
Glass   0.286     126.48    0.238     25.47    0.257     114.94   0.319    2.46
In this paper we proposed a sphere-structured SVM for multi-class problems, which can be applied to large-scale multi-class pattern recognition with fast computing speed and good expansibility. The paper also gives the principle of the sphere-structured method in detail and compares it with other multi-class algorithms.
References
[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[2] U. Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods. The MIT Press, 1999.
[3] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, Department of Computer Science, 1998.
[4] Zhu Meilin, Liu Xiangdong, Chen Shifu. Solving Multi-class Pattern Recognition with Sphere-structured Support Vector Machines. Journal of Nanjing University, 2003, 39(2).
HIPRICE-A Hybrid Model for Multi-agent Intelligent Recommendation ZhengYu Gong, Jing Shi, and HangPing Qiu
Computer Department, Nanjing University of Science and Technology, 210007 Nanjing, China [email protected]
Abstract. We explore the acquisition of user profiles by unobtrusively monitoring the browsing behaviour of users, applying supervised machine-learning techniques coupled with an ontological representation to extract user preferences. The HIPRICE recommender system is presented and an empirical evaluation of this approach is conducted. The performance of the integrated system is measured and presented as well.
1 Introduction In response to the challenge of information overload, we have sought to develop useful recommender systems--systems that people can use to quickly identify content that will likely interest them. Recommender systems suggest items of interest to users based on their explicit and implicit preferences, the preferences of other users, and user and item attributes. The two major classes of adaptive recommendation service are content-based (recommending items based on some analysis of their content) and collaborative (recommending items based on the recommendations of other users). We attempt to establish a hybrid recommender system, by combining content-based information filtering and collaborative filtering techniques – the system is known as HIPRICE.
2 The Framework of HIPRICE

HIPRICE is a hybrid recommendation system combining both content-based and collaborative filtering techniques. A machine-learning algorithm classifies browsed URLs overnight and saves each classified page in a central page store. Explicit feedback and browsed topics form the basis of the interest profile for each user. The system monitors user browsing behavior via a proxy server, logging each URL browsed during normal work activity. A set of recommendations is computed, based on correlations between user interest profiles and classified page topics. Any feedback offered on the recommendations is recorded when the user looks at them.
Fig. 1. The HIPRICE recommender system
2.1 Ontology in HIPRICE
An ontology is a conceptualization of a domain into a human-understandable, but machine-readable format consisting of entities, attributes, relationships, and axioms [1]. Ontologies can provide a rich conceptualization of the working domain of an organization, representing the main concepts and relationships of the work activities and commonly used objects. In this paper we use the term ontology to refer to the classification structure and instances within the knowledge base. Upon start-up, the ontology provides the recommender system with an initial knowledge classification, which can overcome the new-system-cold-start problem. 2.2 Page Classification
We chose to use a nearest neighbor technique for the classification requirements. Nearest neighbor algorithms degrade gracefully, with the next closest match being reported if the correct one is not found. We use the boosting technique AdaBoostM1 [2], which works well for multi-class problems if the boosted classifier is strong enough. The AdaBoostM1 algorithm's class probability value for a web page (somewhere between 0 and 1) is used as the classification confidence.

2.3 Learning the User's Interest in HIPRICE

From navigation patterns we can confirm that users are more interested in finding useful information about one subject and not another; such patterns can be regarded as common-interest frequent navigation patterns for that subject. In HIPRICE, each user has a user-interest model which contains the subject objects that the user is interested in, each object having its pertinent keywords and a weight. At first, the weights are provided by the user when he first comes to use the system, and the subject information is retrieved from the common-domain knowledge base endowed at the beginning. As HIPRICE learns more and more about the user's interests, the weights become differentiated from each other: the more interesting the subject object or topic, the higher its weight. In order to save space, objects in which a user is no longer interested should be removed from the user-interest model. To this end, the weights are reduced after a period of time; if the user shows no sign of interest, which causes the weight to decrease, an object whose weight falls under a threshold is removed. We use a keyword weight vector to keep track of a user's profile. Formally, the interest model for a user u is represented as a weight vector W_u:
W_u = (w_{u,1}, w_{u,2}, …, w_{u,k}, …, w_{u,d})    (2)
where w_{u,k} is the weight of the k-th term in the model and ‖W_u‖ = 1; d is the number of terms used for describing the models (formally, the same as the number of terms for representing documents). The subsequent retrieval and user-feedback process expands and updates the weights of the model terms adaptively.

Step 1: Computing the observed user interest. In HIPRICE, an interface agent (IA) presents the retrieval results to the user. It also observes the user's behavior by "looking over his (or her) shoulder" to learn his interests. The IA first presents the user with a number of recommended documents, collected according to the initial user model. Once a total of M documents have been chosen and presented, the user either reads or browses (or ignores) the documents. The IA analyzes the user's behavior on the filtered documents. We capture five statistical factors from the user's actions: reading time (rt), bookmarking (bm), scrolling (sc), saving (sa), and following up (fl) on the hyperlinks in the filtered documents. The total feedback weight, representing the level of interest the user has in document i, is computed as:
o(i) = ∑_{v∈F} c_v f_v(i)    (3)
where F = {bm, fl, sa, rt, sc} is the set of implicit feedback factors, and c_v is the weight for each factor, initially determined by the statistical distribution of the data accumulated beforehand.

Step 2: Computing the changing weight values in the user model. If a page p which the user visited has the keyword k in the user's interest model, then k's weight is updated as:
w_i(t + δ) = d(δ) · (r(i) · o(i) / D) + w_i(t)    (4)
where r(i) is the degree of relevance of the i-th keyword k to the page p, computed as the TFIDF score of k in the document; δ is a time interval; o is the total score of implicit feedback obtained by observing the user's interaction with the page; w(t) is the old weight value of the i-th keyword k, and w(t + δ) is the new keyword weight after updating (the subscript i denotes the i-th keyword); D is a constant for normalizing the weights; finally, d(δ) is a damping factor. The weight of every topic, in terms of keyword k, in the interest model decreases with time. The damping rule is expressed as
d(δ) = 1 − δ/Δ    (5)
where Δ is a normalizing factor that keeps the d(·) function between zero and one.

Step 3: Content-based recommendation algorithm. In HIPRICE, content-based recommendations are formulated from the correlation between the user's current topics of interest and the papers classified as belonging to those topics. Given a set of n web pages and m classes provided by the ontology, first form an n-by-m matrix C, whose element c_ij is the classification confidence of page i for class j, obtained using a machine learning algorithm such as AdaBoost. Given the m classes, a user's interest model can be formed as an m-element array W, whose element w_i is the interest value of the user for class i. A paper is only recommended if it does not appear in the user's browsed URL log, ensuring that recommendations have not been seen before. For each user, the top three topics of interest are selected with a minimum support threshold of 10 recommendations made in total. Papers are ranked in order of recommendation confidence before being presented to the user, according to the following formula:

V = C_{n×m} × W_{m×1}    (6)

where C_{n×m} is computed using a classification algorithm and W_{m×1} is the user profile of formula (2).
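A compact sketch of Steps 1-3 follows. The feedback weights c_v and the constants D and Δ are illustrative stand-ins for the statistically determined values described above, and all names are ours.

```python
import numpy as np

# Illustrative feedback weights c_v; the real values are determined
# statistically, as described in Step 1 (these numbers are assumptions).
C_V = {"rt": 0.3, "bm": 0.25, "sa": 0.2, "fl": 0.15, "sc": 0.1}

def observed_interest(feedback):
    """o(i) = sum over v in F of c_v * f_v(i), cf. equation (3)."""
    return sum(C_V[v] * feedback.get(v, 0.0) for v in C_V)

def update_profile(W, r, o, delta, D=10.0, DELTA=100.0):
    """Update keyword weights per (4) with the damping factor of (5).

    W: current keyword weight vector; r: per-keyword TFIDF relevance of
    the visited page; o: observed interest from equation (3).
    """
    d = 1.0 - delta / DELTA          # d(delta), equation (5)
    W = d * (r * o / D) + W          # equation (4), elementwise
    return W / np.linalg.norm(W)     # renormalize so ||W_u|| = 1 (our choice)

def recommend(Cmat, W, top=3):
    """Rank pages by V = C_{n x m} W_{m x 1}, cf. equation (6)."""
    V = Cmat @ W
    return np.argsort(V)[::-1][:top]
```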
3 Conclusions

In this paper we showed how to combine the content information of web pages with users' interests in order to make recommendations. The next step for this work is to run more trials and perform rigorous statistical analysis on the results; as the number of subjects increases, we can become increasingly confident of the power of the recommendations. Additionally, visualizing the profile knowledge will allow users to build a better conceptual model of the system, helping to produce a feeling of control and eventually trust in the system.
References
[1] N. Guarino, P. Giaretta. Ontologies and knowledge bases: towards a terminological clarification. In: Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, pages 25–32, 1995.
[2] Y. Freund, R.E. Schapire. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, 1996.
A Database-Based Job Management System Ji-chuan Zheng, Zheng-guo Hu, and Liang-liang Xing Department of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072, P.R. China, Tel:(086)029-8495821 ext.510 Fax:(086)029-8495772 [email protected], [email protected], [email protected]
Abstract. By combining database and job management technology, this paper designs a database-based job management system (JMS), called DB-based JMS. The system architecture is described, and the functions and relationships of the components in this system are defined. Aiming at network computing environments, two kinds of DB-based JMS cluster models are provided; their working modes are detailed, and their advantages and disadvantages are compared. In addition, scheduling granularity is also discussed.
1 Introduction

In the information era it is unavoidable that jobs become associated with databases. For example, mobile operators send notices to customers via short messages at specified times, systems automatically notify the administrator by e-mail when an error occurs, and so on. Modern commercial database management systems (DBMS), such as Oracle 9i [1] and SQL Server 2000 [2], have preliminary functions for defining and executing jobs, but these job management functions are very limited: they can only express simple job dependencies, and they are mainly used to back up data and/or notify users. A JMS is a complex piece of middleware that involves many technologies, such as data access, distributed processing, high availability, security, high-quality communication, transaction processing, failure recovery, etc. In a traditional JMS all of these functions must be implemented within the JMS itself, so developing a new JMS takes a long time. In addition, traditional JMSs, for instance the NQS-based JobCenter [3] system and the leading product LSF [4], store their data directly in the OS file system. Although effective, this is inconvenient for management and poor for sharing data. A traditional JMS can usually only be accessed within a LAN, so the number of users is small; even among JMSs that support web access, many employ the inefficient CGI method. The job management system (JMS) is widely used for its powerful functions of job scheduling, controlling, tracking and statistics. In particular, a load-balancing JMS, making full use of distributed computing resources, caters to the current trend of network computing and has a bright future. Hence, it is necessary to exploit a new method of developing a JMS. In fact, many of the problems mentioned above can be successfully solved by an up-to-date DBMS.
The main advantages of adopting database technology include: 1) It is easy to integrate with the web; generating web pages from a database can be accomplished with mature tools. 2) SQL can be used to search all kinds of information about the JMS; since the searching statement is probably not created by the user directly, the implementation can be simplified. 3) It is convenient for the scheduler to choose a proper job and machine because jobs and cluster nodes are managed centrally. 4) Data, including job and job network (JNW) definitions, schedule rules, calendar rules, etc., can be shared across the whole enterprise. 5) A DB-based JMS is apt to integrate with other database systems, so more kinds of jobs can be provided. Section 2 describes the system model, which includes the single machine model and the cluster model. Section 3 discusses the scheduling granularity. The last section is the conclusion of this paper.
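Before turning to the system model, advantage (2) above can be made concrete with a few lines of SQL. The sketch below (using Python's sqlite3 for brevity) shows a hypothetical slice of such a central schema and a monitoring query; every table and column name here is illustrative, not taken from the system described below.

```python
import sqlite3

# A hypothetical slice of the central schema such a DB-based JMS could keep.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE jnw_def (
    jnw_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    xml_def TEXT NOT NULL            -- job/JNW definition stored as XML
);
CREATE TABLE job_run (
    run_id  INTEGER PRIMARY KEY,
    jnw_id  INTEGER REFERENCES jnw_def(jnw_id),
    node    TEXT,                    -- cluster node executing the job
    status  TEXT                     -- e.g. QUEUED / RUNNING / DONE
);
""")

# Advantage (2): an ordinary SQL query answers a monitoring question directly.
for node, n in con.execute(
        "SELECT node, COUNT(*) FROM job_run "
        "WHERE status = 'RUNNING' GROUP BY node"):
    print(node, "is running", n, "jobs")
```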
2 System Model

The DB-based JMS we designed is a hierarchical system. The lowest layer is the operating system (OS). It serves the DBMS, NQS/DQS and the JMS. The DBMS and NQS/DQS offer services to the JMS; the implementation of the JMS depends on the functions of the OS, DBMS and NQS/DQS. The reason we adopt NQS/DQS is to keep compatibility with old JMS systems; if we developed a new JMS, an equivalent module to NQS/DQS could be placed into the JMS layer. The JMS API layer provides an interface between the JMS and JMS applications. The top layer includes client applications and managing applications; it connects to the JMS by using the JMS API.

2.1 Single Machine Model

Since the JMS possesses many functions, it has a lot of components. The DB-based JMS we designed employs the classic C/S structure and closely integrates with Internet technology. The web accessing method is detailed in reference [7]. Fig. 1 shows the main components of the JMS and their relationships.

[Fig. 1. Single machine model of the DB-based JMS — a block diagram whose components are Client, Server Agent, Job&JNW Define, Res&Load Collector, Searcher, Central DB, Execute Subsys, Command Dispatcher, Queue Sys, Archive SubSys, JNW Engine, Scheduler, Calendar Scheduler and NQS/DQS]
Here, many kinds of data, including system settings, job definitions, job execution history records, queue properties, resource and load information, the status of executing jobs, etc., are stored in the central database. In our system we introduce the concept of a job network (JNW). A JNW is a set of jobs which describes the logical dependencies of the jobs and determines their executing order. The Job&JNW Define module defines and modifies the definitions of jobs and JNWs, writes the corresponding information into the database and checks the validity of JNWs. The JNW Engine module controls the execution of a JNW, ensuring that the jobs execute in an order consistent with their dependencies; the status of an executing JNW is maintained by the JNW Engine. The other modules are similar to the corresponding modules of a traditional JMS; detailed descriptions can be found in reference [8]. The meanings of the lines in Fig. 1 are as follows: L1, the client sends a request to the server and receives the result; L2, the server agent passes the client request to the command dispatcher; L3, the definitions of jobs and JNWs are expressed in XML in order to extend functions and exchange information; L4, information on jobs, JNWs, execution records and configuration is searched; L5–7 and L9–11, configuration or rules are set; L8, controlling the executing JNW; L12–20, depositing and fetching the corresponding data; L21, the JNW Engine gets resource and load information from the Res&Load Collector; L22, submitting jobs to the target queues; L23, archiving JNWs according to the archiving rules; L24, submitting JNWs to the JNW Engine; L25, scheduling JNWs to run according to the calendar rules.

2.2 Cluster Model of DB-Based JMS

Several DB-based JMSs can be grouped into a cluster so as to balance network load and provide high availability. A cluster has a unique name as an identifier, often in domain form, for example cluster1.nwpu.edu.cn. To a client, the cluster looks the same as a single computer; the client connects to the cluster through its published name. According to the number of DBMSs, clusters can be categorized into two types: one has only one DBMS, called a UniDB-JMS-Cluster; the other has several DBMSs, called a MultiDB-JMS-Cluster. The two types are described below.

A UniDB-JMS-Cluster has only one DBMS. For example, suppose a UniDB-JMS-Cluster consists of three JMSs: JMS0 is the master, JMS1 and JMS2 are computing nodes, and the three nodes share one DBMS. Data is stored in a central database so that management is easy. The cluster provides service through a cluster name or a cluster IP; when the cluster name is used, a domain resolution entry is usually added to the DNS server. For instance, suppose the IP addresses of the three nodes are 192.200.9.185 (JMS0), 192.200.9.186 (JMS1) and 192.200.9.187 (JMS2), and the IP address of the cluster is 192.200.9.188; the master node then has two addresses, an internal IP and the cluster IP. When a client connects to the cluster, it uses the cluster name or the cluster IP. While cooperating, each node is autonomous and can serve clients directly. In order to transfer jobs across the network to achieve load balance, the cluster administrator must configure the related nodes, for instance by mapping a queue on one node to queues on other nodes. A node must negotiate with the master when it wants to join a cluster. There are two approaches.
One approach is that a node applies actively to join a cluster, and the master accepts the request and modifies the configuration; the other is that the master invites a node to join, and the node chooses whether to join or not. There are likewise two ways for a node to exit from a cluster: either the node issues a request and negotiates with the master, or the cluster master kicks the node out forcibly. The cluster master can monitor all the nodes. If one node fails, jobs won't be transferred to it. If the master fails, all the nodes choose a new master according to the configuration; when the old master recovers, the new master returns to its original position as a normal node. This kind of cluster is fit for a high-speed LAN or a small LAN in which only one DBMS can be used. Its main advantages are: 1) easy management, because of data concentration; 2) easy data sharing; 3) easy implementation, e.g. by using the IFS [9] of Oracle 9i, which is capable of storing all kinds of data, such as JMS system information, application information, and even executable binary code. The disadvantages are: 1) if the DBMS fails, the whole system fails (certainly, redundant components can improve availability; for instance, MSCS [10] can be used to support the DBMS); 2) extensibility is poor: if the number of nodes increases greatly, the load on the DBMS grows markedly.

A MultiDB-JMS-Cluster is fit for scheduling enterprise-wide jobs where different departments have different databases. We can also call this cluster a grid, i.e. a cluster of clusters; here a node can be either a normal node or a cluster. This kind of cluster has three main advantages: 1) data is distributed, which satisfies the enterprise requirement; 2) extensibility is strong; 3) data can be replicated directly by using the data-distribution mechanisms of commercial DBMSs, such as the instance view of Oracle and the Publisher/Subscriber model of SQL Server. However, in order to improve efficiency and extensibility, some data has to be copied to different DBMSs, and maintaining the consistency of the replicated data consumes some resources. This is the only disadvantage of this model.
3 Scheduling Granularity

The DB-based JMS we designed provides two granularities of workload balancing: coarse granularity, with the JNW as the scheduling unit, and medium granularity, with the job as the scheduling unit. In our system the unit of submission is the JNW. First, a JNW is submitted to the JNW engine, where it is decomposed into unit jobs; the executing order of these unit jobs is ensured by the JNW engine. Second, the decomposed jobs are submitted to NQS/DQS or other queue systems. Lastly, the queue systems transfer the jobs to executors. Here, JNW scheduling is coarse-grained and job scheduling is medium-grained.
4 Conclusion

Based on database technology, a distributed, heterogeneous, highly available and extensible JMS is designed in this paper. Two kinds of cluster models of the DB-based JMS are presented, which can meet the needs of applications. The coarse- and medium-granularity workload-balancing mechanisms make it possible to make full use of the computing resources of an enterprise.
References
1. Scott Urman, Oracle 9i Advanced PL/SQL Programming, Chinese edition, China Machine Press, 2001
2. Mark Spenik, Microsoft SQL Server 2000 DBA Survival Guide, Sams, 2001
3. JobCenter User's Guide, Version 10.1, NEC Corporation, 2001
4. LSF JobScheduler Administrator's Guide, version 4.0, Platform Computing Corporation, 2000
5. LSF JobScheduler User's Guide, version 4.0, Platform Computing Corporation, 2000
6. LSF Analyzer Guide, version 4.0, Platform Computing Corporation, 2000
7. Zheng Ji-chuan, Hu Zheng-guo, "Design and Implementation of Web-based JMS", Computer Applications, Vol. 23, No. 1, Science Press, 2003
8. Li Zhong-liang, A Job Management System Based on JDL, Xi'an, Northwestern Polytechnical University, 1999
9. Steve Vandiver, Kelly Cox, "Oracle 9i Application Server Portal Handbook", Oracle Press, 2002
10. Microsoft Cluster Server Basics, http://www.nwnetworks.com/mscsbasics.htm
Optimal Choice of Parameters for a Density-Based Clustering Algorithm*

Wenyan Gan¹ and Deyi Li²

¹ Nanjing University of Science and Technology, Nanjing 210007, China, [email protected]
² Institute of Electronic System Engineering, Beijing 100039, China, [email protected]

* Supported by the National Natural Science Foundation of China under Grant No. 69975024.
Abstract. Clustering is an important and challenging task in data mining. As a generalized density-based clustering method, the DENCLUE algorithm has many remarkable properties, but the quality of its clustering results strongly depends on an adequate choice of two parameters: the density parameter σ and the noise threshold ξ. In this paper, by investigating the influence of these two parameters on the clustering results, we first show that an optimal σ should be chosen to obtain good clustering results. Then an entropy-based method is proposed for the optimal choice of σ. Further, the noise threshold ξ is estimated so as to produce a reasonable clustering pattern. Finally, experiments are performed to illustrate the effectiveness of our methods.
1 Introduction DENCLUE algorithm [1] is a kind of generalized density-based clustering methods, which has many outstanding properties, such as capability of discovering clusters of arbitrary shapes, good scalability for the size of databases, high dimensions and large amounts of noisy data, etc. But the quality of clustering results strongly depends on the careful choice of two parameters: density parameter σ and noise threshold ξ, which seriously restricts its wide application. In this paper, by investigating the influence of the two parameters of DENCLUE algorithm on the clustering results, we firstly show that an optimal σ should be chosen to obtain good clustering results. Then, an entropy-based method is proposed for the optimal choice of σ. Further, noise threshold ξ is estimated to produce a reasonable pattern of clustering. Finally, some experiments are performed to demonstrate the effectiveness of our method.
2 DENCLUE Algorithm

Given n objects D = {x_1, …, x_n} in a d-dimensional data space F^d, the basic idea of the DENCLUE algorithm [2] can be formalized as follows.

[Def. 1] The influence of a data point y at a point x is a function f_B^y : F^d → R⁺.

[Def. 2] The density at a point x is defined as the sum of the influences of all data points,
f_B^D(x) = ∑_{i=1}^n f_B^{x_i}(x)    (1)

If a Gaussian influence and density function are considered, then we have

f_Gauss^D(x) = ∑_{i=1}^n exp(−‖x − x_i‖² / (2σ²))    (2)
[Def. 3] Density-attractors are local maxima of the overall density function. A point x is density-attracted to a density-attractor x* if there exists a sequence of points x_0, x_1, …, x_k ∈ F^d such that x_0 = x, x_k = x* and the gradient at x_{i−1} is in the direction of x_i, 0 < i ≤ k.
[Def. 4] A center-defined cluster for a density-attractor x* (with f_B^D(x*) > ξ) is the subset C ⊆ D which is density-attracted by x*.
[Def. 5] A multi-center-defined cluster consists of a set of center-defined clusters which are linked by a path with significance ξ, that is, the density at each point along the path is no less than ξ.
Obviously, there exist two important parameters, namely the density parameter σ and the noise threshold ξ. The density parameter σ determines the influence of a point in its neighborhood and therefore determines the number of density-attractors, while the noise level ξ describes whether a density-attractor is significant. In the DENCLUE algorithm, the developers believe that σ is closely related to the hierarchical levels, and that a hierarchy can be built up by varying σ from σ_min to σ_max, where σ_max is the smallest σ that yields only one density-attractor and σ_min is the largest σ that yields n density-attractors. But there is a problem: does the hierarchy generated by varying the density parameter exactly correspond to the inherent hierarchical structure of the original data distribution? Hierarchical methods usually work by grouping data objects into a tree of clusters that splits the dataset recursively into small subsets. At the merging scale, each small cluster fits in whole inside a larger cluster, and a datum is not permitted to change cluster membership once an assignment has been made. However, in the DENCLUE algorithm, since density-attractors are local maxima of the density function, i.e. the points satisfying ∇f_B^D(x) = 0, there are two possible types of cluster merging as σ increases from σ_min to σ_max, namely pitchfork merging and saddle-node merging [3]. In a pitchfork merging, two cluster centers smoothly merge into one supercluster center, while in a saddle-node merging a cluster center suddenly disappears and the data points assigned to it are siphoned into different cluster centers. Clearly, data will change cluster membership in the latter case. Moreover, from the theoretical point of view, in order to obtain a meaningful hierarchy it is a prerequisite that the number of density-attractors decreases monotonically as σ increases from σ_min to σ_max; however, a simple example [4] has shown that this does not hold even in the two-dimensional case. Thus, we consider that the parameter σ has no inherent relation with the hierarchical levels, and an optimal σ should be chosen to obtain good results.
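The density function (2) and the density-attractor notion of Def. 3 are easy to prototype; a minimal NumPy sketch follows (our function names, with a simplified fixed-step gradient climb standing in for the hill-climbing used in DENCLUE [2]; step, tol and max_iter are illustrative values).

```python
import numpy as np

def f_gauss(x, X, sigma):
    """Overall Gaussian density f_Gauss^D(x) of equation (2)."""
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2)).sum()

def attractor(x, X, sigma, step=0.05, tol=1e-5, max_iter=1000):
    """Gradient ascent toward a density-attractor (cf. Def. 3)."""
    for _ in range(max_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
        grad = ((X - x) * w[:, None]).sum(axis=0) / sigma ** 2
        norm = np.linalg.norm(grad)
        if norm < tol:          # (near-)stationary point: a local maximum
            break
        x = x + step * grad / norm
    return x
```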
3 Optimal Choice of the Parameters σ and ξ

In information theory [5], entropy is a measure of uncertainty. For the density distribution with a given σ, if the densities of the data points all equal one another, we are most uncertain about the underlying distribution and the entropy is largest. On the other hand, if the data points have a highly skewed density distribution, the uncertainty and the entropy will be smallest. Thus, we can obtain an optimal σ by minimizing an entropy measure based on the density distribution.

[Def. 6] Given n data points D = {x_1, …, x_n} in the data space, the density entropy based on the density distribution can be defined as

H = −∑_{i=1}^n (f_B^D(x_i)/Z) log(f_B^D(x_i)/Z)    (3)

where Z = ∑_{i=1}^n f_B^D(x_i) is a normalization factor. For any σ ∈ [0, +∞], the density entropy H satisfies: H ≥ 0; H ≤ log(n); and H = log(n) ⇔ f_B^D(x_1) = … = f_B^D(x_n).
Clearly, if σ is too small or too large, H will approach the maximum entropy, and at a certain σ, H will achieve a global minimum, which corresponds to an optimal σ. In practical applications, the optimal choice of σ can be executed over a sampling subset of the original data, and the global optimization of the density entropy H can be performed by employing a simulated annealing method. Once σ is known, the results of clustering depend on the noise threshold ξ. Since practical databases always contain large amounts of noisy data, we estimate ξ as follows:
ξ = D_N · c · (2πσ²)^{d/2}    (4)

where d is the number of dimensions and c is a constant, 0 < c < 1.
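Def. 6 and the search for the optimal σ are straightforward to prototype; in the sketch below, a plain grid search stands in for the simulated annealing suggested above, and `grid` is a user-supplied candidate list (all names are ours).

```python
import numpy as np

def density_entropy(X, sigma):
    """Density entropy H of Def. 6 for a given sigma."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    f = K.sum(axis=1)           # f_B^D(x_i) at every data point
    p = f / f.sum()             # divide by the normalization factor Z
    return float(-(p * np.log(p)).sum())

def optimal_sigma(X, grid):
    """Pick the sigma in `grid` minimizing H."""
    return min(grid, key=lambda s: density_entropy(X, s))
```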
4 Experiments and Conclusion

To illustrate the effectiveness of our approach, we performed two experiments on a personal computer with 256 MB of RAM and a Pentium 933 MHz CPU. Data set 1 contains 1000 data points in the shape of nested rings (figure 1(a)), while data set 2 contains 2000 data points in the shape of parallel lines (figure 2(a)). To show the robustness of our approach with respect to noise, data set 1 contains 6.7% uniformly distributed noisy data, while data set 2 contains 12.5% noisy data. The clustering results are shown in figures 1(b), 1(c) and 2(b), 2(c) respectively. Obviously, the resulting clusters correctly reflect the underlying distribution of the original data.
[Fig. 1 (a)–(c). Data set 1 and its arbitrary-shape clusters for σ=0.008 and ξ=0.632]
[Fig. 2 (a)–(c). Data set 2 and its arbitrary-shape clusters for σ=0.018 and ξ=2.367]
In this paper, by investigating the influence of the two parameters of the DENCLUE algorithm on the clustering results, an entropy-based method is proposed for the optimal choice of the density parameter σ, and the noise threshold ξ is further estimated to produce a reasonable clustering pattern. Experimental results show that, with the parameters chosen by our approach, the DENCLUE algorithm obtains good results consistent with the underlying data distribution.
References
1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers (2001)
2. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (1998) 58–65
3. Yee Leung, Jiang-She Zhang, et al.: Clustering by scale-space filtering. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 22 (2000) 1396–1410
4. Lindeberg, T.: Scale-space for discrete signals. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 12 (1990) 234–254
5. Jan C. A. van der Lubbe: Information Theory. Cambridge University Press (1997)
6. Dixon, W.J., Kronmal, R.A.: The choice of origin and scale for graphs. Journal of the Association for Computing Machinery, Vol. 12 (1996) 259–261
An Improved Parameter Tuning Method for Support Vector Machines

Yong Quan and Jie Yang

Inst. of Image Processing & Pattern Recognition, Shanghai Jiaotong Univ., Shanghai 200030, People's Republic of China
Correspondence: Room 2302B, No.28, Lane 222, Fanyu Road, Shanghai 200052, People's Republic of China
[email protected]

Abstract. Support vector machines (SVMs) are a very important tool for data mining. However, the need to tune parameters manually limits their application in practical environments. In this paper, after analyzing the limitations of existing approaches, a new methodology for tuning kernel parameters, based on the computation of the gradient of a penalty function with respect to the RBF kernel parameters, is proposed. Simulation results show the feasibility of this new approach and demonstrate an improvement in generalization ability. Keywords: support vector machine, tuning parameter, data mining
1 Introduction In this paper, we will discuss the generalization error bound of Support Vector Machine (SVM) and use the bound to adaptively predict the kernel parameters which yield the best generalization for SVM.
1.1 Bound on Generalization Error

Since for the SVM there is an upper bound on the VC-dimension [1], the VC-dimension can be used to get an upper bound on the expected error on an independent test set in terms of the training error. According to Vapnik [2], the following bound holds: define

T = (1/l) R² ‖w‖²    (1)

where w is the weight vector and R is the radius of the smallest sphere that contains all the data.
1.2 Computing R

In order to compute (1), one needs to compute R, the radius of the smallest sphere enclosing the training data in feature space. From [3], R can be expressed as (k̃ is defined in [1]):

R² = ∑_{i=1}^l λ_i⁰ k̃(x_i, x_i) − ∑_{i,j=1}^l λ_i⁰ λ_j⁰ k̃(x_i, x_j)    (2)
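Once the (modified) kernel matrix and the optimal multipliers λ⁰ of the enclosing-sphere problem are available, (2) is a one-liner; a sketch (names are ours):

```python
import numpy as np

def radius_squared(K, lam):
    """R^2 of equation (2): K is the (modified) kernel matrix k~(x_i, x_j)
    and lam the vector of optimal multipliers lambda^0_i."""
    return lam @ np.diag(K) - lam @ K @ lam
```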
2 Adaptive Tuning Kernel Parameters with Penalty Function
In this paper, we take the Gaussian kernel k(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)) as the kernel function. From (1) we can see that T is differentiable with respect to C and σ², and both C and σ² are subject to positive values. So general constrained optimization methods can be used to tune these parameters automatically. Here we adopt the penalty function method as the optimization method.
2.1 Constructing the Penalty Function

Chapelle has put forward a parameter selection method for kernels [4]. That method directly optimizes T with respect to the kernel parameters using Newton's method. As we all know, Newton's method is an unconstrained optimization method: it puts no constraints on the kernel parameters, so in practice the kernel parameters often step into an invalid region and lead to incorrect results; for example, C often becomes negative when tuned with Newton's method directly. To solve this problem, we treat the optimization of T as a constrained problem with respect to the kernel parameters and apply a penalty function to update the kernel parameters at each step. Introducing a penalty term, we define the penalized problem as follows:

Minimize T(θ) subject to c_i(θ) ≥ 0, i = 1, …, m    (3)

Introducing penalty terms h(c_i(θ)), the penalty function can be considered in the form

P_μ = T(θ) + μ ∑_{i=1}^m h(c_i(θ))    (4)

where m is the number of constraints, the h(c_i(θ)) are the penalty terms, θᵀ = (C, σ²) is the parameter vector, and μ is a penalty constant which determines the trade-off between T and the penalty terms. To solve (3), we only consider the case h(c_i(θ)) = 1/c_i(θ) for simplicity. Then (4) becomes

P_μ = T(θ) + μ (1/σ² + 1/C)    (5)

Equation (5) has the property that when C or σ² approaches zero, P_μ becomes infinite. It has already been proven [5] that if C and σ² are initialized to valid values, all values yielded during the optimization of (5) are valid. Thus we can take advantage of Newton's method to obtain the optimal solution.
2.2 Updating Kernel Parameters with the Penalty Function

Before updating the kernel parameters with the penalty function, we should consider the following statement. Suppose the function T(θ) is given by T(θ) = min_α F(θ, α), and let α(θ) denote the solution of the minimization problem, i.e. ∂F/∂α = 0 at α = α(θ). Hence,

∂T/∂θ = ∂F/∂θ + (∂F/∂α)(∂α/∂θ) = ∂F/∂θ

Thus the gradient of T with respect to θ can be obtained simply by differentiating F with respect to θ, as if α had no influence on θ. Adopting Newton's method, the updating process for the kernel parameters of (5) is obtained:

θ_{k+1} = θ_k − [∇²P_μ(θ_k)]⁻¹ ∇P_μ(θ_k)    (6)

where ∇²P_μ(θ_k) is the Hessian matrix:

∇²P_μ(θ_k) = ( ∂²P_μ(θ_k)/∂C²      ∂²P_μ(θ_k)/∂C∂σ²
               ∂²P_μ(θ_k)/∂σ²∂C    ∂²P_μ(θ_k)/∂(σ²)² ),
∇P_μ(θ_k) = ( ∂P_μ(θ_k)/∂C
              ∂P_μ(θ_k)/∂σ² )    (7)

Let Λ⁰ = (α_1⁰, …, α_l⁰)ᵀ. Similarly to (2), we can get

‖w‖² = 2W(α) = 2Λ⁰ᵀ1 − Λ⁰ᵀ Y K̃ Y Λ⁰    (8)
Combining (1), (2), (5), (7) and (8) yields the adjusting results.

2.3 Termination Criterion

The termination criterion also has a great influence on the accuracy of the optimization and on the computing time. In practice, a suitable termination criterion is

|P_μ(θ_{k+1}) − P_μ(θ_k)| ≤ 10⁻⁴ P_μ(θ_k)
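Under this criterion, the whole tuning loop fits in a few lines. The sketch below replaces the Newton step (6) with a derivative-free minimizer, since T must be re-evaluated by retraining the SVM; the solver choice, the constants and all names are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def tune(T_of, theta0=(3.0, 0.5), mu=0.1, tol=1e-4):
    """Minimize P_mu(theta) = T(theta) + mu*(1/sigma^2 + 1/C), cf. (5).

    T_of(C, sigma2) must retrain the SVM and return T = R^2 ||w||^2 / l.
    """
    def P(theta):
        C, s2 = theta
        if C <= 0 or s2 <= 0:
            return np.inf                 # outside the valid region
        return T_of(C, s2) + mu * (1.0 / s2 + 1.0 / C)

    res = minimize(P, np.asarray(theta0, dtype=float),
                   method="Nelder-Mead", options={"fatol": tol})
    return res.x                          # tuned (C, sigma^2)
```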
3 Computational Experiments and Discussion

In this paper we focus only on the SVM with an RBF kernel. For the given estimator, goodness is evaluated by comparing the true minimum of the test error with the test error at the optimal kernel parameter set found by minimizing the estimate. We ran the simulations on four benchmark datasets: Banana, Tree, Image and Splice. Detailed information concerning these datasets can be found in [6].
Fig. 1. T contour plots for dataset Banana (a) and dataset Tree (b). + denotes points generated by the tuning algorithm starting from the initial conditions
From figure 1 we can see how these parameters vary during the tuning period. Here, we initialize the pair (C, σ²) as (3, 0.5) and (1, 3) respectively. The optimization process generates a sequence of points in the space of kernel parameters. Successive points attempted by the process are located not so far from each other and converge to the optimal parameter values. These figures verify that the method is feasible for the tuning of kernel parameters.
4 Conclusion

In this article we present a new method of kernel parameter estimation that is especially useful for solving the RBF kernel parameter selection problem; it is suitable for problems with multiple variables. We demonstrate that, using this technique, we can not only predict optimal values for the kernel parameters but also evaluate the relative performance of different parameter values. However, there are still many possible improvements for future research. First, as is well known, Newton's method requires the criterion function to be differentiable, whereas most criterion functions that give a more accurate prediction of the test error are non-differentiable; we should therefore find a method that can take advantage of non-differentiable criterion functions. Second, we should find a pre-estimation method to estimate the region in which the optimal kernel parameters may lie, so that we can reduce the search space of the parameters.
References
[1] Vapnik V. Statistical Learning Theory. John Wiley, New York, 1998
[2] Joachims T. Estimating the generalization performance of an SVM efficiently. In: Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, 2000
[3] Schölkopf B. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997
[4] Chapelle O, Vapnik V, Bousquet O, et al. Choosing multiple parameters for support vector machines. Machine Learning, 2002, 46: 131–159
[5] Ben-Daya M, Al-Sultan K.S. A new penalty function algorithm for convex quadratic programming. European Journal of Operational Research, 1997, 101(1): 155–163
[6] Gunnar Rätsch. Benchmark datasets, http://ida.first.gmd.de/~raetsch/data/benchmarks.htm
Approximate Algorithm for Minimization of Decision Tree Depth Mikhail J. Moshkov Faculty of Computing Mathematics and Cybernetics of Nizhny Novgorod State University 23, Gagarina Av., Nizhny Novgorod, 603950, Russia [email protected]
Abstract. In this paper a greedy algorithm for minimization of decision tree depth is described and bounds on the algorithm's precision are considered. The algorithm is adapted for application to data tables with both discrete and continuous variables, which can have missing values. To this end we transform a given data table into a decision table. Under a natural assumption on the class NP, the considered algorithm is close to unimprovable approximate polynomial algorithms for minimization of decision tree depth. Keywords: data table, decision table, decision tree, depth
1 Introduction
Decision trees are widely used in different applications as algorithms for task solving and as a way of knowledge representation. Problems of decision tree optimization are very complicated. In 1983 the paper [3] was published which, in particular, contained descriptions of algorithms for decision tree construction (in that paper decision trees were called conditional tests). In these algorithms different measures of time complexity of decision trees (depth, weighted depth and others) and different measures of uncertainty of decision tables were used, and bounds on the precision of these algorithms were obtained. Bounds on the precision of polynomial approximate algorithms for the set covering problem, published by U. Feige in 1996 [2], show that some of the algorithms considered in [3] are, apparently, close to unimprovable polynomial approximate algorithms for the construction of decision trees with minimal depth. The algorithms considered in [3] are applicable only to decision tables in which all attributes take values from a fixed set {0, 1, . . . , s}, there are no missing values, and there are no equal rows. In this paper we choose one of the simplest greedy algorithms from [3] and adapt it for application to data tables with both discrete and continuous variables, which can have missing values, and with equal rows, which can be classified in different ways. To this end we transform a given data table into a decision table. We consider bounds on the precision of the algorithm.
The results of the paper were obtained partially within the framework of a joint research project of the Intel Nizhny Novgorod Laboratory and Nizhny Novgorod State University.
2 Data Tables and Checks
A data table D is a rectangular table with t columns, which correspond to variables x_1, . . . , x_t. The rows of D are t-tuples of values of the variables x_1, . . . , x_t. Values of some variables in some rows can be missing, and the table D can contain equal rows. The variables are separated into discrete and continuous: a discrete variable x_i takes values from an unordered finite set A_i, while a continuous variable x_j takes values from the set R of real numbers. Each row of the table D is labelled by an element from a finite set C. One can interpret these elements as values of a new variable y. The problem connected with the table D is to predict the value of y using the variables x_1, . . . , x_t. To this end we will not use the values of x_1, . . . , x_t directly; we will use the values of some checks depending on variables from the set {x_1, . . . , x_t}. A check is a function f depending on variables x_{i1}, . . . , x_{im} ∈ {x_1, . . . , x_t} and taking values from the set E = {0, 1, ∗}. Let r be a row of D. If the values of all variables x_{i1}, . . . , x_{im} are definite in r, then for this row the value of f(x_{i1}, . . . , x_{im}) belongs to the set {0, 1}. If the value of at least one of the variables x_{i1}, . . . , x_{im} is missing in r, then for this row the value of f(x_{i1}, . . . , x_{im}) is equal to ∗. Consider some examples of checks. In the CART system [1] the checks considered depend (in the main) on one variable x_i each. Let x_i be a continuous variable and a a real number. Then the considered check takes value 0 if x_i < a, value 1 if x_i ≥ a, and value ∗ if the value of x_i is missing. Let x_i be a discrete variable taking values from the set A_i, and let B be a subset of A_i. Then the considered check takes value 0 if x_i ∉ B, value 1 if x_i ∈ B, and value ∗ if the value of x_i is missing. It is also possible to consider checks depending on many variables. For example, let ϕ be a polynomial depending on continuous variables x_{i1}, . . . , x_{im}. Then the considered check takes value 0 if ϕ(x_{i1}, . . . , x_{im}) < 0, value 1 if ϕ(x_{i1}, . . . , x_{im}) ≥ 0, and value ∗ if the value of at least one of the variables x_{i1}, . . . , x_{im} is missing. Let F = {f_1, . . . , f_k} be a set of checks which will be used for prediction of the value of the variable y. We will say that two rows r_1 and r_2 are equivalent relative to F if each check f_i from F takes the same value on r_1 and r_2. The considered equivalence relation divides the set of rows of the table D into equivalence classes S_1, . . . , S_q. The rows from an equivalence class S_j are indiscernible from the point of view of the values of the checks from F. For j = 1, . . . , q we denote by C(S_j) the set of elements from C which are labels of rows from S_j. The maximal information which we can obtain about a row r ∈ S_j using checks from F is the following: the element from C which is the label of r belongs to the set C(S_j). For any r ∈ S_j we denote by C(r) the set C(S_j).
Now we can formulate exactly the problem P(D, F) of predicting the value of the variable y: for a given row r of the data table D we must recognize the set C(r) using the values of checks from F.
3 Decision Trees
As algorithms for solving the problem P(D, F) we will consider decision trees with checks from the set F. Such a decision tree is a finite directed tree with a root, in which each terminal node is labelled either by a subset of the set C or by nothing, and each non-terminal node is labelled by a check from the set F. Three edges start in each non-terminal node; these edges are labelled by 0, 1 and ∗ respectively. The functioning of a decision tree Γ on a row of the data table D is defined in the natural way. We will say that the decision tree Γ solves the problem P(D, F) if for any row r of D the computation finishes in a terminal node of Γ which is labelled by the subset C(r) of the set C. The depth of a decision tree is the maximal length of a path from the root to a terminal node of the tree. We denote by h(Γ) the depth of a decision tree Γ, and by h(D, F) the minimal depth of a decision tree with checks from F which solves the problem P(D, F).
4 Decision Tables
We will assume that the information about the problem P (D, F ) is represented in the form of a decision table T = T (D, F ). The table T has k columns corresponding to the checks f1 , . . . , fk and q rows corresponding to the equivalence classes S1 , . . . , Sq . The value fj (ri ) is on the intersection of a row Si and a column fj , where ri is an arbitrary row from the equivalence class Si . For i = 1, . . . , q the row Si of the table T is labelled by the subset C(Si ) of the set C. The table T will be called degenerate if it has no rows or all rows from T are labelled by the same subset of the set C. We denote by R(T ) the number of unordered pairs of rows from T , which are labelled by different subsets from C. Let i ∈ {1, . . . , k} and δ ∈ E. We denote by T (i, δ) the subtable of the table T that consists of rows each of which on the intersection with column fi has element δ.
5 Algorithm U for Decision Tree Construction
For a decision table T = T(D, F) we construct a decision tree U(T), which solves the problem P(D, F). We begin the construction from the tree that consists of one node v, which is not labelled. If T has no rows then we finish the construction. If T has rows and all rows are labelled by the same subset of C, then we mark the node v by this subset and finish the construction. Let T have rows which are labelled by different subsets of the set C. For i = 1, . . . , k we compute the value Q_i = max{R(T(i, δ)) : δ ∈ E}. We mark the node v by the check f_{i0}, where i0 is the minimal i for which Q_i has the minimal value. For each δ ∈ E we add to the tree a node v(δ), draw the edge from v to v(δ), and mark this edge by the element δ. For the node v(δ) we perform the same operations as for the node v, but with the table T(i0, δ) instead of the table T, etc.
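A direct transcription of this greedy step into Python might look as follows. This is a sketch under our own data representation (not the author's implementation), and it assumes, as in the paper's construction, that differently labelled rows of T differ on at least one check, which guarantees termination.

```python
from collections import Counter

def build_tree(rows, num_checks):
    """Greedy construction of the decision tree U(T) described above.

    rows: list of (values, label) pairs, where values is a string over
    {'0', '1', '*'} giving the check values of one row of T.
    """
    def R(rs):  # number of unordered pairs of rows with different labels
        counts = Counter(lab for _, lab in rs)
        n = len(rs)
        return (n * (n - 1) - sum(m * (m - 1) for m in counts.values())) // 2

    if not rows:
        return None                              # unlabelled terminal node
    labels = {lab for _, lab in rows}
    if len(labels) == 1:
        return labels.pop()                      # terminal node, one label
    # choose the check minimizing Q_i = max over delta of R(T(i, delta));
    # min() returns the smallest index among ties, as the paper requires
    def Q(i):
        return max(R([r for r in rows if r[0][i] == d]) for d in "01*")
    i0 = min(range(num_checks), key=Q)
    return (i0, {d: build_tree([r for r in rows if r[0][i0] == d], num_checks)
                 for d in "01*"})
```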
6 Bounds on Algorithm U Precision
If T is a degenerate table then the decision tree U(T) consists of one node, and the depth of this tree is equal to 0. Consider now the case when T is a nondegenerate table.
Theorem 1. Let the decision table T = T(D, F) be nondegenerate. Then h(U(T)) ≤ h(D, F) ln R(T) + 1.
Note that R(T) is at most the square of the number of rows in the data table D. One can show that the algorithm U has polynomial time complexity. Using results of U. Feige [2] on the precision of approximate polynomial algorithms for the set covering problem, it is possible to show that if NP ⊄ DTIME(n^O(log log n)) then for any ε, 0 < ε < 1, there is no polynomial algorithm which for a given decision table T = T(D, F) constructs a decision tree Γ such that Γ solves the problem P(D, F) and h(Γ) ≤ (1 − ε) h(D, F) ln R(T). Using Theorem 1 we obtain that if NP ⊄ DTIME(n^O(log log n)) then the algorithm U is close to unimprovable approximate polynomial algorithms for minimization of decision tree depth.
References
1. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth & Brooks (1984)
2. Feige, U.: A threshold of ln n for approximating set cover (preliminary version). Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (1996) 314–318
3. Moshkov, M.Ju.: Conditional tests. Edited by S.V. Yablonskii. Problemy Cyberneticy 40. Nauka Publishers, Moscow (1983) 131–170 (in Russian)
Virtual Reality Representation of Information Systems and Decision Rules: An Exploratory Technique for Understanding Data and Knowledge Structure Julio J. Valdés National Research Council of Canada Institute for Information Technology 1200 Montreal Road, Ottawa ON K1A 0R6, Canada [email protected]
Abstract. This paper introduces a virtual reality technique for visual data mining on heterogeneous information systems. The method is based on parametrized mappings between heterogeneous spaces with extended information systems and a virtual reality space. The mappings can also be constructed for unions of heterogeneous and incomplete data sets together with knowledge bases composed of decision rules. This approach has been applied successfully to a wide variety of real-world domains, and examples are presented from genomic research and geology.
1 Introduction
In this paper a Virtual Reality approach is introduced for the problem of understanding heterogeneous, incomplete and imprecise data [9]. The notion of data is not restricted to classical databases, but also covers logical relations and other forms of structured knowledge. Examples are decision rules generated by inductive methods [7], rough set algorithms [6], and others. The role of visualization techniques in the knowledge discovery process is well known. Several reasons make Virtual Reality (VR) a suitable paradigm: it is flexible (it allows the choice of different representation models according to human perception preferences), it allows immersion (the user can navigate inside the data, interact with the objects, etc.), it creates a living experience, and it is broad and deep (the user may see the whole world and/or concentrate on specific details). Moreover, the user needs no mathematical knowledge and only minimal computer skills.
2 The Virtual Reality Space
In the present case, heterogeneous and incomplete information systems will be considered [10]. They have the form S = <U, A>, where U is the universe and A the set of attributes, such that each a ∈ A has a domain Va and an evaluation function fa, but here the Va are not required to be finite (Table 1).
Table 1. An example of a heterogeneous data base. Attributes are from different domains (nominal, ordinal, ratio, fuzzy, images, time-series and graphs) and also contain missing values (?).
A heterogeneous domain is defined as a Cartesian product of a collection of information source sets (Ψi): Ĥn = Ψ1 × · · · × Ψn, where n > 0 is the number of sources to consider. As an example, consider the case of a domain where objects are characterized by attributes given by continuous crisp quantities, discrete features, fuzzy features, graphs and digital images. Let ℜ be the reals with the usual ordering, and R ⊆ ℜ. Now define R̂ = R ∪ {?} to be a source set and extend the ordering relation to a partial order accordingly. Now let N be the set of natural numbers and consider a family of nr sets (nr ∈ N+ = N − {0}) given by R̂^nr = R̂1 × · · · × R̂nr (nr times), where each R̂j (0 ≤ j ≤ nr) is constructed as R̂, and define R̂^0 = ∅ (the empty set). Now, if Oj is a family of ordinal source sets (with the corresponding ordering relation), Nj a family of nominal variables, Fj a collection of fuzzy sets, Gj of graphs, and Ij of digital images, and the same procedure is applied, a heterogeneous domain is constructed as Ĥn = R̂^nr × Ô^no × N̂^nm × F̂^nf × Ĝ^ng × Î^ni. Other kinds of heterogeneous domains can be constructed in the same way, using the appropriate source sets. In more general information systems the universe is endowed with a set of relations of different arities. Let t = <t1, ..., tp> be a sequence of p natural numbers, called a type, and Γ = <γ1, ..., γp> a relational system of that type; the extended information system will be Ŝ = <U, A, Γ>, endowed with the relational system U = <U, Γ>.
A virtual reality space is a structure composed of different sets and functions, defined as Υ = <O, G, B, ℜ^m, go, l, gr, b, r>. O is a relational structure defined as above (O = <O, Γ^v>, Γ^v = <γ1^v, ..., γq^v>, q ∈ N+, and the o ∈ O are objects), G is a non-empty set of geometries representing the different objects and relations (the empty or invisible geometry is a possible one), and B is a non-empty set of behaviors (i.e. ways in which the objects from the virtual world express themselves: movement, response to stimulus, etc.). ℜ^m ⊂ R^m is a metric space of dimension m (the actual virtual reality geometric space). The other elements are mappings: go : O → G, l : O → ℜ^m, gr : Γ^v → G, b : O → B, and r is a collection of characteristic functions for Γ^v, (r1, ..., rq), s.t. ri : (γi^v)^ti → {0, 1}, according to the type t associated with Γ^v. The representation of an extended
information system Ŝ in a virtual world requires the specification of several sets and a collection of extra mappings: Ŝ^v = <O, A^v, Γ^v>, with O as in Υ, which can be done in many ways. A desideratum for Ŝ^v is to keep as many properties of Ŝ as possible. Thus, a requirement is that U and O are in one-to-one correspondence (with a mapping ξ : U → O). The structural link is given by a mapping f : Ĥn → ℜ^m. If u is described by <fa1(u), ..., fan(u)> and ξ(u) = o, then l(o) = f(<fa1(u), ..., fan(u)>) = <fa1^v(o), ..., fam^v(o)> (the fai^v are the evaluation functions of A^v). It is natural to require that Γ^v ⊆ Γ, thus having a virtual world portraying selected relations from the information system. The function f can be constructed so as to maximize some metric/non-metric structure preservation criterion, as is typical in multidimensional scaling [1], or to minimize some error measure of information loss [8], [4].
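As a concrete illustration, the layout mapping l can be obtained by gradient descent on the Sammon error. The sketch below assumes a precomputed dissimilarity matrix delta (for instance δij = (1 − ŝij)/ŝij from Gower similarity, as used in the next section); the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def sammon(delta, m=3, iters=500, lr=0.05, eps=1e-12):
    """Gradient descent on the Sammon stress
    E = (1/c) * sum_{i<j} (delta_ij - d_ij)^2 / delta_ij."""
    n = delta.shape[0]
    rng = np.random.default_rng(0)
    y = rng.normal(size=(n, m))                      # initial VR positions
    c = delta[np.triu_indices(n, 1)].sum() + eps     # normalizing constant
    for _ in range(iters):
        diff = y[:, None, :] - y[None, :, :]         # y_i - y_j, shape (n, n, m)
        d = np.sqrt((diff ** 2).sum(axis=-1)) + eps  # current pairwise distances
        coef = (delta - d) / (d * (delta + eps))     # per-pair gradient factor
        np.fill_diagonal(coef, 0.0)
        grad = -(2.0 / c) * (coef[:, :, None] * diff).sum(axis=1)
        y -= lr * grad                               # descend the stress
    return y
```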
3 Examples
Clearly, a VR environment cannot be shown on paper, and only simplified, grey-level screen snapshots from two examples are shown, just to give an idea. The VR spaces were kept simple in terms of the geometries used. The f transform used was the Sammon error, with ζij given by the Euclidean distance in Υ and δij = (1 − ŝij)/ŝij, where ŝij is Gower's similarity [3]. For genomic research in neurology, time-varying expression data for 2611 genes in 8 time attributes were measured. Fig. 1(a) shows the representation in Υ of the information system and the result of a previous rough k-means clustering [5]. Besides showing that there is no differentiated class structure in this data, the undecidable region between the two enforced classes is perfectly clear. The rough clustering parameters were k = 2, ωlower = 0.9, ωupper = 0.1 and threshold = 1. The small cluster encircled at the upper right contains a set of genes discovered when examining the VR space, and it was considered interesting by the domain experts.
Fig. 1. VR spaces of (a) a genomic data base (with rough clusters), and (b) a geologic data base with decision rules built with rough set methods.
This pattern remained unnoticed since it was masked by the clustering procedure (its objects were assigned to the nearby bigger cluster). When data sets and decision rules are combined, the information systems are of the form S = <U, A ∪ {d}>, Sr = <R, A ∪ {d}> (for the rules), where {d} is the decision attribute. Decision rules are of the form ⋀_{i=1..p} (Aτi = v^τi_ηi) → (d = v^d_j), where the Aτi ⊆ A, the v^τi_ηi ∈ Vτi and v^d_j ∈ Vd. The ŝij used for δij was given by ŝij = (Σ_{a∈Ă} ω^a_ij · s^a_ij) / (Σ_{a∈Ă} ω^a_ij), where Ă = Au if i, j ∈ U, Ă = Ar if i, j ∈ R, and Ă = Au ∩ Ar if i ∈ U and j ∈ R. The s and ω functions are defined as: s^a_ij = 1 if fa(i) = fa(j) and 0 otherwise; ω^a_ij = 1 if fa(i), fa(j) ≠ ?, and 0 otherwise. The example presented is the geo data set [2]. The last attribute was considered the decision attribute and the rules correspond to the very fast strategy, giving 99% accuracy. The joint VR space is shown in Fig. 1(b), where objects are spheres and rules are cubes, respectively. According to RSL results, Rule 570 is supported by object 173, and they appear very close in Υ. Also, data objects 195 and 294 are very similar and they appear very close in Υ.
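A sketch of this combined similarity follows, assuming records are given as dictionaries mapping attribute names to values, with None standing for the missing value '?'; the function names are illustrative, and the intersection Au ∩ Ar for mixed object/rule pairs follows the reconstruction above.

```python
def gower_similarity(x, y, attrs):
    """s_ij = (sum over attrs of w * s) / (sum over attrs of w), with s = 1 iff
    the values match and w = 1 iff both values are present (not '?')."""
    num = den = 0
    for a in attrs:
        xa, ya = x.get(a), y.get(a)
        w = 1 if (xa is not None and ya is not None) else 0
        s = 1 if (w and xa == ya) else 0
        num += w * s
        den += w
    return num / den if den else 0.0

def dissimilarity(x, y, attrs, eps=1e-9):
    """delta_ij = (1 - s_ij) / s_ij, as fed to the Sammon mapping."""
    s = max(gower_similarity(x, y, attrs), eps)
    return (1.0 - s) / s
```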
References
1. Borg, I., Lingoes, J.: Multidimensional Similarity Structure Analysis. Springer-Verlag (1987)
2. Gawrys, M., Sienkiewicz, J.: Rough Set Library User's Manual (version 2.0). Inst. of Computer Science, Warsaw Univ. of Technology (1993)
3. Gower, J.C.: A General Coefficient of Similarity and Some of its Properties. Biometrics, Vol. 1, No. 27 (1973) pp. 857–871
4. Jianchang, M., Jain, A.: Artificial Neural Networks for Feature Extraction and Multivariate Data Projection. IEEE Trans. on Neural Networks, Vol. 6, No. 2 (1995) pp. 296–317
5. Lingras, P., Yao, Y.: Time Complexity of Rough Clustering: GAs versus K-Means. Third Int. Conf. on Rough Sets and Current Trends in Computing, RSCTC 2002, Malvern, PA, USA, Oct 14–17. Alpigini, Peters, Skowron, Zhong (Eds.), LNCS 2475 (LNAI series), pp. 279–288. Springer-Verlag (2002)
6. Pawlak, Z.: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht, Netherlands (1991)
7. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning (1992)
8. Sammon, J.W.: A Non-Linear Mapping for Data Structure Analysis. IEEE Trans. Computers, C-18 (1969) 401–408
9. Valdés, J.J.: Virtual Reality Representation of Relational Systems and Decision Rules: An Exploratory Tool for Understanding Data Structure. In TARSKI: Theory and Application of Relational Structures as Knowledge Instruments. Meeting of the COST Action 274, Book of Abstracts. Prague, Nov. 14–16 (2002)
10. Valdés, J.J.: Similarity-Based Heterogeneous Neurons in the Context of General Observational Models. Neural Network World, Vol. 12, No. 5 (2002) pp. 499–508
Hierarchical Clustering Algorithm Based on Neighborhood-Linked in Large Spatial Databases Yi-hong Dong Department of Computer Science, Ningbo University, Ningbo 315211, China [email protected]
Abstract. A novel hierarchical clustering algorithm based on neighborhood links is proposed in this paper. Unlike traditional hierarchical clustering algorithms, the new model adopts only two steps: primary clustering and merging. The algorithm can be performed on high-dimensional data sets and can find clusters of arbitrary shape. Furthermore, the algorithm can handle not only data with numeric attributes, but also data with boolean and categorical attributes. The results of our experimental study on data sets with arbitrary shape and size are very encouraging. We also conduct an experimental study with web log files that can help us discover user access patterns effectively. Our study shows that this algorithm generates better-quality clusters than traditional algorithms, and scales well for large spatial databases.
1 Introduction
A traditional hierarchical method [1] is made up of partitioning methods in different layers, using iterative partitioning between layers. The layers are constructed by merge or split approaches. Once a group of objects is merged or split, the process at the next step operates on the newly generated cluster. It will neither undo what was done previously, nor perform object swapping between clusters. BIRCH does not perform well on clusters that are not spherical in shape, while CURE cannot deal with data with boolean or categorical attributes. We suggest a new clustering algorithm, the HIerarchical Clustering Algorithm based on NEighborhood-Linked (Hicanel), which can find clusters of arbitrary shape and size and can handle data sets with both boolean and categorical attributes. It is a fast and highly efficient clustering algorithm.
2 Hierarchical Algorithm Based on Neighborhood-Linked
Compared with common hierarchical methods, the Hicanel method uses only two steps: it first clusters the objects in the data set by a partitioning method to get several sub-clusters, and then merges the sub-clusters that are compactly linked. This merging step is a one-step process rather than a layer-by-layer one.
Fig. 1. An Example
We can easily detect clusters of points when looking at the sample sets of points depicted in Figure 1a. In our method, we regard a cluster of arbitrary shape as the combination of many sub-clusters. The data set is divided into sub-clusters of similar size (Figure 1b), whose combination is the cluster we want to discover. Either convex or concave shapes can be regarded as combinations of many sub-clusters, so we can distinguish clusters of arbitrary shape and size. In conclusion, we divide the data set into many sub-clusters, then merge them into clusters. It is easy to find sub-clusters, but the key to getting the clusters is to merge the sub-clusters effectively and efficiently. The proposed Hicanel algorithm merges the compact sub-clusters after analyzing the neighborhoods of, and the links among, the sub-clusters.
Definition 1 (ε-neighborhood of a point): The ε-neighborhood of a point p is the area with the point p as center and r as radius, denoted by Neg(p). For data with numeric attributes it is defined by Neg(p) = {q ∈ D | dist(p,q) ≤ r}, where dist(p,q) is the Euclidean distance; for data with categorical attributes it is defined by Neg(p) = {q ∈ D | sim(p,q) ≥ r}, where sim(p,q) is the Jaccard coefficient.
Definition 2 (Link): A point x is a link between the neighborhood of p and the neighborhood of q if and only if x is not only in the neighborhood of point p but also in that of point q. We use the notation link(p,q) for this relationship: link(p,q) = {x | x ∈ Neg(p), x ∈ Neg(q)}.
After the primary clustering, the whole data set can be well described by a joint constitutive graph of links, as follows: a node represents the centroid of a sub-cluster, an edge represents a link between sub-clusters, and a weight represents the number of links between the sub-clusters. The database is rescanned once more to judge the distance between every object and each centroid of the sub-clusters; if an object lies in several ε-neighborhoods, the weights between those nodes are incremented. The graph construction is finished after the whole data set has been rescanned. After every object has been scanned, we cut the graph: when the cutting finishes, an unconnected graph is formed and its connected components, each of which represents a cluster, are the clustering result. Figure 1c is the joint constitutive graph of the sub-clusters for the data sets of Figure 1a after the primary clustering. We can easily discover that nodes O1, O2, ..., O8 form one cluster and that O9 and O10 belong to the other. The unsupervised clustering algorithm [2] (UC), which identifies clusters of similar shape and size, is used in our algorithm for the primary clustering that searches for sub-clusters. The total time complexity is O(2nk + nk² + k²), including the complexity of the UC algorithm; k is regarded as a constant owing to k ≪ n.
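A minimal sketch of the rescan-and-merge step follows, for the numeric case only. The names (merge_subclusters, min_weight) and the choice of a weight threshold for cutting the graph are illustrative assumptions, not fixed by the paper.

```python
import numpy as np

def merge_subclusters(points, centroids, r, min_weight=1):
    """Build the link graph: each point adds weight to every pair of
    sub-cluster neighborhoods it falls into; connected components of the
    edges surviving the cut are the final clusters."""
    k = len(centroids)
    weight = np.zeros((k, k), dtype=int)
    for p in points:                                   # one rescan of the data
        inside = [i for i, c in enumerate(centroids)
                  if np.linalg.norm(p - c) <= r]       # neighborhoods holding p
        for a in range(len(inside)):
            for b in range(a + 1, len(inside)):
                weight[inside[a], inside[b]] += 1
                weight[inside[b], inside[a]] += 1
    # cut weak edges, then take connected components via union-find
    parent = list(range(k))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(k):
        for j in range(i + 1, k):
            if weight[i, j] >= min_weight:
                parent[find(i)] = find(j)
    return [find(i) for i in range(k)]   # cluster id for each sub-cluster
```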
3 Experimental Results
Suitable for multi-dimensional data sets, Hicanel can handle not only data with numeric attributes but also data with boolean and categorical attributes. All experiments were run on a PIV 1.6G machine with 256M of RAM, running VC++ 6.0.
Fig. 2. Results of the Two Algorithms on the Same Sets
Data Set with Numeric Attributes: We experimented with the data set illustrated in Figure 2a, which consists of some irregular figures containing concave-shaped clusters. Figures 2b and 2c show the clusters found by k-means and Hicanel, respectively. As expected, k-means cannot distinguish the clusters with concave shape. In contrast, the Hicanel algorithm successfully discovers the clusters with the default parameters. Web Log Mining: Web log mining is finding user access patterns in web logs. The experimental data set used was obtained from "Ningbo Online" (www.nbol.net). The server logs were approximately 52M in size, taken in June 2002. After preprocessing the log files and generalizing the data by attribute-oriented induction [3], Table 1 shows the results of the Hicanel algorithm on five data sets. Some significant clusters were found after analyzing the effective clusters in the result, e.g., groups interested in real estate, groups interested in food and beverage, etc. Table 1. Data sets in the experiment
Data set   Record   Session   Page   Effective clusters
10M        72757    690       110    7
20M        144433   1391      184    11
30M        214792   2058      210    22
40M        286793   2782      239    35
50M        354763   3410      244    43
4 Conclusion
In this paper, we proposed a novel method of hierarchical clustering for large databases. It can handle not only data with numeric attributes but also data with boolean and categorical attributes.
The results of our experimental study on data sets with arbitrary shape and size are very encouraging. Furthermore, there is a significant application in web mining, where the method can help us discover user access patterns effectively.
References
1. Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann (2000)
2. Li Xiao-Li, Liu Ji-Min, Shi Zhong-Zhi: A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering. Chinese J. Computers 24(1) (2001)
3. Y. Cai, N. Cercone, and J. Han: Attribute-Oriented Induction in Relational Databases. In G. Piatetsky-Shapiro and W.J. Frawley (eds.), Knowledge Discovery in Databases. Cambridge, MA: AAAI/MIT Press (1991)
Unsupervised Learning of Pattern Templates from Unannotated Corpora for Proper Noun Extraction Seung-Shik Kang and Chong-Woo Woo School of Computer Science, Kookmin University & AITrc, Seoul 136-702, Korea {sskang, cwwoo}@kookmin.ac.kr, http://nlp.kookmin.ac.kr/index.html
Abstract. This paper describes an approach to extracting proper nouns from very large text corpora without using a lexicon or cue-word dictionary. We first train patterns for extracting proper nouns by applying initial proper names to the unannotated, untagged corpora. We then repeatedly apply the pattern templates to the corpora in order to extract new proper nouns, up to a certain number of iterations.1
1 Introduction As the information in our recent society increases explosively, the previous information search system tends to extract many duplicated search results. Extracting the desirable outcomes from the search results would be very difficult and also very timeconsuming work. The Information Extraction (IE) is such a research area that extracts predefined subjects or interested results from the pool of extremely large set of information. The first step for starting the IE is an extraction of the Named Entities, which could be the Nouns that can represent the selected document. Among the Nouns, the Proper Noun could represent the contents of the document better than the other vocabularies. Previous work in this area has taken place in the context of the Message Understanding Conferences(MUCs), in which the ‘Named-Entity’ and ‘Coreference’ has been explored [1]. The research about this area largely depends on the annotated corpora and cue word dictionary. The study on extracting the proper nouns began from the Named-Entity contest of MUC-6 and MUC-7 conferences. And another study on extracting the Named-Entity is the Information Retrieval and Extraction Exercise(IREX) workshop in Japan [2]. In Japan, there were approaches to solve this problem; one research extracts the NamedEntity by learning the contextual relationships between the morphemes, and other approach extracts it by studying the cue word in the tagged corpus [3]. Unlike the previous studies, another approach uses the contextual rules and spelling rules in the unannotated corpus [4]. 1
This work was supported by the Korea Science and Engineering Foundation(KOSEF) through the Advanced Information Technology Research Center(AITrc).
2 Proper Noun Learning System
The proper noun learning system of this research is composed of two major components, the 'Pattern Creator' and the 'Named-Entity Finder', as shown in Fig. 1. The 'Pattern Creator' creates pattern templates by using the named-entity collection, and the 'Named-Entity Finder' finds proper nouns in the text corpora by using the pattern collection created by the 'Pattern Creator'. The 'Pattern Creator' creates pattern templates from the morphologically analyzed corpus by using some proper nouns that are given as initial seeds. The created patterns in turn extract proper nouns from the same corpus. If the extracted proper nouns do not belong to the named-entity set, they are added as new proper nouns, and the named-entity set is expanded. The expanded set further expands the pattern templates by creating and adding new patterns. The process repeats until no more patterns appear, or continues for a pre-determined number of iterations. After the pattern-creation learning is done, the system extracts all the proper nouns found in the text corpora by using the final pattern set.
Fig. 1. The structure of the proper noun learning system
The pattern creator module basically adopts a model used in named-entity chunking/tagging and considers preceding or subsequent morphemes as context information. When a word phrase W0 is found in the text whose root matches a word that belongs to the named-entity set, the module uses the context information represented by the root of the preceding word phrase W-1 and the subsequent word phrase W1 of the word phrase W0. When the root of a word phrase is not a noun or verb (i.e., it is an exclamation, adverb, or adjective), we do not use it as context information, since it cannot carry important information about the text. The pattern creation rules are summarized as follows. (1) From the text, find the word phrase W0 whose root matches a named entity in the named-entity set. (2) Examine whether the root of the preceding word phrase W-1 of the matched word phrase W0 is a noun. If the root is a noun, then the root of W-1 is used as the preceding context information of the pattern; otherwise, the preceding context information of the pattern has a null value. (3) Examine whether the root of the subsequent word phrase W1 of the matched word phrase W0 is a noun or verb. If the root is a noun or a verb, the root of W1 is used as the subsequent context information of the pattern; otherwise, the subsequent context information of the pattern has a null value. (4) Create a pattern only when the root of the matched word phrase W0 is unregistered. (5)
Add the created pattern to the pattern set (do not add the pattern if it already exists in the pattern set). The Named-Entity Finder extracts person names as follows. (1) Examine whether a word phrase in the text has preceding/subsequent word phrases whose roots match the preceding/subsequent context information of a pattern template. (2) If one of the pattern templates is matched and the noun phrase is unregistered, we regard it as a proper noun and add it to the named-entity set. For the annotation of proper nouns, the NE tagging system annotates proper nouns in the text corpora by using the automatically created pattern templates. In a pattern template, a named entity is described as a single word, so a multi-word proper noun must be combined into a noun phrase. For example, when a person name consists of two or more nouns, our system combines them into one word by NP chunking for proper nouns. This process satisfies the singleton limitation of the pattern description rules.
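A minimal sketch of one bootstrapping pass follows, assuming sentences are given as lists of (root, part-of-speech) pairs. The function names and the (left-root, right-root) pattern shape are illustrative simplifications of the rules above.

```python
def create_patterns(sentences, named_entities):
    """A pattern is (preceding root or None, following root or None),
    built around each seed occurrence; all-null patterns are useless."""
    patterns = set()
    for sent in sentences:
        for i, (root, _pos) in enumerate(sent):
            if root not in named_entities:
                continue
            prev = sent[i - 1] if i > 0 else None
            nxt = sent[i + 1] if i + 1 < len(sent) else None
            left = prev[0] if prev and prev[1] == "noun" else None
            right = nxt[0] if nxt and nxt[1] in ("noun", "verb") else None
            patterns.add((left, right))
    return patterns

def find_entities(sentences, patterns, known):
    """Unregistered roots whose context matches a non-empty pattern
    become new proper-noun candidates."""
    found = set()
    for sent in sentences:
        for i, (root, _pos) in enumerate(sent):
            if root in known:
                continue
            left = sent[i - 1][0] if i > 0 else None
            right = sent[i + 1][0] if i + 1 < len(sent) else None
            if any((l is None or l == left) and (r is None or r == right)
                   for l, r in patterns if (l, r) != (None, None)):
                found.add(root)
    return found
```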
3 Experiments and Results
In order to create patterns and extract proper nouns, all the sentences in the text corpora must go through morphological analysis. Our system used the morphological analyzer HAM, version 5.0.0a [5]. The size of the text corpora in this experiment is 520K, parts of which were borrowed from the Korean balanced corpus 'Sejong Test Collection 1998'. For the initial proper names, we selected well-known person names that appear frequently in the text corpora. In this experiment, we used 2, 10, and 20 initial person names of the named-entity type 'PERSON'. We then ran the named-entity extractor until no more new proper nouns appeared. In the pattern learning process, we automatically removed patterns that had no information on either side, and we also removed patterns that included a common noun instead of a proper noun. Fig. 2 shows the precision rates at which the extracted patterns find proper nouns, according to the learning iteration.
Fig. 2. Precision rates over the repeated learning iterations
Generally, the created patterns show precision rates above 93% for extracting proper nouns.
The reason for the higher precision rate when using 20 rather than 10 initial proper nouns is that the number of correctly extracted proper nouns grows faster over the iterations than the number of wrong proper nouns. Unlike the high precision rate, the recall rate shows different outcomes depending on the initial proper nouns given as input. The total number of person names in the text used in this experiment is about 500, and the number of total appearances is about 1,273. With 2 to 20 initial person names, after 3 to 4 iterations we could find about 84 to 100 person names, which is about 20%. The reason is that the initially used person names come from a specific domain, while the text, extracted randomly from the balanced corpus, covers diverse domains. For example, at the beginning of our experiment we used some well-known person names like 'Kim Dae-Jung' and 'Clinton', and we could find most of the politically related person names, such as 'Kim Young-Sam', 'Park Chung-Hee', 'Yeltsin', 'Gorbachov', and so on; but we could not find person names from other domains, such as 'Shakespeare' or 'Hemingway'. Therefore, if an extracted proper noun appears with high frequency in the text, the recall rate becomes higher, but when the frequency is low, the recall rate becomes low.
4 Conclusion
This paper proposed a pattern learning method for extracting named entities, namely the proper nouns in a text, to be utilized for Information Extraction. The experiment was performed under the condition that person names are not registered in the dictionary, and we extracted person names by learning NE patterns composed of the preceding and subsequent word patterns. The results showed a 93% precision rate, and the learned patterns alone could extract not only domestic but also foreign person names. However, the approach is limited to extracting person names from the domains covered by the initially used person names.
References
1. MUC: Proc. of the 7th Message Understanding Conference (MUC-7) (1998)
2. Borthwick, A.: A Japanese Named Entity Recognizer Constructed by a Non-speaker of Japanese. In Proc. of the IREX Workshop (1999) 187–193
3. Yangaber, R., Lin, W., and Grishman, R.: Unsupervised Learning of Generalized Names. In Proc. of the 19th International Conference on Computational Linguistics (2002) 1135–1141
4. Stevenson, M. and Gaizauskas, R.: Improving Named Entity Recognition Using Annotated Corpora. LREC Workshop on Information Extraction Meets Corpus Linguistics (2000)
5. Kang, S.: Korean Morphological Analyzer. http://nlp.kookmin.ac.kr/ (2000)
Approximate Aggregate Queries with Guaranteed Error Bounds Seok-Ju Chun1, Ju-Hong Lee2, and Seok-Lyong Lee3 1
Department of Internet Information, Ansan College 752, Il-Dong, SangRok-Ku, Ansan, Korea [email protected] 2 School of Computer Science and Engineering, Inha University 253, YongHyun-dong, Nam-Ku, Inchon, Korea [email protected] 3 School of Industrial and Information System Engineering Hankuk University of Foreign Studies, Yongin-si, Kyounggi-do, Korea [email protected] Abstract. It is very important to provide analysts with guaranteed error bounds for approximate aggregate queries in many current enterprise applications such as the decision support systems. In this paper, we propose a general technique to provide tight error bounds for approximate results to OLAP range-sum queries. We perform an extensive experiment on diverse data sets, and examine the effectiveness of our proposed method with respect to various dimensions of the data cube and query sizes.
1 Introduction
The pCube [5] and the MRA-tree [4] are solutions that provide progressive refinement of an approximate answer with absolute error bounds. The pCube provides early feedback with absolute error bounds for queries on data cubes; however, the approach provides loose error bounds for approximate results to OLAP range-sum queries because it uses only a trivial bounding technique. The MRA-tree also uses a tree structure to get a quick response while weakening the requirement to return an exact answer. The MRA-tree produces a better answer to the query in a shorter amount of time compared to the pCube; however, the bounding technique used in this approach is the same as that used in the pCube. Recently, we proposed the '∆-tree' [2] to manage updates efficiently in a dynamic OLAP environment; it drastically reduces the update cost at run-time. In addition, by taking advantage of the hierarchical structure of the ∆-tree, we proposed a hybrid method that provides either an approximate result or a precise one, to reduce the overall cost of queries. In this paper, we propose a general technique to provide tight error bounds for approximate results to OLAP range-sum queries. The proposed technique is very effective and directly applicable to various approximation techniques that use a tree structure, such as the pCube and the MRA-tree. We conduct an extensive experiment on diverse data sets, and examine the effectiveness of our proposed method with respect to various dimensions of the data cube and query sizes.
2 Tree-Based Index Structure: The ∆-Tree
The ∆-tree is a modified version of the R*-tree [1] designed to store the updated values of a data cube and to support efficient query processing. Whenever a data cube cell is updated, the difference (∆) between the new and old values of the cell and its spatial position are stored in the ∆-tree. We define the ∆-tree formally as follows:
Definition 2.1 (the ∆-tree)
1. A directory node contains (L1, L2, ..., Ln), where Li is the tuple about the i'th child node Ci and has the form (Σ∆, M, cpi, MBRi). Σ∆ is the sum of the Σ∆ values of the children nodes (∆ values) of Ci when Ci is a directory node (data node). cpi is the address of Ci and MBRi is the MBR enclosing all entries in Ci. M has the form (µ1, µ2, ..., µd), where d is the dimension and µj is the weighted mean position of the j'th dimension of MBRi, defined as follows. Let f(k1, k2, ..., kd) be the value of an updated position (k1, k2, ..., kd) in MBRi, with 1 ≤ kj ≤ nj, where nj is the number of cells of the j'th dimension of MBRi. For 1 ≤ m ≤ nj, let
Gj(m) = Σ_{k1=1}^{n1} ... Σ_{kj=1}^{m} ... Σ_{kd=1}^{nd} f(k1, ..., kj, ..., kd).
Let a be the largest m such that Gj(a) ≤ Gj(nj)/2, and b the smallest m such that Gj(b) ≥ Gj(nj)/2. Then µj = (a + b)/2. That is, µj divides the hyper-space MBRi such that each subspace has about a half of the sum of the updated values in MBRi.
2. A data node is at level 0 and contains (D1, D2, ..., Dn), where Di is the tuple about the i'th data entry and has the form (Pi, ∆i). Pi is the position index and ∆i is the difference of the changed cell.
Ho et al. [3] presented an elegant algorithm for computing range-sum queries in data cubes, which we call the prefix sum (PS) approach. The approach uses an additional cube, called the prefix sum cube (PC), to store the cumulative sums of the data. The essential idea of this approach is to pre-compute many prefix sums of the data cube, which can be used to answer ad hoc queries at run-time. Each cell of the PC contains the sum of all cells up to and including itself in the data cube. We use both the PC and the ∆-tree in order to answer a range-sum query Q. Let sum^Q be a function that returns the answer of Q, sum^Q_PC be a function that returns the answer calculated from the PC, and sum^Q_∆ be a function that returns the answer found from the ∆-tree. Then the answer is sum^Q = sum^Q_PC + sum^Q_∆. When processing a range-sum query, we can also obtain an approximate answer by searching the ∆-tree partially; that is, searching is performed from the root down to an internal node at level l instead of a leaf node. Let Σ∆+ and Σ∆− be the sum of the positive ∆ values and the sum of the negative ∆ values of a node MBR of the ∆-tree, respectively. The upper and lower bounds for an approximate answer sum^Q at level l of the ∆-tree may be formally defined as:
LB^Q(l) = sum^Q_PC + Σ_{i=1}^{m} (Σ∆)_i + Σ_{j=m+1}^{n} (Σ∆−)_j
UB^Q(l) = sum^Q_PC + Σ_{i=1}^{m} (Σ∆)_i + Σ_{j=m+1}^{n} (Σ∆+)_j
where nodes 1, ..., m are fully contained in the query region and nodes m+1, ..., n only partially intersect it. As the search proceeds from the root toward the leaf nodes, the query results with error bounds are progressively refined and returned to the user. Thus, the user can stop the query processing when the stopping criteria (time and/or error bound) are satisfied.
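To make the sum^Q_PC term concrete, the sketch below shows the 2-dimensional case of the prefix sum cube and the inclusion-exclusion range sum it enables; the ∆-tree correction sum^Q_∆ is left abstract. Names are illustrative assumptions.

```python
import numpy as np

def build_pc(cube):
    """PC[i, j] = sum of cube[0..i, 0..j]."""
    return cube.cumsum(axis=0).cumsum(axis=1)

def range_sum_pc(pc, lo, hi):
    """Sum over rows lo[0]..hi[0] and cols lo[1]..hi[1], inclusive,
    via inclusion-exclusion on the prefix sum cube."""
    r1, c1 = lo
    r2, c2 = hi
    total = pc[r2, c2]
    if r1 > 0:
        total -= pc[r1 - 1, c2]
    if c1 > 0:
        total -= pc[r2, c1 - 1]
    if r1 > 0 and c1 > 0:
        total += pc[r1 - 1, c1 - 1]
    return total

# sum^Q = sum^Q_PC + sum^Q_Delta: the (possibly stale) prefix-sum answer
# plus the net updates recorded in the Delta-tree for the query region.
```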
3 Tight Bound Technique
In this section, we introduce a method that obtains a tight error bound using the spatial relationship between the query MBR and a node MBR.
Definition 3.1 (Positive and negative half spaces) Let µ_i^(+) be the weighted mean position of the i'th dimension for all positive values of a node MBR, and let µ_i^(−) be the weighted mean position of the i'th dimension for all negative values, where i ∈ D. The MBR is divided into two hyper-spaces by µ_i^(+) of each dimension; we call these two hyper-spaces the positive half spaces of the i'th dimension, denoted by phs_{2i−1} and phs_{2i}, respectively. The MBR is also divided into two hyper-spaces by µ_i^(−) of each dimension; we call these two hyper-spaces the negative half spaces of the i'th dimension, denoted by nhs_{2i−1} and nhs_{2i}, respectively.
Σ∆+ and Σ∆− , if ∃ j such that Vol(MBR ∩ MBR ) Q Q Case 1. UBMBR = = (l ) LB (l ) T Q MBRT T 2 2 ⊂ phs j ∧ ∃ k such that Vol(MBRT ∩ MBRQ) ⊂ nhsk Σ∆+ + Σ∆− and − Q Q Case 2. UBMBR (l ) = LBMBR (l ) =Σ∆ , if ∃ j such that Vol(MBRT ∩ T T 2 2 MBRQ) ⊂ phs j ∧ ∃ k such that Vol(MBRT ∩ MBRQ) ⊃ nhsk + Σ∆− + Σ∆+ , if ∃ j such that Vol(MBR ∩ Q Q Case 3. UBMBR = Σ∆ and = (l ) LB (l ) T MBR T T 2 2 MBRQ) ⊃ phs j ∧ ∃ k such that Vol(MBRT ∩ MBRQ) ⊂ nhsk + Σ∆− and Σ∆+ , if ∃ j such that − Q Q Case 4. UBMBR = Σ∆ + = Σ∆ + (l ) LB (l ) MBR T T 2 2 Vol(MBRT ∩ MBRQ) ⊃ phs j ∧ ∃ k such that Vol(MBRT ∩ MBRQ) ⊃ nhsk
4 Experiment and Conclusion
In order to evaluate the effectiveness of our method, we conducted an extensive experiment on diverse, synthetically generated data sets of various dimensions. Test data sets were generated with two types of distributions: uniform and Zipf. The dimensionalities of the test data are 3, 4, and 5; the cardinality of each dimension is 512 for d = 3, 128 for d = 4, and 64 for d = 5, respectively. The number of data elements inserted into the ∆-tree is 40000. Three types of queries are used based on query size (i.e., query volume / data cube volume): large (= 0.1), medium (= 0.05), and small (= 0.01). Our experimental results indicate that the large query has the best EBR under both the naive and tight bound techniques. The experimental results with Zipf-distributed data indicate that Zipf-distributed data has almost twice the EBR value of uniformly distributed data. Although the MRA-tree in [4] only provided the naive bound, we applied our tight bound technique to the MRA-tree as well; experimental results show that our method provides tighter error bounds than the MRA-tree. We showed that the tight bound technique presented in this paper is applicable to other methods such as the pCube and the MRA-tree and is general and effective across various dimensions of the data cube and query sizes. To our knowledge, this is the first approach to provide tight error bounds for approximate results to OLAP range-sum queries.
References
1. Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B.: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. of ACM SIGMOD Conference (1990) 322–331
2. Chun, S.-J., Chung, C.-W., Lee, J.-H., and Lee, S.-L.: Dynamic Update Cube for Range-Sum Queries. Proc. of VLDB Conference (2001) 521–530
3. Ho, C., Agrawal, R., Megido, N., Srikant, R.: Range Queries in OLAP Data Cubes. Proc. of ACM SIGMOD Conference (1997) 73–88
4. Lazaridis, L. and Mehrotra, S.: Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure. Proc. of ACM SIGMOD Conference (2001) 401–412
5. Riedewald, M., Agrawal, D., and Abbadi, A.E.: pCube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds. Proc. of SSDBM Conference (2000) 95–108
Improving Classification Performance by Combining Multiple TAN Classifiers Hongbo Shi, Zhihai Wang, and Houkuan Huang 1 School of Computer and Information Technology, Northern Jiaotong University, Beijing, 100044, China 2 School of Computer Science and Software Engineering, Monash University, Clayton, Victoria, 3800, Australia
Abstract. Boosting is an effective classifier combination method that can improve the classification performance of an unstable learning algorithm, but it yields little improvement for a stable learning algorithm. TAN, Tree-Augmented Naive Bayes, is a tree-like Bayesian network. The standard TAN learning algorithm generates a stable TAN classifier, whose accuracy is difficult to improve by the boosting technique. In this paper, a new TAN learning algorithm called GTAN and a TAN classifier combination method called Boosting-MultiTAN are presented. In our experimental comparison with the standard TAN classifier, Boosting-MultiTAN shows higher classification accuracy on most data sets. Keywords: Boosting, Combination Method, Classification, Data Mining
1 Introduction
Classification is a fundamental issue in machine learning and data mining. Boosting [1,2,3] is an effective classifier combination method, and stability is a key factor for a boosting algorithm. As is well known, the boosting technique can improve the classification performance of an unstable learning algorithm such as decision tree learning [2,4,5], but it has no obvious effect on a stable learning algorithm such as naive Bayes [8,9], and sometimes it even yields lower classification accuracy. TAN, Tree-Augmented Naive Bayes, is a tree-like Bayesian network that extends the naive Bayes model. The TAN classifier is superior to the naive Bayes classifier because it relaxes the independence assumption of naive Bayes. There are two standard TAN learning approaches [6,7]; however, the TAN classifier generated by these two algorithms is stable, and it is difficult to improve its classification accuracy by simply using the boosting method. In this paper, we propose a new algorithm GTAN (General Tree-Augmented Naive Bayes) and a combination method Boosting-MultiTAN. The GTAN algorithm generates different TAN classifiers through the setting of two special parameters, and Boosting-MultiTAN then builds the combination classifier from these TAN classifiers. The experimental results show that the Boosting-MultiTAN combination classifier has higher classification accuracy than the standard TAN classifier in most cases.
2 TAN Model and GTAN Learning Algorithm
TAN is a natural extension of naive Bayes, combining the simplicity of naive Bayes with the ability of a Bayesian network to express dependences among attributes. TAN satisfies the following conditions: (1) each attribute has the class attribute as a parent; (2) each attribute may have one other attribute as a parent. At present, there are two TAN learning algorithms: the CI-based approach [6] and the accuracy-based approach [7]. These two standard TAN learning algorithms search the space of arcs to select those arcs with the strongest dependence relation between two attributes. In general, TAN classifiers generated by these two algorithms are stable, and it is hard to improve their classification performance by boosting. The GTAN algorithm concentrates on the arcs whose conditional mutual information is larger than a certain threshold, selects some of these arcs in a special way, and then forms a TAN model by adding the selected arcs to naive Bayes. The GTAN algorithm is as follows:
(1) Initialize G = (V, T) to a naive Bayes classifier, V = {X1, X2, ..., Xn}, T = {}. Let E = {e1, e2, ..., e_{n(n−1)/2}} be the edge set of the complete undirected graph, ε be a small threshold, and start-edge be a parameter;
(2) Let k = start-edge;
(3) Compute the conditional mutual information I(Xi; Xj | C) between Xi and Xj, where Xi and Xj are the two nodes of edge ek, i ≠ j, and C is the class node;
(4) If I(Xi; Xj | C) > ε and Xi and Xj have no path between them, then add (Xi, Xj) to T;
(5) If |T| < n − 1 and k < n(n−1)/2, then k = k + 1 and go back to (3); else go to (6);
(6) Transform the resulting undirected tree into a directed one by choosing a root node;
(7) Construct a TAN model by adding the node C and an arc from C to each Xi.
In most cases, the classification performance of a single TAN classifier generated by GTAN is clearly not the best. However, by setting the parameter ε and adjusting the parameter start-edge, this construction yields different TAN classifiers, which reflect the dependence relations among attributes from different aspects.
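A minimal sketch of steps (2)-(5) follows, assuming discrete training data as an integer array; the helper cond_mutual_info estimates I(Xi; Xj | C) from counts, and the union-find bookkeeping stands in for the path check. The names, the count-based estimator, and the 0-based edge indexing are illustrative assumptions.

```python
import itertools
import numpy as np

def cond_mutual_info(x, y, c):
    """I(X; Y | C) estimated from empirical counts."""
    mi = 0.0
    for cv in np.unique(c):
        mask = c == cv
        pc = mask.mean()
        xs, ys = x[mask], y[mask]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = ((xs == xv) & (ys == yv)).mean()
                px, py = (xs == xv).mean(), (ys == yv).mean()
                if pxy > 0:
                    mi += pc * pxy * np.log(pxy / (px * py))
    return mi

def gtan_edges(data, labels, start_edge=0, eps=1e-3):
    """Scan edges from position start_edge; keep an edge if I > eps and it
    creates no cycle, until n - 1 tree edges are selected."""
    n_attr = data.shape[1]
    edges_all = list(itertools.combinations(range(n_attr), 2))
    parent = list(range(n_attr))          # union-find for the path check
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for i, j in edges_all[start_edge:]:
        if len(tree) == n_attr - 1:
            break
        if cond_mutual_info(data[:, i], data[:, j], labels) > eps \
                and find(i) != find(j):
            parent[find(i)] = find(j)
            tree.append((i, j))
    return tree
```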
3 Combination Method of Multiple TAN Classifiers
We regard the different TAN classifiers generated by GTAN as a series of base classifiers. The training of the first base classifier begins with the original training set, while the training of each of the others depends on the performance of its predecessors: instances misclassified by the previous classifiers receive greater probability in the training set of the new classifier. In the end, the combination classifier is built from these base classifiers. The Boosting-MultiTAN algorithm is as follows:
Input: training data set Dataset; the number of base classifiers Num.
Output: the combination classifier of multiple TAN classifiers.
(1) Set TrainData-1 = Dataset; initialize all weights of the training instances, w_i^(1) = 1/N, i = 1, 2, ..., N;
(2) start-edge = 1; t = 1; l = 1;
(3) While ((t ≤ Num) and (l ≤ 2Num))
a) Call GTAN with TrainData-t and start-edge, and construct the t-th TAN classifier TAN^(t) : X → Y;
b) Use TAN^(t) to classify each instance in Dataset and estimate the error rate e^(t) of TAN^(t);
c) If e^(t) > 0.5, go to h);
d) Calculate Beta^(t) = log((1 − e^(t)) / e^(t));
e) Revise the weights for the next sample classifier: w_i^(t+1) = w_i^(t) · e^{Beta^(t) · I(y_i ≠ TAN^(t)(x_i))};
f) Normalize w^(t+1) so that the weights total 1;
g) t = t + 1;
h) l = l + 1; start-edge = start-edge + n/(2Num);
(4) endwhile
Return H(x) = arg max_{y∈Y} ( Σ_{t=1}^{Num} Beta^(t) · I(y = TAN^(t)(x)) )
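The loop translates directly into code. Below is a minimal sketch, assuming a train_tan(X, y, w, start_edge) builder that wraps GTAN with instance weights; the names are illustrative, and the weight update follows step e) above.

```python
import math

def boosting_multitan(X, y, train_tan, num=10):
    n = len(y)
    n_attr = len(X[0])
    w = [1.0 / n] * n                         # step (1)
    models, betas = [], []
    start_edge, t, l = 1, 1, 1                # step (2)
    while t <= num and l <= 2 * num:          # step (3)
        clf = train_tan(X, y, w, start_edge)                 # a)
        wrong = [clf(x) != yi for x, yi in zip(X, y)]
        err = sum(wi for wi, bad in zip(w, wrong) if bad)    # b)
        if 0 < err <= 0.5:                    # c) also skips err == 0 rounds
            beta = math.log((1 - err) / err)                 # d)
            w = [wi * (math.exp(beta) if bad else 1.0)       # e)
                 for wi, bad in zip(w, wrong)]
            s = sum(w)
            w = [wi / s for wi in w]                         # f)
            models.append(clf)
            betas.append(beta)
            t += 1                                           # g)
        l += 1                                               # h)
        start_edge += max(1, n_attr // (2 * num))
    def predict(x):                           # weighted vote, as in H(x)
        votes = {}
        for clf, b in zip(models, betas):
            votes[clf(x)] = votes.get(clf(x), 0.0) + b
        return max(votes, key=votes.get)
    return predict
```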
4 Experimental Methodology and Results
We chose 13 data sets from the UCI machine learning repository for our experiment. A pre-discretization step was applied to the data sets that include continuous attributes, converting continuous attributes into nominal attributes. In data sets with missing values, we regarded the missing value as a single value.

Table 1. Experimental results of comparing the two classifiers

     Domain                     TAN             Boosting-MultiTAN
1    Anneal                     94.2149±0.23    97.7673±0.51
2    Audiology                  65.6195±1.00    76.1726±1.24
3    Car                        91.6001±0.22    87.1065±0.82
4    Contact-lenses             65.8333±4.68    71.9250±4.18
5    House-Votes-84             93.1954±0.32    94.9425±0.46
6    Iris Classification        91.7500±1.47    93.8000±1.61
7    King-Rook-vs-King-Pawn     93.4473±0.12    93.2979±0.39
8    LED                        73.9600±0.24    73.9762±0.72
9    Mushroom                   99.4090±0.03    99.9717±0.03
10   Promoter Gene Sequences    82.9717±3.26    89.3396±2.05
11   Soybean Large              87.3585±0.36    92.5549±0.68
12   Tic-Tac-Toe End Game       74.4394±1.17    74.7185±1.09
13   Zoology                    97.5270±0.84    96.8483±0.96
Our experiments compared the Boosting-MultiTAN combination classifier with a standard TAN classifier [6] in terms of classification accuracy. Classification performance was evaluated by ten-fold cross-validation in all the experiments on each data set. Table 1 shows the classification accuracy and standard deviation of each classifier. Boldface font indicates that the accuracy of Boosting-MultiTAN is higher than that of TAN at a significance level better than 0.05, using a two-tailed pairwise t-test on the results of the 20 trials in a domain. From Table 1, the significant advantage of Boosting-MultiTAN over TAN in terms of higher accuracy can be clearly seen.
5 Conclusion
In this paper, we propose a new TAN learning algorithm GTAN and a TAN classifier combination method Boosting-MultiTAN. The experimental results show that the Boosting-MultiTAN classifier has higher classification accuracy than the standard TAN classifier in most cases. However, if the diversity among the base classifiers is very small, we may not obtain the desired combination classifiers; therefore, classifiers with greater diversity should be chosen as base classifiers. The judgement of diversity between different TAN classifiers is a topic for further discussion.
References
1. Schapire, R.E.: The strength of weak learnability. Machine Learning 5 (1990) 197–227
2. Freund, Y. and Schapire, R.E.: Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann (1996) 148–156
3. Freund, Y.: An adaptive version of the boost by majority algorithm. Proceedings of the Twelfth Annual Conference on Computational Learning Theory (1999)
4. Bauer, E. and Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36(1/2) (1999) 105–139
5. Quinlan, J.R.: Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press (1996) 725–730
6. Friedman, N., Geiger, D., and Goldszmidt, M.: Bayesian network classifiers. Machine Learning 29 (1997) 131–163
7. Keogh, E.J., Pazzani, M.J.: Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In: Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics (1999) 225–230
8. Ting, K.M. and Zheng, Z.: Improving the performance of boosting for naive Bayesian classification. Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99), Berlin: Springer-Verlag (1999) 296–305
9. Zheng, Z.: Naive Bayesian classifier committees. Proceedings of ECML'98, Berlin: Springer-Verlag (1998) 196–207
Image Recognition Using Adaptive Fuzzy Neural Network and Wavelet Transform Huanglin Zeng and Yao Yi Sichuan Institute of Light Industry & Chemical Technology 643033, Zigong, Sichuan, P.R.China
Abstract. We propose the design and implementation, in software and hardware, of an image recognition system that uses features extracted from the wavelet transform (WT) of the image as input to a pattern classifier. The wavelet transform is computed via an adaptive neural network, while the pattern classification is carried out by an adaptive fuzzy neural network. Thus, the system is fully parallel and distributive.
1 Introduction
There are many applications that call for automatic image recognition. Since digital images contain a huge amount of redundant data, recognition is not possible at the pixel level. One has to reduce the raw image data to a feature vector of finite dimension, which can then be used in a pattern classifier. The design of a pattern classifier involves (a) the extraction of a set of meaningful feature vectors and (b) the training of the classifier using known feature characteristics. The degree of success of a recognition task depends largely on the choice of both the image features and the classifier. There is a variety of feature measures that describe an object in an image [1]. From an implementation point of view, it is desirable to have the least dimension for feature vectors. More recently, there has been growing interest in a multi-resolution representation of images called the wavelet transform (WT) [2]. The WT has the property of varying spatial and frequency resolutions, better than the short-time Fourier transform. The WT closely matches the human vision system, which is more sensitive to certain spatial frequencies than others. The WT represents an image with varying frequency and spatial resolution and thus captures the essence of an image. Besides the multi-resolution representation, the WT also effects image data reduction, an important factor affecting practical implementation. Pattern classification, a critical stage in the recognition process, is the hurdle in many cases. The sheer amount of data within an image, as well as its variability, renders many classical methods ineffective. Attempts to address this difficulty using analytical methods, statistical techniques or rule-based approaches have generally been inadequate. Neural network theory has already provided useful solutions to a wide range of problems [3]. Unfortunately, these signal processing methods do not deal with the problem of information overlapping (fuzziness), which arises from natural variations, original
measurements, data processing methodology, etc. Fuzzy set theory provides useful concepts and tools to deal with imprecise signal representation and information, so that an element may partially belong to a class and simultaneously belong to more than one class. One promising way is to design an adaptive fuzzy neural network pattern classifier to partition the feature vector space, which may have overlapping memberships.
2 Description of the Image Recognition System
We propose to design and implement an image recognition system using features extracted from the wavelet transform of the image as inputs to a pattern classifier. First, an adaptive neural network is proposed for the wavelet transform of the signal. Then the input features are computed via co-occurrence matrices, which are based on the compressed signal features produced by the WT. Finally, a fuzzy central cluster neural network classifier (FCCNN) is provided for pattern classification. The proposed system is fully parallel and distributive, and it can be implemented in software and hardware. The recognition system consists of two phases, as shown in Fig. 1.
WT via Neural Network
Feature Extract
Fuzzy Neural Network Classifier Design
Training Phase Unknown Images
WT via Neural Network
Feature Extract
Fuzzy Neural Network Classifier
Classification Phase Class Membership
Fig. 1. Block Diagram of the overall image recognition system
In the first phase an adaptive fuzzy neural network classifier is designed that generates fuzzy associative memory rules from a set of training images by product-space clustering. In the second phase an unknown image is input to the system to determine its class membership based on the fuzzy associative memory rules. A brief description of the system follows.
An Adaptive Neural Network for the Wavelet Transform. In image processing, a continuous signal is usually represented and processed by using its discrete samples. Our analysis is based on the assumption that the signal is band-limited and the discrete samples satisfy the classical Shannon sampling theorem. By an image, we mean a digitized gray-scale picture that consists of 256×256 pixels, each of which takes a value between 0 and 256.
We represent the value of the pixel in row x and column y of the image by f(x,y), and regard a given image f(x,y) as a vector in a 256×256-dimensional vector space. We wish to represent f(x,y) in some optimal sense by projecting it onto a chosen set of two-dimensional scaling functions φi(x,y). If an image is segmented into 256×256 block vectors, each of which is an m-dimensional vector, this requires finding projection coefficients ai such that the resultant vector H(x,y),
H(x,y) = Σ_{r=1}^{q} Σ_{s=1}^{p} a_rs φ_rs(x,y) = Σ_{i=1}^{q×p} a_i φ_i(x,y),   (2.1)
is either identical to f(x,y) or generates a minimal error vector in the optimal sense. Here q is the number of scaling functions spanning the signal, p is the number of pixels on which each of the overlapping scaling functions φ_rs(x,y) in this transform is supported, and a_rs is the projection coefficient of the signal on the space spanned by the scaling functions. An adaptive neural network can compute all the projection coefficients of the wavelet transform [4].
Requantization and Binarization. For recognition purposes we have to limit the gray-scale resolution to a lower value so as to minimize the computational burden. Nominally, the number of quantization levels is 256 for black-and-white images; in the case of color images, the three primary red, green and blue components are individually quantized to 256 levels. The wavelet-transformed image signal will, therefore, be requantized to only 64 levels.
Co-occurrence Matrix. A co-occurrence matrix is a square matrix whose elements correspond to the relative frequency of occurrence of pairs of gray levels of pixels separated by a certain distance in a given direction. The unnormalized co-occurrence matrix is denoted by P(i,j,d,θ), where i and j are the intensities of the pixels at locations (k,l) and (m,n), respectively, separated by a distance d in the direction θ with respect to the horizontal. P represents a second-order probability distribution. The image is divided into non-overlapping square windows of size, say, 16×16 pixels in this work. The co-occurrence matrices for all the windows in the various bands of the wavelet-filtered images are computed for different orientations in intervals of 45° using the definitions of the co-occurrence matrices. The co-occurrence matrices for the various orientations are given by [5]:
P_0° = #{[(k,l),(m,n)] | k−m = 0, l−n = d}   (2.2)
P_45° = #{[(k,l),(m,n)] | k−m = d, l−n = −d, or k−m = −d, l−n = d}   (2.3)
P_90° = #{[(k,l),(m,n)] | k−m = d, l−n = 0}   (2.4)
P_135° = #{[(k,l),(m,n)] | k−m = d, l−n = d, or k−m = −d, l−n = −d}   (2.5)
For each sub-image, four co-occurrence matrices of size 64×64, corresponding to 64 gray levels, are computed for the four orientations based on the projection coefficients of the WT of each sub-image.
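A minimal sketch of (2.2)-(2.5) for one window follows, assuming the window is a 2-d integer array already requantized to 64 levels; the offset table and normalization follow the standard Haralick conventions, and the names are illustrative.

```python
import numpy as np

# pixel offsets (row, col) for the four orientations
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def cooccurrence(window, d=1, levels=64):
    """Return one normalized co-occurrence matrix per orientation."""
    mats = {}
    h, w = window.shape
    for angle, (dr, dc) in OFFSETS.items():
        p = np.zeros((levels, levels), dtype=float)
        for r in range(h):
            for c in range(w):
                r2, c2 = r + d * dr, c + d * dc
                if 0 <= r2 < h and 0 <= c2 < w:
                    p[window[r, c], window[r2, c2]] += 1
                    p[window[r2, c2], window[r, c]] += 1  # symmetric counts
        if p.sum() > 0:
            p /= p.sum()                                  # normalize
        mats[angle] = p
    return mats
```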
638
H. Zeng and Y. Yi
CON = Σ_i Σ_j |i−j|² (P_{ij})²   (2.6)

The angular second moment ASM is defined as

ASM = Σ_i Σ_j (P_{ij})²   (2.7)

The entropy H is defined as

H = − Σ_i Σ_j P_{ij} log(P_{ij})   (2.8)

The image pixel correlation COR is defined as

COR = Σ_i Σ_j (i−u)(j−u)(P_{ij}) / σ²   (2.9)

where u = Σ_i Σ_j ij(P_{ij}) and σ² = Σ_i Σ_j (i−u)(j−u)(P_{ij}). The inverse difference moment IDM is defined as

IDM = Σ_{i,j, i≠j} P_{ij} / |i−j|²   (2.10)

In order to keep the feature vector dimension low, we take the average of the co-occurrence matrices to compute the features; thus each sub-image with a 16×16 window has one feature vector of dimension 5 whose elements are CON, ASM, H, COR and IDM for one orientation of the co-occurrence matrix. In other words, a feature vector with 20 dimensions represents a sub-image with a 16×16 window.
A Fuzzy Neural Network Classifier
Pattern classification, a critical stage in the recognition process, is the hurdle in many cases. In image recognition, different techniques are often used to classify pattern vectors into the known classes based on the class membership of the pattern vectors. Assume that the features extracted from every sub-image constitute a sample X_i = (x_{i1}, …, x_{is})ᵀ ∈ R^s, and the sample set to be classified is X = (X_1, …, X_N) ∈ R^{s×N}. The predefined classes of the samples are C = (c_1, …, c_m), where m is the number of predefined classes. The fuzzy partition of the features can be recorded in a fuzzy partition matrix as follows:

⎡ f_{c1}(X_1)  f_{c1}(X_2)  …  f_{c1}(X_N) ⎤
⎢      ⋮            ⋮                ⋮     ⎥   (2.11)
⎣ f_{cm}(X_1)  f_{cm}(X_2)  …  f_{cm}(X_N) ⎦

where f_{ck}(X_i) is the class membership of X_i belonging to c_k, i = 1, …, N, k = 1, …, m. The fuzzy neural network classifier is a time-varying fuzzy associative memory system, which maps balls of fuzzy sets in the input space to balls of fuzzy sets in the output space. This system adaptively estimates, stores and modifies the decomposed fuzzy associative memory rules from the training data. The network's synaptic connection matrix changes in time according to the competitive learning algorithm we proposed and converges to the fuzzy associative memory centroids in the input-output space. The more synaptic vectors clustered about a centroid of a fuzzy associative memory rule, the greater its weight in the recalled fit-vector output.
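The requantization and feature-computation steps above can be illustrated with a short sketch. The following Python code is an illustrative reading of formulas (2.6)–(2.10), not the authors' implementation: it requantizes a window to 64 levels, builds one unnormalized 0° co-occurrence matrix as in (2.2), and computes the five features. Since the compressed notation of (2.9) leaves u and σ ambiguous, the correlation below uses the standard Haralick marginal means and deviations instead.

```python
import numpy as np

def texture_features(window, levels=64, d=1):
    """Requantize a gray-scale window and compute CON, ASM, H, COR, IDM
    from the 0-degree co-occurrence matrix (distance d along a row)."""
    q = (window.astype(np.int64) * levels) // 256        # requantize 256 -> 64 levels
    P = np.zeros((levels, levels))
    np.add.at(P, (q[:, :-d].ravel(), q[:, d:].ravel()), 1)   # pair counts, def. (2.2)
    P /= P.sum()                                         # normalize to probabilities

    i, j = np.indices(P.shape)
    con = np.sum(np.abs(i - j) ** 2 * P ** 2)            # contrast (2.6)
    asm = np.sum(P ** 2)                                 # angular second moment (2.7)
    ent = -np.sum(P[P > 0] * np.log(P[P > 0]))           # entropy (2.8)
    ui, uj = np.sum(i * P), np.sum(j * P)                # marginal means (Haralick form)
    si = np.sqrt(np.sum((i - ui) ** 2 * P))
    sj = np.sqrt(np.sum((j - uj) ** 2 * P))
    cor = np.sum((i - ui) * (j - uj) * P) / (si * sj)    # correlation, cf. (2.9)
    off = i != j
    idm = np.sum(P[off] / np.abs(i - j)[off] ** 2)       # inverse difference moment (2.10)
    return np.array([con, asm, ent, cor, idm])
```

Concatenating this vector over the four orientations would give the 20-dimensional feature vector per 16×16 window described above.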
3 Summaries
We propose a new system of pattern classification for image recognition using neural networks. The merits of the proposed technique are:
1. The wavelet transform decomposes an input image into varying spatial and frequency resolutions. This results in better clustering properties, since the WT gives independent feature measures as well as data dimension compression.
2. An adaptive neural network is used to compute the WT. Hence the system is computationally efficient and adapts to the changing statistics of the images.
3. A new fuzzy neural network classifier accommodates overlapping clusters (class memberships) and therefore increases the classification accuracy.
References
1. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley, 1973.
2. S.G. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Trans. PAMI, 11(7), pp. 674–693, 1989.
3. A. Delopoulos et al., Invariant Image Classification Using Triple-Correlation-Based Neural Networks, IEEE Trans. on Neural Networks, Vol. 5, No. 3, pp. 392–407, 1994.
4. H. Zeng et al., A Fast Learning Algorithm for Solving a System of Linear Equations and Related Problems, Advances in Modeling & Analysis, A, Vol. 29, No. 3, pp. 33–39, 1995.
5. R.M. Haralick and L.G. Shapiro, Computer and Robot Vision, Volume I, Addison-Wesley, 1992.
SOM Based Image Segmentation Yuan Jiang, Ke-Jia Chen, and Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {jy, ckj}@ai.nju.edu.cn, [email protected]
Abstract. Image segmentation plays an important role in image retrieval systems. In this paper, a method for segmenting images based on a SOM neural network is proposed. At first, the pixels are clustered based on their color and spatial features, where the clustering process is accomplished with a SOM network. Then, the clustered blocks are merged into a specific number of regions. Experiments show that these regions can be regarded as segmentation results preserving some semantic meaning. This approach thus provides a feasible new solution for image segmentation, which may be helpful in image retrieval.
1 Introduction
The availability of large amounts of images brings the need to retrieve them based on their contents. Therefore content-based image retrieval (CBIR) has attracted researchers from different disciplines such as computer vision, image processing, databases, information retrieval, etc. Also, CBIR has demonstrated its wide usefulness in many fields such as biochemistry, medical diagnosis, education, etc. Many projects have been started to develop efficient retrieval systems [2, 3, 6, 7]. Image features such as color, texture, shape and spatial relationships of objects are commonly used in CBIR. In the process of extracting feature vectors of images, image segmentation is usually involved [5]. In this paper, a method of image segmentation based on the Self-Organizing feature Map (SOM) is proposed. At first, color and spatial information of the pixels is used in pixel clustering with SOM. Then, the clustering result is processed so that a specific number of regions is obtained, which is the final segmentation result. The rest of the paper is organized as follows. Section 2 briefly introduces image segmentation and SOM algorithms. Section 3 presents and illustrates the proposed method. Conclusions as well as some discussion of future work are given in Section 4.
2 Background
Image segmentation aims at dividing an image into several sections in terms of the semantic information of the image. Popular approaches [1] to segmentation include threshold techniques, edge-based methods, region-based techniques, and connectivity-preserving relaxation methods. Threshold techniques divide an image based on
grayscale information of pixels. Their weakness lies in ignoring spatial information, which may cause failure in the presence of blurred region boundaries. Edge-based methods distinguish objects through contour detection, as the intensity levels of objects usually diversify at their contours. The main idea of region-based methods is to group pixels of similar intensity levels. Neighboring pixels of similar intensity levels are grouped into a region first; then adjacent regions are merged under some criteria. Connectivity-preserving relaxation methods are usually referred to as the active contour model. SOM [4] can serve as a clustering tool for high-dimensional data. It constructs a topology in which the high-dimensional space is mapped onto map units in such a way that the relative topological distances between input data are preserved. The map units usually form a two-dimensional regular lattice. The training process of SOM is quite simple. Each map unit is associated with a reference vector. At first, all the reference vectors are randomly designated. Each input vector is compared with all the reference vectors, and the unit whose reference vector is most similar to the input vector is identified. Then, the reference vectors neighboring that of the identified unit are moved toward the input vector.
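As an illustration of this training rule, the following sketch implements one plain SOM variant in Python; the grid size and the learning-rate and neighborhood-radius schedules are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, radius0=3.0):
    """Train a SOM on `data` (n_samples x n_features); returns reference vectors."""
    rng = np.random.default_rng(0)
    h, w = grid
    # Reference vectors, randomly designated at first.
    weights = rng.random((h, w, data.shape[1]))
    # Grid coordinates of each map unit, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)

    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                 # decaying learning rate
        radius = radius0 * (1 - epoch / epochs) + 1e-9  # shrinking neighborhood
        for x in data:
            # Best-matching unit: the reference vector most similar to the input.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Move the BMU and its grid neighbors toward the input vector.
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            influence = np.exp(-grid_d2 / (2 * radius ** 2))
            weights += lr * influence[..., None] * (x - weights)
    return weights
```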
3 The Proposed Method
Features usually used in image segmentation include color, texture, shape, etc. Color features have received the most attention since the Color Histogram [8] was proposed. The proposed method can be regarded as a specific region-based image segmentation technique, which mainly utilizes color information to segment images. The aim of this method is to divide an image into n irregular blocks, where n is a variable dictated by users. For an image such as the one shown in Figure 1, every pixel is represented by a five-dimensional feature vector whose features are the x, y coordinates and the R, G, B values of the pixel, respectively. It is obvious that the x-coordinate and the y-coordinate encode the spatial information of a pixel, while the R, G and B values encode its color information. Note that the feature values have been normalized.

Fig. 1. A sample original image

Each five-dimensional feature vector is regarded as an input vector of the SOM network. The output of the SOM network is n classes. The SOM network is trained according to the rules introduced in Section 2. After the training is accomplished, input data vectors that are topologically close are mapped to the same class. That means the input vector space is divided into n classes. Thus a primitive clustering result is obtained, which is shown in Figure 2. As mentioned above, each pixel is associated with a certain class after clustering. However, pixels belonging to the same class are not always connected. As Figure 2 shows, there may exist a lot of isolated pixels and small blocks. Here we call those blocks 'scattered blocks'. Then, the next step is to eliminate the isolated pixels and merge the scattered blocks.
Fig. 2. The clustering result (annotated with an isolated pixel and scattered blocks)
The isolated pixels are eliminated with the help of a gliding window. If an isolated pixel is found in the gliding window, its associated class is substituted by the most frequent class occurring in the gliding window. The gliding window moves across the whole image from left to right, and from top to bottom. This process is repeated until no isolated pixels are left. For example, with a 3*3 gliding window, Figure 2 becomes Figure 3.
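A minimal sketch of this gliding-window cleanup, assuming the clustered class labels are held in a 2-D integer array; the 3×3 window matches the example in the text:

```python
import numpy as np
from collections import Counter

def eliminate_isolated_pixels(labels, win=3):
    """Replace each isolated pixel's class by the most frequent class in its
    win x win gliding window; repeat until no isolated pixels are left."""
    r = win // 2
    for _ in range(100):                       # passes over the image; usually few are needed
        changed = False
        padded = np.pad(labels, r, mode="edge")
        for y in range(labels.shape[0]):
            for x in range(labels.shape[1]):
                window = padded[y:y + win, x:x + win].ravel()
                center = labels[y, x]
                # A pixel is isolated if no other pixel in its window shares its class.
                if np.count_nonzero(window == center) == 1:
                    counts = Counter(window.tolist())
                    del counts[center]
                    labels[y, x] = counts.most_common(1)[0][0]
                    changed = True
        if not changed:
            break
    return labels
```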
Fig. 3. Result after eliminating isolated pixels
Fig. 4. Result after merging scattered blocks based on size and position
The scattered blocks are merged as follows: blocks with the least number of pixels are identified and then incorporated into their largest neighboring blocks. First, the scattered blocks are ranked in terms of pixel number. Then, the block with the least pixel number is merged into its largest neighbor. Third, the ranking of the enlarged block is adjusted while the merged block is eliminated from the rank list. The process is repeated until there are 2*n blocks left. Now Figure 3 becomes Figure 4. Then, the mean of the R, G, B values of each remaining block is calculated, and the distance between each pair of neighboring blocks is computed as a Euclidean distance. The block with the least mean of R, G, B values is identified and merged into its nearest neighboring block. Then, the R, G, B values of the enlarged block and its related distances are recalculated. This process is repeated until there are n blocks left. As shown in Figure 5, these n blocks are the ultimate result of segmentation.

Fig. 5. Result after merging scattered blocks based on color and position
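The size-based merging step can be sketched as follows; the block and adjacency representations are illustrative assumptions (the paper does not specify data structures), and every block is assumed to have at least one neighbor:

```python
def merge_scattered_blocks(blocks, adjacency, n):
    """Merge until 2*n blocks remain: repeatedly fold the block with the
    fewest pixels into its largest neighboring block.
    `blocks` maps block id -> set of pixel coordinates;
    `adjacency` maps block id -> set of neighboring block ids."""
    while len(blocks) > 2 * n:
        smallest = min(blocks, key=lambda b: len(blocks[b]))
        neighbors = adjacency.pop(smallest)
        target = max(neighbors, key=lambda b: len(blocks[b]))
        blocks[target] |= blocks.pop(smallest)      # absorb the pixels
        for nb in neighbors:                        # rewire adjacency to the target
            adjacency[nb].discard(smallest)
            if nb != target:
                adjacency[nb].add(target)
                adjacency[target].add(nb)
    return blocks, adjacency
```

The subsequent color-based stage works the same way, except that the merge choice is driven by the Euclidean distance between mean R, G, B values instead of block size.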
4 Conclusion
Many multimedia applications require retrieving data based on their contents. In the process of content-based image retrieval, image segmentation is usually involved. In this paper, a method of SOM-based image segmentation is proposed. A set of five-dimensional vectors is used to represent an image, which is fed to a SOM network. After the clustering result is gained, isolated pixels are eliminated and scattered blocks are merged. Experiments show that this method can obtain preferable segmentation results. In order to improve segmentation quality, more features such as texture features may be employed in the clustering process and the post-processing of the clustering result, which is an issue to be explored in future work.
Acknowledgement. The authors want to thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China under the grant number 60105004, and the Natural Science Foundation of Jiangsu Province, China, under the grant number BK2001406.
References
1. Asano, T., Chen, D.Z., Katoh, N. and Tokuyama, T. Polynomial-time solutions to image segmentation. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, Atlanta, GA, 104–113, 1996.
2. Carson, C., Thomas, M., Belongie, S., Hellerstein, J.M. and Malik, J. Blobworld: a system for region-based image indexing and retrieval. In Proceedings of the 3rd International Conference on Visual Information Systems, Amsterdam, The Netherlands, 509–516, 1999.
3. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D. and Yanker, P. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23–32, 1995.
4. Kohonen, T. Self-Organizing Maps, 2nd edition, Springer-Verlag, Berlin, 1997.
5. Park, S.H., Yun, I.D. and Lee, S.U. Color image segmentation based on 3-D clustering: morphological approach. Pattern Recognition, 31(8):1061–1076, 1998.
6. Pentland, A., Picard, R.W. and Sclaroff, S. Photobook: content-based manipulation of image databases. International Journal of Computer Vision, 18(3):233–254, 1996.
7. Smith, J. and Chang, S.F. VisualSEEK: a fully automated content-based image query system. In Proceedings of the 4th ACM Multimedia Conference, Boston, MA, 87–98, 1996.
8. Swain, M.J. and Ballard, D.H. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.
User's Interests Navigation Model Based on Hidden Markov Model Jing Shi, Fang Shi, and HangPing Qiu Computer Department of NanJing University of Science and Technology, 210007 Nanjing, China [email protected]
Abstract. To find users' frequent interest navigation patterns, we combine the information of web content and web server logs and mine the web data to build a hidden Markov model. In our approach, we first build a hidden Markov model according to page content and web server logs, and then we present a new incremental discovery algorithm, Hmm_R, to discover the interest navigation patterns. …
1 Introduction
When a user accesses a web site, he has some interest. Because different users have different interests, they access the web site through different paths. On the other side, the designer of the web site makes the web pages conform to some kind of interest classification structure. So the navigation patterns can reflect the users' access interests. WebWatcher [1] processes the logs of a web site, organizes the log data into transactions, and uses classical data mining approaches to mine the data. But the result cannot be used to adjust the web site's content automatically. The PageGather [2] approach proposed the idea of optimizing the structure of web sites based on co-occurrence patterns of pages within usage data for the site. But this approach flattens the web site's structure, and when there are a large number of web pages, the number of index pages becomes overabundant. In our approach, we combine the technologies of web content and usage mining, integrate the distribution of the subjects in the user navigation patterns and the user's degree of attention to web pages (obtained from the visit time), first build a hidden Markov model [3] according to web server logs, and then discover the interest navigation patterns. These methods can be broadly used in analyzing paths, recommendation, and reconstructing the web site.
2 HMM for User Interest Navigation
2.1 Usage Preprocessing
Given the support and confidence, we can employ an association rule discovery algorithm to discover rules on the transaction sets, and thereby discover the frequent sequence set S. The above preprocessing tasks result in a set of pageview records, S = {⟨url_1, t_1⟩, ⟨url_2, t_2⟩, …, ⟨url_n, t_n⟩}, where t_i is the average visit time of the pageview represented by url_i in the frequent sequence pattern set S. It will be used to determine the weight of the user's attention to the page.
2.2 Content Preprocessing
Content preprocessing analyzes the association between subjects and pageviews. An ontology-based automatic web page annotation method is used to discover the subjects of a pageview [4]. The preprocessing task is to obtain a set p(σ_1, µ_1; σ_2, µ_2; …; σ_i, µ_i) for a pageview, where µ_i is the weight representing the correlation of subject σ_i and pageview p.
2.3 Interest Navigation Model Based on HMM
From the patterns, we can confirm that the users have more interest in finding useful information about certain subjects and not others. These patterns can be regarded as the common interest frequent navigation patterns according to the subject.
2.3.1 Computing the Transition Probability and Attention Probability of a Subject on a Pageview
Definition 1: The one-step transition probability of two pages in set S:

P(q_i → q_j) ≈ count(q_i → q_j) / count(q_i)

Here q_i → q_j represents that q_i and q_j are linked directly by a hyperlink. count(q_i → q_j) represents the number of transactions in S in which users access the web site from q_i to q_j in one step. count(q_i) represents the number of transactions in S, each of which contains q_i.
Definition 2: The attention probability of a keyword σ, which the user pays to the page q_i:

p(σ | q_i) = µ × t'_i
Here µ is the weight of the subject σ in the page content, computed in the content preprocessing, and t'_i is the ratio of the time the user spends visiting the page to the user's total visit time. p(σ | q_i) represents the degree of attention the user pays to the subject σ in the page q_i.
Definition 3: The interest association rule R(σ | S_k): given an interest frequent navigation pattern S_k and a subject σ, the rule holds if R(σ | S_k) ≥ C (C is a given confidence threshold), where

R(σ | S_k) = Π_{i=1}^{k−1} ( P(q_i → q_{i+1}) × P(σ | q_{i+1}) )
2.3.2 Definition of the Interest Navigation Model
1. A pageview is a state q in the HMM.
2. There is a subject set Σ = {σ_1, σ_2, …, σ_M}; σ_i is a subject on a pageview.
3. A pageview includes a subset (σ'_1, …, σ'_m) of Σ.
4. If q and q' are linked directly by a hyperlink, they have a transition probability P(q_i → q_j).
5. The attention probability p(σ | q_i) is the attention the users pay to the subject σ on the page q_i.
Through analyzing the log file, we can build an INM (Interest Navigation Model) that satisfies these definitions.
2.4 Discovering the Interest Navigation Patterns
The interest association rules represent all the users' navigation patterns over pageviews. In order to discover the interest association rules, we give an incremental discovery algorithm:

Algorithm: Hmm_R
Input: Q, S, C
Begin:
  k := 1; j := 1; S^1 := S;
  While j = 1 do
    j := 0;
    For each s ∈ S^k
      For each q ∈ Q
        If R(σ | (s, q)) ≥ C then
          S^{k+1} := S^{k+1} + (s, q);
          R^{k+1} := R^{k+1} + R(σ | (s, q));
          j := 1;
        End if;
      End for;
    End for;
    k := k + 1;
  End while;
End.
Output: R^k, k = 1, …, n
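A compact Python reading of Hmm_R follows, under the assumption that the transition probabilities of Definition 1 and the attention probabilities of Definition 2 for the subject under consideration have already been estimated from the logs; the dictionary-based interfaces are illustrative, not the authors':

```python
def hmm_r(Q, S, C, trans_prob, attention):
    """Incremental discovery of interest navigation patterns (Hmm_R).
    Q: set of pageviews; S: frequent 1-patterns (seed pageviews);
    trans_prob[(qi, qj)]: one-step transition probability (Definition 1);
    attention[q]: p(sigma|q) for the current subject (Definition 2);
    C: confidence threshold."""
    patterns = {1: [[s] for s in S]}
    rules = {}
    k = 1
    while patterns.get(k):                    # j-flag loop of the pseudocode
        patterns[k + 1], rules[k + 1] = [], []
        for s in patterns[k]:
            for q in Q:
                path = s + [q]
                # R(sigma | path): the product of Definition 3 over the path.
                r = 1.0
                for qi, qj in zip(path, path[1:]):
                    r *= trans_prob.get((qi, qj), 0.0) * attention.get(qj, 0.0)
                if r >= C:
                    patterns[k + 1].append(path)
                    rules[k + 1].append((path, r))
        k += 1
    return rules
```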
3 Conclusions and Future Work
Our approach is essentially a recommendation approach based on web usage mining and web content mining. By mining the user transaction records and the subjects on the pageviews, we recommend the mining results in order to accelerate the user's access. We first use the HMM to analyze the common interest navigation patterns. This approach addresses the self-adaptation problem of the web site. In our approach, building the HMM does not require a complex training process, and the transition probability and the common attention probability of a subject on a pageview are easily calculated. Our approach has some notable characteristics. 1) It is a kind of optimizing approach. 2) The mining objects are the interactive actions and the common interests, and the mining results apply to all users. 3) The pages in a discovered navigation pattern do not always have direct hyperlinks to each other. The next step of our work will explore not only recommendation approaches but also prediction approaches, so that we can predict the interests of users in the future.
References
1. Cooley, R., Mobasher, B. et al. Grouping Web page references into transactions for mining World Wide Web browsing patterns. In: Proc. Knowledge and Data Engineering Workshop, Newport Beach, CA, 1997, 245–253.
2. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), New York, 1989, 257–286.
3. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), New York, 1989, 257–286.
4. Horrocks, I., et al. The Ontology Interchange Language: OIL, tech. report, Free Univ. of Amsterdam, 2000.
Successive Overrelaxation for Support Vector Regression Yong Quan, Jie Yang, and Chenzhou Ye Inst. of Image Processing & Pattern Recognition, Shanghai Jiaotong Univ., Shanghai 200030, People's Republic of China Correspondence: Room 2302B, No.28, Lane 222, fanyu Road, Shanghai City, 200052, People's Republic of China [email protected]
Abstract. Support vector regression (SVR) is an important tool for data mining. In this paper, we first introduce a new way to give SVR a mathematical form similar to that of support vector classification. Then we propose a versatile iterative method, successive overrelaxation (SOR), for the solution of extremely large regression problems using support vector machines. Experiments prove that this new method converges considerably faster than other methods that require the presence of a substantial amount of the data in memory.
Keywords: Support Vector Regression, Support Vector Machine, Successive Overrelaxation, Data Mining
In this work, we propose a new way to give SVR a mathematical form similar to that of support vector classification, and derive a generalization of SOR to handle regression problems. Simulation results indicate that the modification of SOR for the regression problem yields dramatic runtime improvements.
1 Successive Overrelaxation for Support Vector Machines
Given a training set of instance-label pairs (x_i, y_i), i = 1, …, l, where x_i ∈ R^n and y_i ∈ {1, −1}, Mangasarian outlines the key modifications from the standard SVM to the RSVM [1]. Suppose a is the solution of the dual optimization problem [2]. Choose ω ∈ (0, 2). Start with any a^0 ∈ R^l. Having a^i, compute a^{i+1} as follows:

a^{i+1} = ( a^i − ωD^{−1}( Aa^i − e + L(a^{i+1} − a^i) ) )_*   (1)

where (·)_* denotes the 2-norm projection on the feasible region, that is,

(a_i)_* = 0 if a_i ≤ 0;  a_i if 0 < a_i < C;  C if a_i ≥ C,  i = 1, …, l.
2 Simplified SVR and Its SOR Algorithm
Most of the existing training methods were originally designed to be applicable only to the SVM. In this paper, we propose a new way to give SVR a mathematical form similar to that of support vector classification, and derive a generalization of SOR to handle regression problems.
2.1 Simplified SVR Formulation
Similar to [3], we also introduce an additional term b²/2 into SVR. Hence we arrive at the formulation stated as follows:

min  (1/2) wᵀw + (1/2) b² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
s.t. y_i − w·φ(x_i) − b ≤ ε + ξ_i
     w·φ(x_i) + b − y_i ≤ ε + ξ_i*
     ξ_i, ξ_i* ≥ 0,  i = 1, …, l                                        (2)

The solution of (2) can be transformed into the dual optimization problem

min  (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} (a_i − a_i*)(a_j − a_j*) φ(x_i)·φ(x_j)
     + (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} (a_i − a_i*)(a_j − a_j*)
     + ε Σ_{i=1}^{l} (a_i + a_i*) − Σ_{i=1}^{l} (a_i − a_i*) y_i
s.t. a_i, a_i* ∈ [0, C]                                                 (3)

The main reason for introducing our variant (2) of the SVR is that its dual (3) does not contain an equality constraint, as does the dual optimization problem of the original SVR. This enables us to apply in a straightforward manner effective matrix splitting methods such as those of [3] that process one constraint of (2) at a time through its dual variable, without the complication of having to enforce an equality constraint at each step on the dual variable a. This permits us to process massive data without bringing it all into fast memory. Define

Z = (d_1φ(x_1), …, d_lφ(x_l), d_{l+1}φ(x_{l+1}), …, d_{2l}φ(x_{2l}))ᵀ  (a 2l-row matrix, where x_{l+i} = x_i),
a = (a_1, …, a_l, a_1*, …, a_l*)ᵀ ∈ R^{2l×1},
d = (1, …, 1, −1, …, −1)ᵀ ∈ R^{2l×1},
c = (y_1 − ε, …, y_l − ε, −y_1 − ε, …, −y_l − ε)ᵀ ∈ R^{2l×1},
H = ZZᵀ,  E = ddᵀ.

Thus (3) can be expressed in a simpler way:

min  (1/2) aᵀHa + (1/2) aᵀEa − cᵀa
s.t. a_i ∈ [0, C],  i = 1, …, 2l                                        (4)

If we ignore the difference in matrix dimensions, (4) has a mathematical form similar to the dual of the classification SVM, so many training algorithms that are used for the SVM can be used for SVR.
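As a sketch of how (4) can be assembled in practice — assuming a Mercer kernel plays the role of φ(x_i)·φ(x_j); the function names here are illustrative, not from the paper:

```python
import numpy as np

def build_dual_matrices(X, y, eps, kernel):
    """Assemble H, E and c of problem (4) from training data.
    `kernel(X, X)` returns the l x l Gram matrix of phi(x_i).phi(x_j)."""
    l = len(y)
    K = kernel(X, X)
    d = np.concatenate([np.ones(l), -np.ones(l)])   # signs of a_i and a_i*
    # H = Z Z^T: the Gram block repeated, scaled by the signs d_i d_j.
    K2 = np.block([[K, K], [K, K]])
    H = np.outer(d, d) * K2
    E = np.outer(d, d)                              # E = d d^T, from the b^2/2 term
    c = np.concatenate([y - eps, -y - eps])         # linear term of (4)
    return H, E, c

def rbf(A, B, width=0.1):
    """Gaussian kernel exp(-||x1 - x2||^2 / width), as in the experiments below."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / width)
```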
2.2 SOR Algorithm for SVR
Here we let A = H + E and A = L + D + Lᵀ, where the nonzero elements of L ∈ R^{2l×2l} constitute the strictly lower triangular part of the symmetric matrix A, and the nonzero elements of D ∈ R^{2l×2l} constitute the diagonal of A. The SOR method, which is a matrix splitting method that converges linearly to a point a satisfying (4), leads to the following algorithm:

a_j^{i+1} = ( a_j^i − ω A_{jj}^{−1} ( Σ_{k=j}^{2l} A_{jk} a_k^i − c_j + Σ_{k=1}^{j−1} A_{jk} a_k^{i+1} ) )_*   (5)

A simple interpretation of this step is that one component a_j of the multiplier is updated at a time, bringing in one constraint of (4) at a time.
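Update (5) amounts to a projected Gauss–Seidel sweep with relaxation factor ω. The following minimal sketch implements it directly (iteration count and convergence test are illustrative simplifications); when a is swept in place, the entries before position j already hold their new values, so the two sums of (5) collapse into a single dot product:

```python
import numpy as np

def sor_svr(H, E, c, Cbox, omega=1.0, iters=100):
    """Successive overrelaxation for the box-constrained dual (4)-(5)."""
    A = H + E
    m = len(c)
    a = np.zeros(m)
    for _ in range(iters):
        for j in range(m):
            # Gradient component: entries k < j are new, k >= j are old, as in (5).
            g = A[j] @ a - c[j]
            # Relaxed step followed by the (.)* projection onto [0, C].
            a[j] = min(max(a[j] - omega * g / A[j, j], 0.0), Cbox)
    return a
```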
3 Experimental Results
The SOR algorithm is tested against the standard chunking algorithm and against the SMO method on a series of benchmarks. SOR, SMO and chunking are all written in C++, using Microsoft's Visual C++ 6.0 compiler. Joachims' package SVMlight (version 2.01) with a default working set size of 10 is used to test the decomposition method. The CPU time of all algorithms is measured on an unloaded 633 MHz Celeron II processor running Windows 2000 Professional. The chunking algorithm uses the projected conjugate gradient algorithm as its QP solver, as suggested by Burges [4]. All algorithms use sparse dot product code and kernel caching. Both SMO and chunking share folded linear SVM code. In this experiment, we consider the approximation of the sinc function f(x) = sin(πx)/(πx). Here we use the kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖²/0.1), C = 100 and ε = 0.1. Figure 1 shows the approximation results of the SMO method and the SOR method respectively.

Fig. 1. Approximation results of SMO method (a) and SOR method (b)
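A hypothetical driver tying the two sketches above together on the sinc benchmark (the sample size, grid and iteration count are illustrative; under the b²/2 variant the bias is b = Σ(a_i − a_i*)):

```python
import numpy as np

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = np.sinc(X).ravel()                      # np.sinc(x) = sin(pi x) / (pi x)
H, E, c = build_dual_matrices(X, y, eps=0.1, kernel=rbf)
a = sor_svr(H, E, c, Cbox=100.0, omega=1.0, iters=200)
beta = a[:50] - a[50:]                      # a_i - a_i* for each sample
f_hat = rbf(X, X) @ beta + beta.sum()       # bias b = sum(a_i - a_i*) in this variant
print(np.max(np.abs(f_hat - y)))            # rough check of the fit
```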
Table 1. Approximation effect of SVMs trained using various methods

Experiment   Time (sec)     Number of SVs   Expectation of Error   Variance of Error
SOR          0.232±0.023    9±1.26          0.0014±0.0009          0.0053±0.0012
SMO          0.467±0.049    9±0.4           0.0016±0.0007          0.0048±0.0021
Chunking     0.521±0.031    9±0             0.0027±0.0013          0.0052±0.0019
SVMlight     0.497±0.056    9±0             0.0021±0.0011          0.0060±0.0023
From Table 1 we can see that the SVMs trained with the various methods have nearly the same approximation accuracy, and the SOR algorithm is faster than the other algorithms.
4 Conclusion
In summary, SOR is a simple method for training support vector regression machines which does not require a numerical QP library. Because its CPU time is dominated by kernel evaluation, SOR can also be dramatically sped up by the use of kernel optimizations, such as linear SVR folding and sparse dot products. SOR can be anywhere from several to hundreds or even thousands of times faster than the standard chunking algorithm, depending on the data set.
References
[1] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, MIT Press, 1998.
[2] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of IEEE NNSP'97, 1997.
[3] O.L. Mangasarian and D.R. Musicant. Successive overrelaxation for support vector machines. IEEE Trans. on Neural Networks, 1999, 10(5): 1032–1037.
[4] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998, 2(2).
Statistic Learning and Intrusion Detection Xian Rao, Cun-xi Dong, Shao-quan Yang P.O. Box 135, Xidian University, Xi'an, China, zip 710071 [email protected]
Abstract. The goal of intrusion detection is to determine whether there are illegal or dangerous actions or activities in a system by checking the audit data on local machines or information gathered from the network. It can also be viewed as the problem of finding the relationship between the audit data on local machines or information gathered from the network and the states of the system to be protected, that is, normal or abnormal. Statistical learning theory studies exactly this problem of finding an unknown relationship from samples of limited size. Statistical learning theory is introduced briefly. By modeling the key steps of intrusion detection, the relationship between the two problems can be found, and the possibility of using the methods of statistical learning theory in intrusion detection is analyzed. Finally, a recent result of statistical learning theory—support vector machines—is used in a simulation of network intrusion detection using the DARPA data. The simulation results show that support vector machines can detect intrusions very successfully. This overcomes many disadvantages of the methods now available: it lowers the false positive rate while achieving a higher detection rate, and since it uses small-sized samples, it greatly shortens the training time.
Keywords: statistical learning, intrusion detection, network security, support vector machines, neural network
1 Introduction
Statistical learning plays a fundamental role in AI study. It has accumulated a large body of theory since it began some 35 years ago. But the approximation theory that traditional statistical learning studied concerns statistical properties based on infinite-size samples. In practice, the size of the samples is usually limited, so the generalization of a machine that learns well on the training data may become poor. The deeper parts of statistical learning did not attract much attention until recent years. The Structural Risk Minimization principle and the Minimum Description Length principle have become focuses of study, and research on the small-sized-samples problem is being carried out. Along with the fast development of the Internet, network security becomes more important each day. Only when the security problem is solved can we take full advantage of the network. Intrusion detection is an important area in network security, so it receives a lot of attention, and many approaches have been studied to solve the intrusion detection problem.
By studying statistical learning and intrusion detection, we find that the two problems have something in common, so the new method in statistical learning—support vector machines—can be used in intrusion detection.
2 Statistic Learning
Statistical learning theory [1] is a statistical learning framework for the small-sized-samples problem. It is an important development of and complement to traditional statistics. It provides a theoretical frame for machine learning theory and methods given size-limited samples. The core of the theory is to control the generalizing ability of a learning machine by controlling its capacity.
2.1 The Description of the Learning Problem
A learning problem based on samples can be described by a three-part model as Figure 1 shows. There, G is a generator that generates random vectors x ∈ R^N, drawn independently from a definite but unknown distribution function F(x). S is a trainer that returns an output y for every input x according to the same definite but unknown distribution F(y|x). LM is a learning machine that can realize a certain function set f(x, α), α ∈ Λ, where Λ is a parameter set.

Fig. 1. Learning model based on samples

A learning problem is to choose the function that best approximates the response of the trainer from a given function set f(x, α), α ∈ Λ. This choice is based on a training set composed of l samples drawn independently according to the joint distribution F(x, y) = F(x)F(y|x):

(x_1, y_1), …, (x_l, y_l)   (1)

In order to make the chosen function the best approximation, it is necessary to measure the error or loss L(y, f(x, α)) between the output of the trainer and that of the response function f(x, α) given the input x. Its mathematical expectation is

R(α) = ∫ L(y, f(x, α)) dF(x, y)   (2)
This is also called the risk functional. The aim of learning is to look for the function f(x, α_0) that makes the risk functional minimal when the joint distribution is unknown and all the information is contained in the training set shown in formula (1).
2.2 Traditional Statistic Learning Theory
The traditional learning machine is based on the Empirical Risk Minimization (ERM) principle, but it has been found in experiments that the generalization of all these machines is not good given small-sized samples. Statistical learning goes back to its original point, that is, studying the learning machine given small-sized samples. Statistical learning theory can be divided into four parts. The first is the consistency theory of the learning process. This theory answers under which conditions the learning process based on ERM is consistent. The second is the non-asymptotic theory of the rate of convergence. This theory answers at what speed the learning process converges. It is concluded that the bound on generalization is composed of two parts: one is the empirical risk
R_emp(α_l), and the other is the confidence interval Φ(l/h_k), where l is the number of samples and h_k represents the VC dimension. The third is the theory of
controlling the generalizing ability of the learning procedure. This theory concerns how to control the generalizing ability during the learning procedure. Since the bound on generalization is controlled by two parts, a learning algorithm based only on the ERM principle cannot guarantee its generalizing ability given small-sized samples. So a more universal principle arises—the Structural Risk Minimization (SRM) principle. The principle is to minimize the risk functional based on both the empirical risk and the confidence interval. Thus the generalization ability given small-sized samples can be guaranteed. The last part is the theory of constructing learning algorithms. Algorithms that are able to control the generalizing ability can be built based on this theory. These algorithms include Neural Networks (NN) and Support Vector Machines (SVM).
2.3 The Common Methods
NN is a learning algorithm that minimizes the empirical risk without changing the confidence interval. This idea is used to estimate the weights of the neural cells. The method that uses a sigmoid approximation function to compute the empirical risk gradients is called the back propagation algorithm. According to these gradients, the parameters of the NN can be modified iteratively using standard gradient-based computation. Under the statistical learning frame, a new universal learning method, the support vector machine, has been produced. It learns by fixing the empirical risk and minimizing the confidence interval. Its main idea is to construct a separating hyperplane in the linearly separable case and then generalize it to the linearly non-separable case. The SVM, a learning machine used for constructing optimal separating hyperplanes, solves the learning problem for small-sized samples efficiently.
3 Intrusion Detection
The original intrusion detection model was presented by Dorothy Denning [2] in 1987. Intrusion detection has now become an important task in the network security field.
3.1 Description of Intrusion Detection
Intrusion detection judges whether there are illegal or dangerous actions or activities in the system by checking the audit data on local machines or information gathered from the network. Almost every intrusion detector runs in three steps: an information gathering phase, a learning phase and a detecting phase. In the information gathering phase, all kinds of normal or abnormal information are collected. During the training and learning phase, the relationship between the gathered information and the system state is found by analyzing the information already known. Then, in the detecting phase, we can determine the state of unknown audit data or network traffic data according to the relationship obtained in the second phase. Of these three phases, the first two are more important, for they guarantee the correctness of the detection. Now a model for these two phases is made. The target system to be protected can be looked at as a generator, denoted by O. All the information output by the system can be translated into number vectors, denoted by x, by a processor P. The output of the trainer S is denoted by y. The training and learning process can be described by Figure 2.

Fig. 2. The model of the information gathering and training and learning phases of intrusion detection

The distribution F(x) of the parameter number vectors gained from processor P is unknown. The output y of the trainer S is produced according to the same definite but unknown probability function F(y|x). LM is a learning machine that can realize a certain function set f(x, α), α ∈ Λ, where Λ is a parameter set. Intrusion detection can be described as choosing the function that best approximates the response of the trainer from a given function set f(x, α), α ∈ Λ. This choice is based on a training set composed of l samples drawn independently according to the joint probability F(x, y) = F(x)F(y|x). The training set is denoted by T.
3.2 Available Methods in Intrusion Detection
Detecting intrusions can be divided into two categories: anomaly intrusion detection and misuse detection [3]. Anomaly detection means establishing a "normal" behavior pattern for users, programs or resources in the system and then looking for deviations from this behavior. The methods often used are quantitative analysis, statistical approaches, non-parametric statistical analysis and rule-based algorithms. According to the model above, their process can be summarized as follows. In the information gathering phase, all the data collected represent legal or expected behavior, so every output of the trainer is +1. Supposing that the number of training samples gathered is l_a, the training set can be denoted as T_a = {(x_i, +1) | i = 1, …, l_a}. The learning machine chooses the function that best approximates the response of the trainer from the given function set f(x, α), α ∈ Λ, based on the training set T_a. Misuse detection means looking for known malicious or unwanted behavior according to a knowledge base built on gathered intrusion information. The main approaches are simple matching, expert systems and state transition methods. In the information gathering phase, intrusion knowledge or anomalous behaviors are collected, so every output of the trainer is −1. Supposing that the number of training samples gathered is l_n, the training set can be denoted as T_n = {(x_i, −1) | i = 1, …, l_n}. The learning machine chooses the function that best approximates the response of the trainer from the given function set f(x, α), α ∈ Λ, based on the training set T_n.
3.3 Intrusion Detection and Statistic Learning
When the key steps of intrusion detection are described by a mathematical model, it can be seen that intrusion detection can be looked at as the problem of finding the relationship between system audit data or network traffic data and the system state, based on known knowledge. The key point here, finding the relationship, is exactly the problem studied by statistical learning, so it is feasible to solve intrusion detection using statistical learning methods. In intrusion detection, both computer systems and network systems have only two states, being intruded or not, denoted −1 and +1. So the output of the trainer in the intrusion detection model also takes two values, +1 and −1, that is, y ∈ {+1, −1}; the pattern recognition setting is very suitable here. One difference between intrusion detection and statistical learning is that intrusion detection uses different types of data: some may be of string type, some of char type. So most intrusion detection systems have preprocessor parts to translate the gathered information into the type that the detector can read. Because statistical learning deals with number vectors, it is necessary to translate all gathered information into numbers in the preprocessor when using statistical learning methods.
By analyzing the intrusion detection model shown above, it can be seen that both anomaly detection and misuse detection use unilateral knowledge, which brings some disadvantages that are hard to overcome: the false alarm rate of anomaly detection is high, and misuse detection cannot detect unknown attacks. Neural networks, as one family of statistical learning methods, have been used in detection, but only for anomaly detection or for misuse detection alone. This means that the training set used is unilateral, so the advantage of the NN is not put to good use. Besides that, the NN not only needs a large training data size but is also inclined to get stuck in local minima. In most cases, the training set we can get is not very big, so the new statistical learning method based on small-sized samples—the SVM—is an approach very suitable for intrusion detection. At the same time, using both normal and abnormal information in training improves the detection performance.
4 Using SVM in Intrusion Detection
The SVM, a new learning method in statistical learning, has found successful uses in handwriting recognition [4] and face recognition [5]. We now find that it performs well in intrusion detection too.
4.1 DARPA Data
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks. The raw training data consisted of about five million connection records from seven weeks of network traffic. Similarly, the two weeks of test data yielded around two million connection records for testing. A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection has 41 features divided into three categories. Nine features in the first category indicate the basic information of a network connection, such as the protocol type (char) and the bytes from the source (float). There are 13 features in the second category, called derived features; they are the domain knowledge provided by the connected hosts, including the number of failed login attempts, the number of created files and so on. The 19 features in the last category reflect the statistical character of the network traffic, for example, the number of connections to the same host as the current connection in the past two seconds, the percentage of connections that have "SYN" errors in the past two seconds and so on. Finally, each connection is labeled as either normal or as an attack, with exactly one specific attack type. There were over twenty kinds of attacks simulated in the 7 weeks. They fall into four categories: DOS: denial-of-service, e.g. syn flood; R2L: unauthorized access from a remote machine, e.g. guessing passwords; U2R: unauthorized access to local super user (root)
privileges, e.g., various "buffer overflow" attacks; probing: surveillance and other probing, e.g., port scanning. During the last 2 weeks of gathering test data, besides the attacks used before, new attacks were added to test the generalizing ability.
4.2 Data Preprocessing
As discussed above, preprocessing must be done before using the statistical method to detect. There are two reasons. The first is to unify the data form, that is, to translate different types of data into a number vector. The second is to minimize the differences between the elements. For the network connection data, the types of different features may be different; even when their types are the same, their ranges are not, so they should be preprocessed first. For those features whose types are string or char we need to code them, that is, translate them into numbers. Since the SVM classifies according to spatial character, the main purpose of the coding is to distinguish different strings or chars, and a simple coding method performs this function. Besides type translation, minimizing the differences between connections is necessary. For example, because of the uncertain number and size of transmitted data, the values of bytes from the source or bytes from the destination, and their variations, are large, whereas the percentage of the same-service rate in the past two minutes ranges from 0 to 1. In order to shorten this difference, the bytes from the source and from the destination need to be normalized.
4.3 Training Phase and Detecting Phase
In the training phase, the labeled connections are used to train the SVM in order to approximate the output of the trainer. The training data need to be translated into the form shown in formula (1) that can be used directly by the SVM. The training data set given has over 5,000,000 connections; only a small part is used for training. The generalization ability of the SVM guarantees that the detection system still performs well with small-sized samples. The training samples we used amount to only one or two thousandths of the set, so we greatly reduced the time used in training. In the detecting phase the SVM can classify unlabeled connections. We tested on all the test data of 2,000,000 connections and got satisfactory results.
4.4 Simulation Results
The effects of different normalizing methods in the preprocessing phase are shown in Figure 3. The training set used is only 1.8 thousandths of the whole training set. "no norm" means no normalization of the training sets; power normalization is denoted "power norm"; and max value normalization is denoted "max norm". The figure shows that using normalized data gets better results whatever normalizing method is used. At the same time, the size of the training data also affects intrusion detection performance: the more data used, the better the intrusion detection can perform. The ROC curves for different training set sizes are shown in Figure 4. The three curves in the figure represent training set sizes of 0.08%, 0.18% and 0.26%. This validates that the more samples are used, the better performance is obtained.
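The coding and normalization steps can be sketched as below; the column layout is an illustrative assumption about how the 41 connection features are stored, not the authors' implementation:

```python
import numpy as np

def preprocess(records, categorical_idx, byte_idx):
    """Turn raw KDD-style connection records into numeric vectors:
    categorical fields (e.g. protocol type) get simple integer codes,
    and the wide-ranging byte counts are max-normalized into [0, 1].
    Assumes `categorical_idx` covers every non-numeric column."""
    codes = [dict() for _ in categorical_idx]
    data = []
    for rec in records:
        row = list(rec)
        for slot, col in enumerate(categorical_idx):
            # Simple coding: each distinct string gets the next integer.
            row[col] = codes[slot].setdefault(row[col], len(codes[slot]))
        data.append([float(v) for v in row])
    data = np.array(data)
    for col in byte_idx:        # normalize bytes-from-source / bytes-from-destination
        m = data[:, col].max()
        if m > 0:
            data[:, col] /= m
    return data
```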
Fig. 3. The ROC curves using different normalizing methods; the training set is 0.18%

Fig. 4. The ROC curves for different training data sizes; the normalizing method is power normalization
5 Conclusions
By analysis and comparison of statistical learning theory and intrusion detection, it is found that the key problems to be solved in the two fields are the same. It is natural to apply methods from statistical learning to solve the intrusion detection problem. In most cases, the sizes of the training sets we can get are small, so it is practical to use the SVM in intrusion detection. By simulation, using the SVM in intrusion detection has many advantages, including shortened training time, a high detection rate with a lower false alarm rate, real-time detection, easy upgrading and so on. An attempt at applying a statistical learning method to intrusion detection is made in this paper. More work needs to be done in the future to perfect the application.
References
1. Vapnik, V.N. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
2. Denning, D.E. An Intrusion Detection Model. IEEE Trans. on Software Engineering, 1987, 13(2): 222–232.
3. Kumar, G. Classification and detection of computer intrusions [Ph.D. Thesis]. Purdue University, 1995.
4. Cortes, C., Vapnik, V. Support vector networks. Machine Learning, 1995(20): 273–297.
5. Osuna, E., Freund, R., and Girosi, F. Training Support Vector Machines: an Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition, 1997: 130–136.
Rao Xian received the B.S. degree from the Department of Electronics Engineering at Xidian University, Xi'an, China. She is now working toward a doctoral degree in the Department of Communication and Information at Xidian University. Dong Chunxi
A New Association Rules Mining Algorithms Based on Directed Itemsets Graph
Lei Wen¹,² and Minqiang Li¹
¹ School of Management, TianJin University, No.92 WeiJin Road, TianJin City, Post code 300072, China
² Department of Economy and Management, North China Electric Power University, No.204 QingNian Road, BaoDing City, Post code 071003, HeBei, China
[email protected] [email protected]
Abstract. In this paper, we introduce a new data structure called the DISG (Directed Itemsets Graph), in which the information of frequent itemsets is stored. Based on it, a new algorithm called DBDG (DFS-Based DISG) is developed using a depth-first searching strategy. Finally, we performed an experiment on a real dataset to test the run time of DBDG. The experiment showed that it is efficient for mining dense datasets.
1 Introduction
Mining association rules is an important task of knowledge discovery. Association rules describe interesting relationships among attributes in a database. The problem of finding association rules was first introduced by Agrawal [1] and has attracted great attention in database research communities in recent years. Mining association rules can be decomposed into two steps: the first is to generate the frequent itemsets; the second is to generate the association rules. The most famous algorithm is Apriori [2]. The algorithm employs a breadth-first and downward-closure strategy, and uses the fact that any subset of a large itemset must be large and any superset of a small itemset must be small to prune the search space. Most subsequent algorithms are variants of Apriori, such as Partition [3], DHP [4] and DIC [5]; they all need to scan the database multiple times. The Apriori-inspired algorithms show good performance on sparse datasets, but are not suitable for dense datasets. More recent work focuses on constructing a tree structure to replace the original database for mining frequent itemsets; FP-growth [6] and tree projection [7] are examples of this approach, and they perform better than the others. In this paper, we introduce a new algorithm, DBDG, which discovers frequent itemsets by using a directed itemsets graph. The algorithm uses a vertical database to count the support of itemsets. Each record of the vertical database is a pair ⟨item, Tidset⟩, where the Tidset is the set of TIDs of the transactions that support the item. So the support of frequent patterns can be counted efficiently via Tidset intersections, as the sketch below illustrates.
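As a minimal illustration of vertical support counting (the data are toy values, not from any benchmark):

```python
def support(itemset, vertical_db):
    """Support of an itemset from a vertical database:
    vertical_db maps item -> set of TIDs of transactions containing it."""
    tids = set.intersection(*(vertical_db[i] for i in itemset))
    return len(tids)

# Example:
vdb = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 4}}
print(support(("a", "b"), vdb))   # -> 2 (transactions 2 and 3)
```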
2 The DISG (Directed Itemsets Graph): Design and Construction
Definition 1 (directed graph): A directed graph D = ⟨V, A⟩ consists of a vertex set V and an arc set A. An arc A_ij directed from V_i to V_j has V_i as its tail and V_j as its head.
Definition 2: A directed itemsets graph DISG = ⟨V, A⟩ is defined as follows:
(1) The vertex set V of the DISG is the set of frequent 1-itemsets FI = {FI_1, FI_2, …, FI_n}. Each vertex has three fields: the first is the name of the frequent item; the second is the support of the frequent item, denoted Sup_i; the third is the list of adjacent vertices of the vertex.
(2) An arc indicates a frequent 2-itemset, with a number s corresponding to the support of the frequent 2-itemset.
Based on the definitions above, we have the following DISG construction algorithm:

Algorithm 1 (DISG Construction)
Input: vertical database D' and minsup s
Output: DISG
begin
  select the frequent items FI_i from the vertical database D';
  add all FI_i into the vertex set V in support-descending order;
  while V ≠ ∅ do
    select V_i from V;
    for each V_j ∈ V (j > i) do
    begin
      if s(V_i, V_j) ≥ s then
        add V_j to the adjacent item list of V_i;
    end;
    delete V_i from V;
  end;
end.

As a novel structure, the DISG includes all the information of the frequent patterns. At the same time, it stores only the frequent items and frequent 2-itemsets, so the size of the DISG is smaller than a tree structure.
3 The DBDG Algorithm
Now we introduce the algorithm DBDG (DFS-Based DISG), which uses a depth-first search strategy. First, select a vertex V_i from V, then select its adjacent vertex V_j with the highest support. Count s(V_i, V_j); if s(V_i, V_j) > minsup, then check the adjacent vertices of V_j. Continue the step above until the support s(V_i, V_j, …, V_m) is less than minsup or the adjacent list is empty. Then return to the vertex of the previous level and repeat these steps until every vertex has been visited as a start vertex. The following is the algorithm's code.
Algorithm: DBDG
Input: DISG
Output: FIS
Begin
  While V ≠ ∅ do
    for each V_i of V
      FIS := V_i; S(FIS) := S(V_i);
      select an unvisited V_j from V_i.adjacentlist;
      FIS := FIS ∪ V_j; S(FIS) := S(FIS, V_j);
      Call DFS(V_j);
End;

Procedure DFS(V_j)
begin
  if V_j.adjacentlist ≠ ∅ do
    select V_k with the highest support from V_j.adjacentlist;
    if S(FIS, V_k) ≥ minsup do
      S(FIS) := S(FIS, V_k);
      FIS := FIS ∪ V_k;
      Call DFS(V_k);
    else
      Output FIS;
      delete V_k from V_j.adjacentlist;
      Call DFS(V_j);
  else
    Return to its parent vertex V_i;
    Call DFS(V_i);
end;
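A Python sketch of the same depth-first traversal follows, assuming the DISG has been built as above and that itemset supports are obtained, e.g., by Tidset intersection as sketched earlier; the interfaces are illustrative, and patterns are emitted when a branch can no longer be extended:

```python
def dbdg(vertices, adjacent, itemset_support, minsup):
    """Depth-first mining over a DISG.
    vertices: frequent items in support-descending order;
    adjacent[v]: v's adjacent items, ordered by descending support;
    itemset_support(itemset): support of an itemset."""
    results = []

    def dfs(prefix):
        extended = False
        for v in adjacent.get(prefix[-1], []):
            if v not in prefix and itemset_support(prefix + [v]) >= minsup:
                dfs(prefix + [v])
                extended = True
        if not extended:
            results.append(prefix)       # dead end along this branch: output FIS

    for v in vertices:
        dfs([v])
    return results
```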
4 Experiments
To assess the performance of the algorithm DBDG, we performed an experiment on a PC with a P4 1.5 GHz processor and 256 MB of main memory. All the programs were written in Visual C++ 6.0. A real dataset, mushroom (from the UC Irvine Machine Learning Database Repository), was used in this experiment. The running time is shown in Figure 1.

Fig. 1. Computational performance (run time in seconds vs. support)
5 Conclusion
In this paper we introduced a new data structure, the DISG, and an algorithm called DBDG for mining frequent itemsets. There are several advantages of DBDG over other approaches: (1) it constructs a highly compacted DISG, which is smaller than the original database; (2) it avoids scanning the database multiple times by using a vertical database to count the support of frequent itemsets; (3) it employs a depth-first strategy and decreases the number of candidate itemsets. An experiment showed that it has good performance in mining dense datasets. In recent years, discovering maximal frequent itemsets [8, 9] or closed frequent itemsets [10] has become a new way to address the dense dataset problem. So in the future, we will focus on researching how to discover maximal frequent itemsets or closed frequent itemsets based on the DISG.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in very large databases. Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, USA, (1993) 207–216
2. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, (1994)
3. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. Proc. 1995 Int. Conf. Very Large Data Bases (VLDB'95), Zurich, Switzerland, (1995) 432–443
4. Park, J.S., Chen, M.S., Yu, P.S.: An efficient hash-based algorithm for mining association rules. Proc. 1995 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'95), San Jose, CA, (1995) 175–186
5. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic Itemset Counting and implication rules for Market Basket Data. Proceedings of the ACM SIGMOD, (1997) 255–264
6. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD00), Dallas, TX, USA, (2000) 1–12
7. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V., Crestana, V.: A Tree Projection Algorithm for Generation of Large Itemsets For Association Rules. Journal of Parallel and Distributed Computing, Special Issue on High Performance Data Mining, 61(3), (2001) 350–371
8. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: Depth first generation of long patterns. In Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, USA, (2000) 108–118
9. Bayardo, R.J.: Efficiently mining long patterns from databases. In Proceedings of ACM-SIGMOD International Conference on Management of Data, (1998) 85–93
10. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. Proc. 7th Int. Conf. Database Theory (ICDT99), Jerusalem, Israel, (1999) 398–416
A Distributed Multidimensional Data Model of Data Warehouse Youfang Lin, Houkuan Huang, and Hongsong Li School of Computer Science and Information Technology, Northern Jiaotong University, 100044, Beijing, China {lyf,hhk}@computer.njtu.edu.cn
Abstract. The base fact tables of a data warehouse are generally of huge size. In order to shorten query response time and to improve maintenance performance, a base fact table is generally partitioned into multiple instances of the same schema. To facilitate multidimensional data analysis, common methods create different multidimensional data models (MDDMs) for the different instances, which may make it difficult to realize transparent multi-instance queries and maintenance. To resolve these problems, this paper proposes a logically integrated and physically distributed multidimensional data model.
1 Introduction

The data size of a data warehouse is typically huge. Hence, a base fact table in a warehouse is generally partitioned into several partitions. To facilitate multidimensional data analysis, common solutions build different cubes for these different instances of the base fact table. To do so, we would have to carry out a similar design process repeatedly. Furthermore, for queries involving multiple instances under this kind of model, we have to write extra code in the front-end application to merge the different parts of the query results, which makes it impossible to realize transparent multi-instance queries through a data access engine. To resolve these problems, we propose, and are implementing, a distributed multidimensional data model that can organize physically distributed tables into a logically integrated model, aiming at establishing a theoretical foundation of model logic for a distributed warehouse engine and enhancing the engine to provide distributed cube management and transparent query services.
2 Logic and Instances of the Multidimensional Data Model

We classify base fact table attributes into dimension attributes, measure attributes, and other attributes. We denote a base fact table (BFT) by a tuple (FN, DAS, MAS, OAS). Given a table F and a tuple set TS of it, we call FI = (F, TS) a base fact table instance of F.
Dimensions are generally organized into hierarchical levels. In this paper, we use a model of dimension lattices to exemplify the distributed multidimensional data model.

Definition 1. A dimension schema is a partial order set 〈DN, ≤〉, where DN is a finite set of dimension levels and ≤ is a partial order defined on DN. Each dimension level d is associated with an attribute set, denoted AS(d), which includes a key k(d). We denote the domain of k(d) by dom(d). If 〈DN, ≤〉 is a lattice, we call it a dimension lattice DL. For a level di, given a dom(di), we call the tuple dIi = (di, dom(di)) a dimension level instance of di, and we call DNI = {dI1, …, dIk}, k = |DN|, a dimension level set instance. If (dj ≤ di) ∧ (dj ≠ di) ∧ ¬∃dk((dj ≤ dk) ∧ (dk ≤ di)), we denote the relation between dj and di by dj ⋖ di, i.e., dj is a direct descendant of di.
RFjk : dom(dj) → dom(dk) ∈ RFS. Taking domi(dj) as the domain of RFjk, we obtain the function values and let them be domi(dk), which is a subset of dom(dk). Then we get a new function RFi,jk : domi(dj) → domi(dk) and an instance dIik = (dk, domi(dk)) of dk. Let RFSi = RFSi ∪ {RFi,jk} and DNIi = DNIi ∪ {dIik}. If there still exist unprocessed direct descendants of dj, then repeat this step to process them; otherwise, go on to process the remaining unprocessed dimension level instances. For all pi, by executing the steps above, we can finally get m sub-instances of DLI. We call the set {DLI1, …, DLIm} a P-partition of DLI.

Therefore, based on the definitions above, given F = (FN, DAS, MAS, OAS), we say a dimension of F is a tuple (DL, da), where da ∈ DAS and DL is a dimension lattice defined on da. That is to say, a dimension lattice combined with a dimension attribute of a base fact table constitutes a dimension of the base fact table. Similarly, we define a dimension instance as DI = (DLI, da), where DLI is an instance of DL, and we say that instances derived from the same dimension are isomorphic. The merge operation of dimension instances and the partition concept of DI can be deduced analogously; we have no space to discuss them here.

In multidimensional data analysis, a basic operation is to view data in different multidimensional spaces. To identify a multidimensional space in an MDDM, we use the concept of a dimension level vector. Given n dimensions, DS = {D1, D2, …, Dn}, defined on F, take exactly one dimension level from each dimension lattice; we then get an ordered tuple DG = (dx1, dx2, …, dxn), where dxi is a dimension level of the DL of Di. We say that DG is a dimension level vector of DS. Therefore, a dimension level vector can be used to represent a multidimensional space whose points are the Cartesian product of the domains of all vector components. We denote the space by V(DG).

In a multidimensional space, we can define measures in which users are interested. We call these measure values an aggregate item, denoted f(m), where f is an aggregate function and m is a measure attribute. In our model, to maintain the measure consistency of different spaces so as to facilitate data maintenance and querying, we require that different spaces share the same aggregate items, which we denote by AIS.
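To make the notion of a multidimensional space concrete, the following Python sketch enumerates V(DG) as the Cartesian product of the key domains of the levels in a dimension level vector. The level names and domains are invented purely for illustration.

from itertools import product

# Hypothetical key domains dom(d), one level chosen per dimension.
dom = {
    "month":   ["2003-01", "2003-02"],
    "city":    ["Beijing", "Chongqing"],
    "product": ["P1", "P2"],
}

DG = ("month", "city", "product")            # a dimension level vector
V_DG = list(product(*(dom[d] for d in DG)))  # the points of V(DG)
print(len(V_DG), V_DG[0])                    # 8 ('2003-01', 'Beijing', 'P1')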
3 Distributed Multidimensional Data Model

We say that a multidimensional data model MDDM is a triple (F, DS, AIS), where F is a base fact table, DS is a dimension set defined on F, and AIS is a set of aggregate items defined on F and DS. Based on the instance definitions above, we propose the concept of an MDDM instance as MDDMI = (FI, DSI, AIS), where FI is an instance of F and DSI = (DI1, …, DIn), DIi being the instance of Di generated from the domain of Di.da of FI using a process similar to the DLI division process described in Section 2.

In our model, we also use the concept of a multidimensional view to identify the queries to an MDDM, and define a view as MDV = (MDDM, DG). If DG = (dx1, dx2, …, dxn) and AIS = {a1, a2, …, am}, the schema of MDV will be (k(dx1), k(dx2), …, k(dxn), m1, m2, …, mm), where mi (i = 1..m) is the attribute in V(DG) corresponding to ai.

In our model, we define a data cube as a tuple C = (MDDM, MDVS), where MDVS is the set of multidimensional views defined on MDDM. And given two cube views
v1 = (dx1, dx2, …, dxn), v2 = (dy1, dy2, …, dyn) in MDVS, for all i (i = 1…n), if dyi ≤ dxi, we say v2 ≤1 v1, and say that v1 is an ancestor of v2 and v2 is a descendant of v1. It can be proved that 〈MDVS, ≤1〉 is a lattice.

Given a data cube C = (MDDM, MDVS) and an instance MDDMI = (FI, DSI, AIS) of MDDM, for a cube view v ∈ MDVS, we can get the result set of v by computing FI. We call the result set of v the instance of v. Then we say that CI = (MDDMI, MDVSI) is a data cube instance of C, where MDVSI is the set of cube view instances. We say that cube instances derived from the same MDDM are isomorphic.

A data cube is generally partially materialized by implementing some materialized views in a warehouse. We denote the set of materialized views by MVS. Given a data cube C = (MDDM, MDVS), an instance CI = (MDDMI, MDVSI), and an MVS, we say that CMI = (CI, MVS) is a materialized instance of C.

Definition 2. Given an MDDM, if {MDDMI1, …, MDDMIm} is a set of m instances of the MDDM, we define a data warehouse subject as Subject = {CMI1, …, CMIm}, where CMIi is a data cube materialized instance created by a cube design process from MDDMIi.

Hence, a warehouse subject consists of multiple materialized instances of a data cube. Every instance is derived from a common multidimensional data model by a different instance design process. Therefore, we say that this kind of warehouse subject is distributed, and that the data model on which the subject is based is a distributed multidimensional data model.
4 Model Operations and Instance Sharing Mechanism

Our complete model definition also includes a complete set of model operations, such as model design operations provided for designing a model, algebraic query operations to support distributed transparent queries, and algebraic operations on MDDM instances and data cube instances to construct or destroy different kinds of instances. We also provide several levels of sharing mechanisms to share logical model objects or physical MDDM instances. We cannot discuss these topics here owing to the length limit of the paper.
An Overview of Hybrid Possibilistic Reasoning Churn-Jung Liau Institute of Information Science Academia Sinica, Taipei, Taiwan [email protected]
Abstract. The objective of this paper is to introduce the hybrid logic methodology into possibilistic reasoning. It is well known that possibilistic logic has a strong modal logic flavor. However, modal logic lacks the capability of referring to states or possible worlds, even though states are crucial to its semantics. Hybrid logic circumvents the problem by blending the classical logic mechanism into modal logic. It is a hybrid of classical logic and modal logic; however, unlike classical logic, it treats terms and formulas uniformly. We study some variants of hybrid possibilistic logic, including hybrid qualitative possibility logic, hybrid graded modal logic, and hybrid possibilistic description logic. The syntax and semantics of the logics are presented. Some possible applications of the proposed logics are also suggested.

Keywords: Possibilistic logic, Qualitative possibility logic, Hybrid logic, Graded modal logic, Description logic.
1 Introduction
Knowledge representation and reasoning is fundamental to knowledge-based systems. In practice, the acquired knowledge is rarely certain. To represent and reason about uncertain knowledge, many uncertainty reasoning methods have been proposed and extensively studied. Among them, possibilistic logic [4] is a logic of partial ignorance and/or inconsistency based on possibility theory [11]. It has been shown that possibilistic reasoning can be formulated in a kind of graded modal logic [5,7]. This kind of graded modal logic supports not only possibilistic but also similarity-based reasoning [5], so it substantially extends the application scope of possibilistic logic.

Standard modal logic (ML) is based on the so-called Kripke semantics. The basic entities of a Kripke model are possible worlds (or states) and the binary relations over them. However, in modal logic, we can only talk about the unary relations through propositional wffs and the binary ones through modal wffs. What is lacking in modal logic is thus the capability of referring to individuals or states explicitly. This means that modal logic is sometimes an inadequate representation formalism for applications. The hybrid logic paradigm is an attempt to correct the situation [1]. It is called hybrid logic because it can be seen as a hybrid of classical and modal
logic. The key idea of hybrid logic is the treatment of terms (or individuals, states, etc.) as formulas. The purpose of this paper is to introduce the hybrid methodology into possibilistic reasoning. Though it is a follow-up of [6], we will focus on the general methodology instead of further technical development. In the following, we first review the hybrid movement in ordinary modal logic and then show how the development can be mimicked in the realm of possibilistic reasoning.
2 Review: The Hybrid Logic Methodology

2.1 Basic Hybrid Logic
One of the most basic HLs is obtained by adding nominals and the satisfaction operator to ML [1]. The wffs of HL are defined by

WFF := p | a | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | [α]ϕ | ⟨α⟩ϕ | a : ϕ.

For the semantics, a Kripke frame is a pair (W, (Ri)i≥0), where W is a set of possible worlds and each Ri is a binary relation over W, and an HL model M is a triple (W, (Ri)i≥0, V), where (W, (Ri)i≥0) is the underlying Kripke frame and V is a truth assignment which assigns to each propositional symbol a subset of W and to each nominal a singleton subset of W. By somewhat abusing the notation, we also write V(a) for the possible world w if it is the unique element of V(a). The satisfaction of a wff with respect to an HL model M and a possible world w is defined as follows:
1. the classical wffs are defined as usual;
2. M, w |= [αi]ϕ iff for all (w, w′) ∈ Ri, M, w′ |= ϕ;
3. M, w |= ⟨αi⟩ϕ iff there exists (w, w′) ∈ Ri such that M, w′ |= ϕ;
4. M, w |= a iff V(a) = {w};
5. M, w |= a : ϕ iff M, V(a) |= ϕ.
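The five clauses translate directly into a finite-model checker; the following Python sketch is a minimal illustration, and the nested-tuple encoding of formulas is an assumption introduced here, not part of the logic's syntax.

def sat(model, w, phi):
    # model = (W, R, V): R maps an index i to a set of world pairs; V maps
    # each propositional symbol or nominal to a set of worlds (a singleton
    # set for nominals). Formulas are nested tuples.
    W, R, V = model
    op = phi[0]
    if op == "atom":                     # propositional symbol or nominal
        return w in V[phi[1]]
    if op == "not":
        return not sat(model, w, phi[1])
    if op == "and":
        return sat(model, w, phi[1]) and sat(model, w, phi[2])
    if op == "box":                      # [alpha_i] phi
        return all(sat(model, u, phi[2]) for (v, u) in R[phi[1]] if v == w)
    if op == "dia":                      # <alpha_i> phi
        return any(sat(model, u, phi[2]) for (v, u) in R[phi[1]] if v == w)
    if op == "at":                       # a : phi, evaluated at V(a)
        (wa,) = V[phi[1]]
        return sat(model, wa, phi[2])
    raise ValueError(op)

For example, sat(model, w, ("at", "a", ("atom", "b"))) checks the state-equality statement a : b at any world w.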
According to the semantics, the nominals are interpreted as names of the possible states. Thus the near-atomic satisfaction statement a : b asserts that the states named by a and b are identical. This makes it possible to reason about the equality of states. Similarly, a : ⟨αi⟩b means that the state named by b is an Ri-successor of the state named by a, so HL has the capability of making assertions about the relations that hold between specific states. In summary, HL brings to ML the classical concepts of identity and reference.

2.2 Description Logic
The idea of extending modal logics with the capability of reasoning about specific individuals is not new to the knowledge representation community. The idea is fully developed under the title of description logic (DL). Nowadays, there are many variants of DLs. Here, we introduce one of them, called ALC [10].
Though ALC is a rather restricted sub-language of many description logics, it possesses the most essential features of these logics. The alphabet of ALC consists of three disjoint sets, the elements of which are called concept names, role names, and individual names, respectively. The role terms of ALC are just role names, denoted by R (sometimes with subscripts), and the concept terms are formed according to the following rules:

C ::= A | ⊤ | ⊥ | C ⊓ D | C ⊔ D | ¬C | ∀R : C | ∃R : C,

where A is a metavariable for concept names, R for role terms, and C and D for concept terms. The wffs of ALC consist of terminological and assertional formulas. Their formation rules are as follows.
1. If C and D are concept terms, then C = D is a terminological formula.
2. If C is a concept term, R is a role term, and a, b are individual names, then R(a, b) and C(a) are assertional formulas.
The terminological formula C ⊓ ¬D = ⊥ is abbreviated as C ⊑ D. The Tarskian semantics for ALC is given by assigning sets to concept names and binary relations to role names. Formally, an interpretation for ALC is a pair I = (U, [| · |]), where U is a universe and [| · |] is an interpretation function which assigns to each concept name a subset of U, to each role name a subset of U × U, and to each individual name an element of U. The domain of [| · |] can be extended to all concept terms by induction:
1. [|⊤|] = U and [|⊥|] = ∅.
2. [|¬C|] = U \ [|C|], [|C ⊓ D|] = [|C|] ∩ [|D|], and [|C ⊔ D|] = [|C|] ∪ [|D|].
3. [|∀R : C|] = {x | ∀y((x, y) ∈ [|R|] ⇒ y ∈ [|C|])}.
4. [|∃R : C|] = {x | ∃y((x, y) ∈ [|R|] ∧ y ∈ [|C|])}.
An interpretation I = (U, [| · |]) satisfies a wff
C = D ⇔ [|C|] = [|D|],
R(a, b) ⇔ ([|a|], [|b|]) ∈ [|R|],
C(a) ⇔ [|a|] ∈ [|C|].
If I satisfies a wff ϕ, we write I |= ϕ. A set of wffs Σ is said to be satisfied by I, written I |= Σ, if I satisfies each wff of Σ, and Σ is satisfiable if it is satisfied by some I. A wff ϕ is an ALC-consequence of Σ, denoted Σ |=ALC ϕ or simply Σ |= ϕ, iff for all interpretations I, I |= Σ implies I |= ϕ; ϕ is ALC-valid if it is an ALC-consequence of ∅. Without the assertional formulas, a DL is nothing more than an ML if we consider each concept term in DL as a wff in ML. However, the assertional wffs C(a) and R(a, b) correspond exactly to the satisfaction statements a : ϕ and a : ⟨α⟩b in HL. Thus DL provides expressive power similar to that of HL. However, there are some subtle differences between DL and HL. While nominals in HL are treated as wffs, individual names are not treated as concepts in DL. In HL, everything is treated equally: they are all wffs.
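The inductive clauses for [| · |] also translate directly into code; the following Python sketch is an illustration, with concept terms encoded as nested tuples (an assumption of ours).

def ext(interp, c):
    # interp = (U, names): names maps concept names to subsets of U and
    # role names to sets of pairs over U.
    U, names = interp
    op = c[0]
    if op == "name": return names[c[1]]
    if op == "top":  return set(U)
    if op == "bot":  return set()
    if op == "not":  return set(U) - ext(interp, c[1])
    if op == "and":  return ext(interp, c[1]) & ext(interp, c[2])
    if op == "or":   return ext(interp, c[1]) | ext(interp, c[2])
    if op == "all":  # forall R : C
        body = ext(interp, c[2])
        return {x for x in U
                if all(y in body for (a, y) in names[c[1]] if a == x)}
    if op == "some":  # exists R : C
        body = ext(interp, c[2])
        return {x for x in U
                if any(y in body for (a, y) in names[c[1]] if a == x)}
    raise ValueError(op)

An assertional formula C(a) then holds iff the element assigned to a belongs to ext(interp, C).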
3 Review: Possibilistic Logic
Possibility theory was developed by Zadeh from fuzzy set theory [11]. Given a universe W, a possibility distribution on W is a function π : W → [0, 1]. In general, the normalization condition is required, i.e., sup_{w∈W} π(w) = 1 must hold. Obviously, π is the characteristic function of a fuzzy subset of W. Two measures on W can be derived from π. They are called the possibility and necessity measures and are denoted by Π and N, respectively. Formally, Π, N : 2^W → [0, 1] are defined as Π(S) = sup_{w∈S} π(w) and N(S) = 1 − Π(S̄), where S̄ is the complement of S with respect to W. Here, for convenience, we define sup ∅ = 0 and inf ∅ = 1.

Based on possibility theory, Dubois and Prade proposed possibilistic logic (PL) [4]. The wffs of PL have one of the forms (ϕ N c) or (ϕ Π c), where ϕ is a classical logic formula and c ∈ (0, 1]. A model for PL is a possibility distribution π : Ω → [0, 1], where Ω is the set of all classical logic interpretations. For any classical wff ϕ, |ϕ| = {ω | ω |= ϕ} is the truth set of ϕ. Then the possibility and necessity measures of a classical logic wff can be defined via its truth set, i.e., N(ϕ) = N(|ϕ|) and Π(ϕ) = Π(|ϕ|). For a PL model π, define π |= (ϕ N c) iff N(ϕ) ≥ c and π |= (ϕ Π c) iff Π(ϕ) ≥ c.

To emphasize the qualitative aspect of possibilistic reasoning, a qualitative possibility logic (QPL) was proposed in [3]. While PL reasons about the possibility and necessity degrees of wffs, QPL concerns mainly the relative comparison of possibility measures between two wffs. The syntax of QPL is an extension of the propositional language with a binary connective "≥", and its wffs are defined by

WFF := p | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | ϕ ≥ ψ.

The wff "ϕ ≥ ψ ∧ ¬(ψ ≥ ϕ)" is abbreviated as "ϕ > ψ". For the semantics, a QPL model is a triple M = (W, π, V), where W is the set of possible worlds, π : W → [0, 1] is a possibility distribution over W, and V is a truth valuation which assigns to each propositional symbol a subset of W. The satisfaction relation for the modal formulas is as follows:

M, w |= ϕ ≥ ψ iff Π(ϕ) ≥ Π(ψ),

where Π(ϕ) and Π(ψ) are defined via the truth sets of ϕ and ψ as in PL, though the truth sets in this case are subsets of possible worlds instead of classical interpretations.
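A small numerical check of the possibility and necessity measures defined above may be helpful; the distribution below is invented for illustration.

pi = {"w1": 1.0, "w2": 0.7, "w3": 0.3}        # normalized: the sup is 1
W = set(pi)

def poss(S):                                   # Pi(S) = sup of pi over S
    return max((pi[w] for w in S), default=0.0)

def nec(S):                                    # N(S) = 1 - Pi(complement)
    return 1.0 - poss(W - set(S))

S = {"w1", "w2"}
print(poss(S), nec(S))                         # 1.0 and 0.7 (= 1 - pi(w3))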
4 Hybrid Logic in Possibilistic Reasoning
We have seen that the main motivation of hybrid logic is that modal logic lacks the capability of referring to states. By using nominals, hybrid logic thus extends the expressive power of modal logic. We would like to ask whether the same extension could be made for possibilistic logic. The following are preliminary attempts to answer this question.
4.1 Hybrid Possibilistic Logic
According to the syntax of PL, only possibility- or necessity-qualified wffs are actually allowed. Thus PL is not an extension of propositional logic, since propositional logic wffs are not wffs of PL. However, to hybridize PL, we must first add the propositional part to PL; nominals can then be seen as a special kind of propositional symbols, as in basic hybrid logic. Thus, given a set of propositional symbols {p0, p1, . . .} and a set of nominals {a0, a1, . . .}, the wffs of hybrid possibilistic logic (HPL) are defined as

WFF := a | p | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | [c]ϕ | ⟨c⟩ϕ | a : ϕ,

where c is a numeral in [0, 1]. Note that we adhere to a modal notation for the necessity- and possibility-qualified wffs. This notation has been used in some graded modal formulations of possibilistic logic [5,7]. An HPL model is a triple M = (W, π, V), where W is the set of possible worlds, π : W → [0, 1] is a possibility distribution over W, and V is a truth valuation which assigns to each propositional symbol a subset of W and to each nominal a singleton subset of W. The conditions for the satisfaction of wffs are essentially the same as those for HL, except for the modal operators, which are defined as follows:
1. M, w |= [c]ϕ iff N(ϕ) ≥ c;
2. M, w |= ⟨c⟩ϕ iff Π(ϕ) ≥ c,
where N(ϕ) and Π(ϕ) are defined via the truth set of ϕ and the possibility distribution π as in the case of PL.

This semantics makes the truth values of the modal wffs in HPL independent of the worlds in which they are evaluated. In other words, M, w |= [c]ϕ for some world w iff M, w′ |= [c]ϕ for all w′ ∈ W. However, we can have a more general semantics for HPL if each world is associated with a (possibly) different possibility distribution. Indeed, this is particularly useful in similarity-based reasoning [5]. The hybrid logic for possibilistic reasoning based on this more general semantic setting has been studied in [6]. It is shown there that hybrid logic may help the development of proof methods in the graded modal formulation of possibilistic reasoning. The logic is called hybrid graded modal logic (HGML), and its syntax is also a bit more general than that of HPL. The wffs of HGML are defined as follows:

WFF := a | p | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | [c]ϕ | [c]+ϕ | ⟨c⟩ϕ | ⟨c⟩+ϕ | a : ϕ,

where c is a numeral in [0, 1]. Except for satisfaction statements, the HGML language is just that of the quantitative modal logic (QML) in [7]. The HGML language is interpreted on fuzzy hybrid models. Define an HGML model as a triple M = (W, R, V), where W is a set of possible worlds (or states), R : W × W → [0, 1] is a fuzzy binary relation on W, and V is a truth value assignment defined just as in the hybrid models. Given R, we can define
a possibility distribution πw for each w ∈ W such that πw(s) = R(w, s) for all s ∈ W. Given a model M, in addition to the clauses for Boolean connectives and satisfaction statements in HL, those for modal wffs are as follows, where Πw and Nw are the possibility and necessity measures corresponding to the possibility distribution πw:
1. M, w |= ⟨c⟩ϕ ⇔ Πw(ϕ) ≥ c,
2. M, w |= ⟨c⟩+ϕ ⇔ Πw(ϕ) > c,
3. M, w |= [c]ϕ ⇔ Nw(ϕ) ≥ c,
4. M, w |= [c]+ϕ ⇔ Nw(ϕ) > c.
Note that [c]+ and ⟨c⟩+ correspond to strict inequalities of the uncertainty measures.

4.2 Hybrid Qualitative Possibility Logic
It has been shown that the hybridization of possibilistic logic is helpful in the development of its proof methods [6]. However, for more practical applications, we can take hybrid qualitative possibility logic (HQPL) as a tool for reasoning about multi-criteria decision making. Let us first define its syntax with respect to a set of propositional symbols, a set of nominals, and a set of modality labels {≥0, ≥1, · · ·}. Its wffs are

WFF := a | p | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | ϕ ≥i ψ | a : ϕ.

For the semantics, an HQPL model is a triple M = (W, (πi)i≥0, V), where W is the set of states, each πi is a possibility distribution over W, and V is the truth valuation as in the HL models. In addition to the satisfaction clauses of HL, we have

M, w |= ϕ ≥i ψ iff Πi(ϕ) ≥ Πi(ψ),

where Πi is the possibility measure for πi, defined via the truth sets of ϕ and ψ. In practical applications, each modality label can correspond to a preference relation under a decision criterion, while the nominals are exactly the options available to the decision maker. There are in general two kinds of preference statements in multi-criteria decision-making problems. The first is the description of general preferences. This can be modelled by the QPL wff ϕ ≥i ψ, which means that some options satisfying ϕ are preferred to some options satisfying ψ according to criterion i. The second concerns the preference between specific options. This can only be modelled by wffs of the form a ≥i b, which means that option a is preferred to option b according to criterion i.
4.3 Hybrid Possibilistic Description Logic
Some work on the application of fuzzy description or modal logics to information retrieval has been done previously [9,8]. However, following the tradition of DL,
most approaches separate the terms and the formulas. Here, we show a hybrid logic approach where the objects to be retrieved and the queries are treated uniformly. Let propositional symbols and nominals be given as above and {α0, α1, · · ·} be a set of role names; then the syntax of the wffs of hybrid possibilistic description logic (HPDL) is as follows:

WFF := a | p | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⊃ ψ | a : ϕ | [c]iϕ | [c]i+ϕ | ⟨c⟩iϕ | ⟨c⟩i+ϕ | [αi]ϕ | ⟨αi⟩ϕ,

where c is still a numeral in [0, 1]. An HPDL model for the language is a 4-tuple (W, (Ri)i≥0, (Si)i≥0, V), where W and V are defined as in the HL models, each Ri is a (crisp) binary relation over W, and each Si is a fuzzy binary relation over W. In the application, it is assumed in particular that each Si is a similarity relation. A fuzzy relation S : W × W → [0, 1] is called a ⊗-similarity relation if it satisfies the following three conditions:
(i) reflexivity: S(w, w) = 1 for all w ∈ W;
(ii) symmetry: S(w, u) = S(u, w) for all w, u ∈ W;
(iii) ⊗-transitivity: S(w, u) ≥ sup_{x∈W} S(w, x) ⊗ S(x, u) for all w, u ∈ W,
where ⊗ is a t-norm¹. The semantics for the modalities based on role names is the same as that for ML, and for the similarity-based modalities we adopt the semantics of HGML, so we do not repeat them here.

In the application, the model can have an information retrieval interpretation, where W denotes the set of objects (documents, images, movies, etc.) to be retrieved and each Si is an endowed similarity relation associated with some aspect such as style, color, etc. As for the truth valuation function V and the relations Ri, they decide the extensions of each wff just like the interpretation function of DL, so each wff in HPDL also corresponds to a concept term in DL. Since nominals are just a special kind of wffs and each nominal refers to a retrievable object, the objects in the model are also denoted by wffs in the same way as queries. Let us look at an example adapted from [8] to illustrate the use of the logic.

Example 1 (Exemplar-based retrieval). In some cases, in particular for the retrieval of multimedia information, we may be given an exemplar or standard document and try to find documents very similar to the exemplar but satisfying some additional properties. In this case, we can formulate the query as ⟨c⟩a ∧ ϕ, where a is the name of the exemplar and ϕ denotes the additional desired properties. According to the semantics, b : ⟨c⟩a will be satisfied by an HPDL model (W, (Ri)i≥0, S, V) iff S(V(a), V(b)) ≥ c. Thus, a document referred to by b will meet the query if it satisfies the properties denoted by ϕ and is similar to the exemplar at least to degree c.
¹ A binary operation ⊗ : [0, 1]² → [0, 1] is a t-norm iff it is associative, commutative, and increasing in both arguments, and 1 ⊗ x = x and 0 ⊗ x = 0 for all x ∈ [0, 1].
It has been shown that the hybridization of DL in fact makes it possible to accommodate more expressive power than ALC [2]. In particular, it can express number restrictions and collections of individuals. For example, b : (book ∧ ⟨author⟩a1 ∧ ⟨author⟩a2 ∧ ¬a1 : a2) means that b is a book with at least two authors.
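Staying with Example 1, the exemplar query can be sketched in a few lines of Python; the similarity table, the objects, and their property sets are all invented for illustration.

# S: an assumed similarity relation between objects; props: the wff-like
# properties each retrievable object satisfies.
S = {("a", "b"): 0.9, ("a", "d"): 0.4}
props = {"b": {"colour_ok"}, "d": {"colour_ok"}}

def sim(x, y):
    # reflexive and symmetric closure of the table S
    return 1.0 if x == y else S.get((x, y), S.get((y, x), 0.0))

def query(exemplar, c, phi, objects):
    # objects b with b : <c>exemplar and phi, i.e. S(a, b) >= c and phi holds
    return [b for b in objects if sim(exemplar, b) >= c and phi in props[b]]

print(query("a", 0.8, "colour_ok", ["b", "d"]))   # ['b']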
5 Conclusion
We have presented some preliminary proposals on the hybridization of possibilistic logic in this paper. We studied some variants of hybrid possibilistic logic and showed that some application domains, such as multi-criteria decision making and information retrieval, can indeed benefit from the hybridization. Further work on the elaboration of the proposed logical systems is expected.
References
1. P. Blackburn. Representation, reasoning, and relational structures: a hybrid logic manifesto. Logic Journal of the IGPL, 8(3):339–365, 2000.
2. P. Blackburn and M. Tzakova. Hybridizing concept languages. Annals of Mathematics and Artificial Intelligence, 24:23–49, 1999.
3. L. Farinas del Cerro and A. Herzig. A modal analysis of possibility theory. In R. Kruse and P. Siegel, editors, Proceedings of the 1st ECSQAU, LNAI 548, pages 58–62. Springer-Verlag, 1991.
4. D. Dubois, J. Lang, and H. Prade. Possibilistic logic. In D.M. Gabbay, C.J. Hogger, and J.A. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming, Vol. 3: Nonmonotonic Reasoning and Uncertain Reasoning, pages 439–513. Clarendon Press, Oxford, 1994.
5. F. Esteva, P. Garcia, L. Godo, and R. Rodriguez. A modal account of similarity-based reasoning. International Journal of Approximate Reasoning, pages 235–260, 1997.
6. C.J. Liau. Hybrid logic for possibilistic reasoning. In Proc. of the Joint 9th IFSA World Congress and 20th NAFIPS International Conference, pages 1523–1528, 2001.
7. C.J. Liau and I.P. Lin. Quantitative modal logic and possibilistic reasoning. In B. Neumann, editor, Proceedings of the 10th ECAI, pages 43–47. John Wiley & Sons Ltd, 1992.
8. C.J. Liau and Y.Y. Yao. Information retrieval by possibilistic reasoning. In H.C. Mayr, J. Lazansky, G. Quirchmayr, and P. Vogel, editors, Proc. of the 12th International Conference on Database and Expert Systems Applications (DEXA 2001), LNCS 2113, pages 52–61. Springer-Verlag, 2001.
9. C. Meghini, F. Sebastiani, and U. Straccia. A model of multimedia information retrieval. JACM, 48(5):909–970, 2001.
10. M. Schmidt-Schauß and G. Smolka. Attributive concept descriptions with complements. Artificial Intelligence, 48(1):1–26, 1991.
11. L.A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3–28, 1978.
Critical Remarks on the Computational Complexity in Probabilistic Inference S.K.M. Wong, D. Wu, and Y.Y. Yao Department of Computer Science University of Regina Regina, Saskatchewan, Canada, S4S 0A2 {wong, danwu, yyao}@cs.uregina.ca
Abstract. In this paper, we review the historical development of using probability for managing uncertain information from the inference perspective. We discuss in particular the NP-hardness of probabilistic inference in Bayesian networks.
1 Introduction
Probability has been successfully used in AI for managing uncertainty [4]. A joint probability distribution (jpd) can be used as a knowledge base in expert systems. Probabilistic inference, namely, calculating posterior probabilities, is a major task in such knowledge-based systems. Unfortunately, probabilistic methods for coping with uncertainty fell out of favor in AI from the 1970s to the mid-1980s. There were two major problems. One is the intractability of acquiring a jpd with a large number of variables, and the other is the intractability of computing posterior probabilities for probabilistic inference. However, the probability approach managed to come back in the mid-1980s with the discovery of Bayesian networks (BNs) [4]. The purpose of introducing BNs is to solve the intractability of acquiring the jpd. A BN provides a representation of the jpd as a product of conditional probability distributions (CPDs). The structure of such a product can be characterized by a directed acyclic graph (DAG). Once the jpd is specified in this manner, one still has to design efficient methods for computing posterior probabilities. "Effective" probabilistic inference methods have been developed [2] for BNs, and they seem to be quite successful in practice. Therefore, BNs seemingly overcome the representation problems that early expert systems encountered. However, the task of computing posterior probabilities for probabilistic inference in BNs is NP-hard, as shown by Cooper [1]. This negative result has raised some concerns about the practical use of BNs.

In this paper, we review the historical development of using probability for managing uncertainty. In particular, we discuss the problem of probabilistic inference in BNs. By studying the proof in [1], we observe that the NP-hardness of probabilistic inference is due to the fact that the DAG of a BN contains a node having a large number of parents. This observation of the cause of NP-hardness may help knowledge engineers to avoid pitfalls in designing a BN.
The paper is organized as follows. We introduce some pertinent notions in Section 2. In Section 3, we recall the difficulties of early expert systems using a jpd. Probabilistic inference is discussed in Section 4. In Section 5, we analyze the cause of the NP-hardness of probabilistic inference in BNs. We conclude our paper in Section 6.
2 Background
Let R denote a finite set of discrete variables. Each variable of R is associated with a finite domain. By XY, we mean the union of X and Y, i.e., X ∪ Y, where X, Y ⊆ R. The domain of X ⊆ R, denoted DX, is the Cartesian product of the individual domains of the variables in X. Similarly, the domain of R, denoted DR, or just D if R is understood, is the Cartesian product of the individual domains of the variables in R. Each element of a domain is called a configuration of the corresponding variable(s), and we denote it by the corresponding lowercase letter, possibly with subscripts. For example, we use x to denote an element of the domain of X. Naturally, we write X = x to indicate that X takes the value x. We define a joint probability distribution (jpd) over R to be a function p on D, denoted p(R), such that 0 ≤ p(t) ≤ 1 for each configuration t ∈ D, and Σ_{t∈D} p(t) = 1. Let X ⊆ R. We define the marginal (probability distribution) of p(R), denoted p(X), as

p(X) = Σ_{R−X} p(R).

Let X, Y ⊆ R and X ∩ Y = ∅. We define the conditional probability distribution of X given Y, denoted p(X|Y), as

p(X|Y) = p(XY) / p(Y), whenever p(Y) ≠ 0.
Probabilistic inference means computing the posterior probability distribution for a set of query variables, say X, given exact values for some evidence variables, say Y , that is, to compute p(X|Y = y).
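These definitions map directly onto a dictionary-based jpd, as the following sketch shows; the fixed variable order and the toy numbers are assumptions of the illustration.

from itertools import product

R = ("A", "B")                                   # variable order; both binary
p = dict(zip(product((0, 1), repeat=2), (0.1, 0.2, 0.3, 0.4)))   # a jpd

def marginal(p, keep):
    # p(X): sum p(R) over the variables whose indices are not in 'keep'
    out = {}
    for t, pr in p.items():
        key = tuple(t[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

pB = marginal(p, keep=(R.index("B"),))           # p(B): {(0,): 0.4, (1,): 0.6}
cond = {t: pr / pB[(t[1],)] for t, pr in p.items()}   # p(A|B) = p(AB)/p(B)
print(pB, round(cond[(1, 1)], 3))                # p(A=1|B=1) = 0.4/0.6 = 0.667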
3 Probabilistic Inference Using JPD
Early expert systems tried to handle uncertainty by considering the jpd as a knowledge base and working solely with the jpd [5], without taking advantage of the conditional independencies satisfied by the jpd. Suppose a problem domain can be described by a set R = {A1, . . . , An} of discrete variables; early expert systems then used the jpd p(R) to describe the problem domain and conducted inference based on p(R) alone. This method quickly becomes intractable
when the number of variables in R becomes large. Let |R| = n and assume the domain of each variable Ai ∈ R is binary. Storing p(R) in a table requires exponential storage; that is, we need to store 2^n entries in the single table. Computing a marginal probability, for example p(Ai), requires 2^{n−1} additions, namely,

p(Ai) = Σ_{A1, ..., Ai−1, Ai+1, ..., An} p(R).
In other words, both the storage of the jpd and the computation of a marginal exhibit exponential complexity. Thus, probabilistic inference using the jpd alone soon fell out of favor in AI in the 1970s [5].
4 Probabilistic Inference Using BNs
The problems of using a jpd alone for representing and reasoning with uncertainty were soon realized, and were not solved until the late 1980s, at which time Bayesian networks were discovered as a method for representing not only a jpd, but also the conditional independency (CI) information satisfied by the jpd. A Bayesian network defined over a set R of variables consists of a DAG augmented by a set of CPDs whose product yields a jpd [4]. Consider the BN over R = {X, U1, U2, U3, U4, C1, C2, C3, Y} in Fig. 1; the structure of the DAG encodes CI information, which can be identified by d-separation [4]. Each node in the DAG corresponds to a variable in R and is associated with a conditional probability. Each node without parents is associated with a marginal; for instance, the node X in the DAG is associated with the marginal p(X). Each node with parents is associated with a CPD of the node given its parents; for instance, the node Y in the DAG is associated with the CPD p(Y|C1C2C3). The product of these CPDs defines the following jpd:

p(R) = p(X) · p(U1|X) · p(U2|X) · p(U3|X) · p(U4|X) · p(C1|U1U2U3) · p(C2|U1U2U3) · p(C3|U2U3U4) · p(Y|C1C2C3).   (1)
The factorization in the above equation indicates that instead of storing the whole jpd p(R) in a single table, in which case the exponential storage problem described earlier would occur, one can store each individual CPD instead. Since storing each CPD requires significantly less space than storing the entire jpd, the BN approach seemingly solves the problem of storage intractability experienced by early expert systems. More encouragingly, "effective" algorithms have been developed for probabilistic inference without the need to recover the entire jpd defined by equation (1). These methods are referred to as the local propagation method and its variants [3]. The local propagation method first moralizes and then triangulates the DAG; a junction tree is then constructed, on which the inference is performed [3].
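A quick count of table entries for the BN of equation (1) illustrates the saving; the assumption of binary variables is ours, made only to keep the arithmetic simple.

# Number of parents of each node, read off equation (1); with binary
# variables, a CPD for a node with k parents has 2 ** (k + 1) entries.
parents = {"X": 0, "U1": 1, "U2": 1, "U3": 1, "U4": 1,
           "C1": 3, "C2": 3, "C3": 3, "Y": 3}

cpd_entries = sum(2 ** (k + 1) for k in parents.values())
jpd_entries = 2 ** len(parents)          # one table over all nine variables
print(cpd_entries, jpd_entries)          # 82 versus 512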
Fig. 1. A BN defined over R = {X, U1 , U2 , U3 , U4 , C1 , C2 , C3 , Y }.
5 NP-Hardness of Probabilistic Inference
The introduction of BNs seems to have completely solved the problems encountered by early expert systems. By utilizing the CI information, individual CPD tables are stored instead of the whole jpd; therefore, the storage intractability problem is "solved". By applying the local propagation method, posterior probabilities can be computed effectively and efficiently without resorting to the whole jpd; therefore, the computational intractability problem is "solved" as well. However, Cooper [1] showed that the task of probabilistic inference in BNs is NP-hard. This result indicates that, in general, there does not exist an algorithm with polynomial time complexity for inference in BNs. In the following, we analyze the cause of the NP-hardness by studying the proof in [1].

Cooper [1] proved that the decision-problem version of probabilistic inference is NP-complete by transforming a well-known NP-complete problem, the 3SAT problem [1], into the decision problem of probabilistic inference in BNs. For example, consider an instance of 3SAT in which U = {U1, U2, U3, U4} is a set of propositions and C = {U1 ∨ U2 ∨ U3, ¬U1 ∨ ¬U2 ∨ U3, U2 ∨ ¬U3 ∨ U4} is a set of clauses. The objective of a 3SAT problem is to find a satisfying truth assignment for the propositions in U so that every clause in C evaluates to true simultaneously. We can transform this 3SAT problem into a BN decision problem: "Is p(Y|X) > 0?" The corresponding BN is shown in Fig. 1.

The 3SAT example given in the preceding paragraph involves only 4 propositions in U and 3 clauses in C. Suppose in general the number of propositions in U is n and the number of clauses in C is m. The corresponding BN, according to the construction in [1], is depicted in Fig. 2. Note that the BN in Fig. 1 and the BN in Fig. 2 have the same structure.
Fig. 2. A BN defined over R = {X, U1 , . . . , Un , C1 , . . . , Cm , Y }.
The jpd defined by the DAG in Fig. 2 is:

p(R) = p(X) · p(U1|X) · . . . · p(Ui|X) · . . . · p(Un|X) · p(C1|Ui1Ui2Ui3) · . . . · p(Cm|Uj1Uj2Uj3) · p(Y|C1 . . . Cm).   (2)
One may immediately note that in equation (2), the CPD p(Y|C1 . . . Cm) indicates that Y has m parents, i.e., C1, . . . , Cm, while all the other CPDs involve at most 3 parents. Therefore, the storage requirement for the CPD p(Y|C1 . . . Cm) is exponential in m, the number of parents of Y. In other words, the representation of this CPD table is intractable, which renders the representation of the BN intractable. On the other hand, if one wants to compute p(Y), one needs to compute p(C1 . . . Cm) regardless of the algorithm used, i.e.,

p(Y) = Σ_{C1...Cm} p(Y, C1 . . . Cm) = Σ_{C1...Cm} p(C1 . . . Cm) · p(Y|C1 . . . Cm).
The number of additions and multiplications for computing p(Y) is exponential with respect to m, i.e., the number of parents of Y. The above analysis demonstrates that the exponential storage and exponential computation are due to the existence of the CPD p(Y|C1 . . . Cm). In designing a BN, there are several existing techniques, such as noisy-or and divorcing, that can be used to remedy this situation. The noisy-or technique [2] allows the designer to specify p(Y|C1), . . ., p(Y|Cm) individually and combine them into p(Y|C1 . . . Cm) under certain assumptions. The divorcing technique [2], on the other hand, partitions the parent set {C1, . . . , Cm} into m′ < m groups and introduces m′ intermediate variables. Each intermediate
variable is the parent of Y . More recently, the notion of contextual weak independency [6] was proposed to further explore possible decomposition of a large CPD table into smaller ones.
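A minimal sketch of the noisy-or combination shows how m single-cause parameters replace the exponential table; the inhibition probabilities below are invented for illustration, and the leak term is omitted.

def noisy_or(q, c):
    # p(Y=1 | C1..Cm = c) under the noisy-or assumption, where q[i] is the
    # probability that an active cause Ci alone fails to trigger Y.
    p_off = 1.0
    for qi, ci in zip(q, c):
        if ci:
            p_off *= qi                  # independent inhibitors multiply
    return 1.0 - p_off

q = [0.2, 0.1, 0.4]                      # illustrative inhibition probabilities
print(noisy_or(q, [1, 1, 0]))            # 1 - 0.2 * 0.1 = 0.98

Only m parameters are stored here, instead of the 2^m entries of the full CPD p(Y|C1 . . . Cm).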
6 Conclusion
In this paper, we have reviewed the problem of probabilistic inference in BNs. We point out that the NP-hardness is due to a particular structure of the DAG, namely, the existence of a node with a large number of parent nodes.
References
[1] G.F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.
[2] F.V. Jensen. An Introduction to Bayesian Networks. UCL Press, 1996.
[3] S.L. Lauritzen, T.P. Speed, and K.G. Olesen. Decomposable graphs and hypergraphs. Journal of Australian Mathematical Society, 36:12–29, 1984.
[4] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco, California, 1988.
[5] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, New Jersey, 1995.
[6] S.K.M. Wong and C.J. Butz. Contextual weak independence in Bayesian networks. In Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 670–679. Morgan Kaufmann Publishers, 1999.
Critical Remarks on the Maximal Prime Decomposition of Bayesian Networks Cory J. Butz, Qiang Hu, and Xue Dong Yang Department of Computer Science, University of Regina, Regina, Canada, S4S 0A2, {butz,huqiang,yang}@cs.uregina.ca
Abstract. We present a critical analysis of the maximal prime decomposition of Bayesian networks (BNs). Our analysis suggests that it may be more useful to transform a BN into a hierarchical Markov network.
1 Introduction
Very recently, it was suggested that a Bayesian network (BN) [3] be represented by its unique maximal prime decomposition (MPD) [1]. An MPD is a hierarchical structure. The root network is a jointree [3]. Each node in the jointree has a local network, namely, an undirected graph called a maximal prime subgraph. This hierarchical structure is claimed to facilitate probabilistic inference by locally representing independencies in the maximal prime subgraphs.

In this paper, we present a critical analysis of the MPD representation of BNs. Although the class of parent-set independencies is always contained within the nodes of the root jointree, we establish in Theorem 2 that this class is never represented in the maximal prime subgraphs (see Example 3). Furthermore, in Example 4, we show that there can be an independence in a BN defined precisely on the same set of variables as a node in the root jointree, yet this independence is not represented in the local maximal prime subgraph. Conversely, we explicitly demonstrate in Example 5 that there can be an independence holding in the local maximal prime subgraph, yet this independence cannot be realized using the probability tables assigned to the corresponding node in the root jointree.

This paper is organized as follows. Section 2 reviews the maximal prime decomposition of BNs. We present a critique of the MPD representation in Section 3. The conclusion is presented in Section 4.
2 Maximal Prime Decomposition of Bayesian Networks
Let X, Y, Z be pairwise disjoint subsets of U. The conditional independence [3] of Y and Z given X is denoted I(Y, X, Z). The conditional independencies encoded in the Bayesian network (BN) [3] in Fig. 1 on U = {a, b, c, d, e, f, g, h, i, j, k} indicate that the joint probability distribution can be written as p(U) = p(a) · p(b) · p(c|a) · p(d|b) · p(e|b) · p(f|d, e) · p(g|b) · p(h|c, f) · p(i|g) · p(j|g, h, i) · p(k|h). Olesen and Madsen [1] proposed that a given BN be represented by its unique maximal prime decomposition (MPD). An MPD is a hierarchical structure. The
Fig. 1. A Bayesian network D encoding independencies on the set U of variables.
root network is a jointree [3]. Each node in the jointree has a local network, namely, an undirected graph called a maximal prime subgraph.

Example 1. Given the BN D in Fig. 1, the MPD representation is shown in Fig. 2. Each of the five nodes in the root jointree has an associated maximal prime subgraph, as denoted with an arrow. The root jointree encodes independencies on U, while the maximal prime subgraphs encode independencies on proper subsets of U. For example, the root jointree in Fig. 2 encodes I(a, c, bdefghijk), I(abcdefgij, h, k), I(ack, fh, bdegij), and I(abcdefk, gh, ij) on U, while I(bde, fg, h), for instance, can be inferred from the maximal prime subgraph for node bdefgh. The numerical component of the MPD is defined by assigning the conditional probability tables of the BN to the nodes of the jointree. In our example, this assignment must be as follows: φ1(ac) = p(a) · p(c|a), φ2(cfh) = p(h|c, f), φ3(bdefgh) = p(b) · p(d|b) · p(e|b) · p(f|d, e) · p(g|b), φ4(ghij) = p(i|g) · p(j|g, h, i), and φ5(hk) = p(k|h).
3 Critical Remarks on the MPD of BNs
A parent-set independency I(Y, X, Z) is one such that YXZ is the parent set [2] of a variable ai in a BN. For example, the BN D in Fig. 1 encodes the parent-set independencies I(c, ∅, f) and I(h, g, i); c and f are the parents of variable h, while g, h and i are the parents of j. The proof of the next result is omitted due to lack of space.

Theorem 2. No parent-set independency in a Bayesian network is represented in the maximal prime decomposition.
Fig. 2. The maximal prime decomposition (MPD) of the BN D in Fig. 1.
Example 3. Although the BN D in Fig. 1 indicates that variables c and f are unconditionally independent, the maximal prime decomposition of D indicates that c and f are dependent. Similar remarks hold for the parent-set independence I(h, g, i).

Example 4. I(defh, b, g) holds in the given BN. On the contrary, b does not separate {d, e, f, h} from g in the maximal prime subgraph bdefgh, as g and h are directly connected (dependent).

Example 5. I(bde, fg, h) can be inferred by separation from the maximal prime subgraph bdefgh. However, it can never be realized in the probability table φ(bdefgh), as variable h does not appear in any of the conditional probability tables p(b), p(d|b), p(e|b), p(f|d, e), p(g|b) assigned to φ(bdefgh).

In [3], Wong et al. suggested that a BN be transformed into a hierarchical Markov network (HMN). An HMN is a hierarchy of Markov networks (jointrees). The primary advantages of HMNs are that they are a unique and equivalent representation of BNs [3]. For example, given the BN D in Fig. 1, the unique and equivalent HMN is depicted in Fig. 3.

Example 6. The BN D in Fig. 1 can be transformed into the unique MPD in Fig. 2 or into the unique HMN in Fig. 3. Unlike the MPD representation,
Fig. 3. The hierarchical Markov network (HMN) for the BN D in Fig. 1.
which does not encode, for instance, the independencies I(c, ∅, f ), I(h, g, i), and I(d, b, e), these independencies are indeed encoded in the appropriate nested jointree of the HMN.
4 Conclusion
The maximal prime decomposition (MPD) [1] of Bayesian networks (BNs) is a very limited hierarchical representation as it always consists of precisely two levels. Moreover, the MPD is undesirable since it is not a faithful representation of BNs. On the contrary, it has been previously suggested that BNs be transformed into hierarchical Markov networks (HMNs) [3]. The primary advantages of HMNs are that they are a unique and equivalent representation of BNs [3]. These observations suggest that compared with the MPD of BNs, HMNs seem to be a more desirable representation.
References
1. Olesen, K.G., Madsen, A.L.: Maximal prime subgraph decomposition of Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, B, 32(1):21–31, 2002.
2. Wong, S.K.M., Butz, C.J., Wu, D.: On the implication problem for probabilistic conditional independency. IEEE Transactions on Systems, Man, and Cybernetics, A, 30(6):785–805, 2000.
3. Wong, S.K.M., Butz, C.J., Wu, D.: On undirected representations of Bayesian networks. ACM SIGIR Workshop on Mathematical/Formal Models in Information Retrieval, 52–59, 2001.
A Non-local Coarsening Result in Granular Probabilistic Networks Cory J. Butz, Hong Yao, and Howard J. Hamilton Department of Computer Science, University of Regina, Regina, Canada, S4S 0A2, {butz,yao2hong,hamilton}@cs.uregina.ca
Abstract. In our earlier works, we coined the phrase granular probabilistic reasoning and showed a local coarsening result. In this paper, we present a non-local method for coarsening variables (i.e., the variables are spread throughout the network) and establish its correctness.
1 Introduction
In [3], we coined the phrase granular probabilistic reasoning to mean the ability to coarsen and refine parts of a probabilistic network depending on whether they are of interest or not. Granular probabilistic reasoning is of importance as it not only leads to more efficient probabilistic inference, but it also facilitates the design of large probabilistic networks [4]. It is then not surprising that Xiang [4] explicitly states that our granular probabilistic reasoning [3] demands further attention. We proposed two operators called nest and unnest for coarsening and refining variables in a network, respectively [3]. In [1], we showed that the nest operator can be applied locally to a marginal distribution with the same effect as if it were applied directly to the joint distribution. However, no study has ever addressed how to coarsen variables spread throughout a network. In this paper, we present a method, called Non-local Nest, for coarsening non-local variables, that is, variables spread throughout a network. This method gathers all variables to be coarsened into one marginal distribution, and then applies the nest operator. We also prove our method is correct. This paper is organized as follows. Section 2 reviews a local nest method. We present a non-local method for nesting variables in Section 3. The conclusion is presented in Section 4.
2 A Local Method for Nesting
Consider the joint distribution p(R) represented as a probabilistic relation r(R) in Fig. 1, where R = {A, B, C, D, E, F} = ABCDEF is a set of variables. Configurations with zero probability are not shown. The nest operator φ is used to coarsen a relation r(XY). Intuitively, φA=Y(r) groups together all the Y-values into a nested distribution for the coarse variable A given the same X-value.
More formally,

φA=Y(r) = {t | t(X) = u(X), t(A) = {u(Y p(R))}, t(p(XA)) = Σ_u u(p(R)), and u ∈ r}.
Attribute p(R) in the A-value is relabeled p(Y) and the values are normalized.

r(R) =
A B C D E F  p(R)
0 0 0 0 0 0  0.05
0 0 0 0 1 0  0.05
0 0 1 0 1 1  0.20
0 1 0 1 0 2  0.15
0 1 0 1 1 2  0.15
1 0 0 2 2 0  0.40
Fig. 1. A probabilistic relation r(R) representing a joint distribution p(R).
Example 1. Recall the relation r(ABCDEF) in Fig. 1. Nesting the variables Y = DE as the single variable G gives the nested relation φG={D,E}(r) in Fig. 2. For instance, given the fixed X-value A : 0, B : 1, C : 0, F : 2, the Y-values D : 1, E : 0, p(R) : 0.15 and D : 1, E : 1, p(R) : 0.15 are grouped into a nested distribution. Here the attribute p(R) is relabeled as p(DE), and the probability values 0.15 and 0.15 are normalized as 0.50 and 0.50.

In practice, a joint distribution r(R) is represented as a Markov network (MN) [2]. The dependency structure of an MN is an acyclic hypergraph (a jointree) [2]. The acyclic hypergraph encodes conditional independencies [2] satisfied by r(R). For example, the joint distribution p(R) in Fig. 1 can be expressed as the MN:

p(R) = [p(ABD) · p(ABC) · p(ACE) · p(BCF)] / [p(AB) · p(AC) · p(BC)],   (1)
where R = {R1 = {A, B, D}, R2 = {A, B, C}, R3 = {A, C, E}, R4 = {B, C, F}} is an acyclic hypergraph, and the marginal distributions r1(ABD), r2(ABC), r3(ACE), and r4(BCF) of r(R) are shown in Fig. 3. In our probabilistic relational model [2], the MN in Eq. (1) is expressed as r(R) = ((r1(ABD) ⊗ r2(ABC)) ⊗ r3(ACE)) ⊗ r4(BCF), where the Markov join operator ⊗ means

r(XY) ⊗ r(YZ) = p(XY) · p(YZ) / p(Y).
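Both operators can be sketched compactly in Python over dictionary-based relations; the encoding of tuples as frozensets of (attribute, value) pairs is an assumption of this illustration, not the paper's relational model.

from collections import defaultdict

def markov_join(r1, r2, shared):
    # r(XY) ⊗ r(YZ) = p(XY) * p(YZ) / p(Y); a relation maps a frozenset
    # of (attribute, value) pairs to a probability.
    pY = defaultdict(float)                    # p(Y), marginalized from r2
    for t, pr in r2.items():
        pY[frozenset((a, v) for a, v in t if a in shared)] += pr
    out = {}
    for t1, p1 in r1.items():
        y1 = frozenset((a, v) for a, v in t1 if a in shared)
        for t2, p2 in r2.items():
            y2 = frozenset((a, v) for a, v in t2 if a in shared)
            if y1 == y2 and pY[y1] > 0:
                out[t1 | t2] = p1 * p2 / pY[y1]
    return out

def nest(r, Y):
    # phi_{A=Y}(r): group the Y-values under each X-value, normalize the
    # nested distribution, and attach the total probability p(XA).
    groups = defaultdict(lambda: defaultdict(float))
    for t, pr in r.items():
        x = frozenset((a, v) for a, v in t if a not in Y)
        groups[x][frozenset((a, v) for a, v in t if a in Y)] += pr
    return {x: ({y: pr / sum(ys.values()) for y, pr in ys.items()},
                sum(ys.values()))
            for x, ys in groups.items()}

Applying nest with Y = {D, E} to the relation of Fig. 1 reproduces the grouping and normalization shown in Fig. 2.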
φG={D,E}(r) =
A B C  G: (D, E, p(DE))             F  p(ABCGF)
0 0 0  (0, 0, 0.5), (0, 1, 0.5)     0  0.1
0 0 1  (0, 1, 1.0)                  1  0.2
0 1 0  (1, 0, 0.5), (1, 1, 0.5)     2  0.3
1 0 0  (2, 2, 1.0)                  0  0.4
Fig. 2. The nested relation φG={D,E} (r), where r is the relation in Fig. 1.
r1(ABD):
A B D  p(R1)
0 0 0  0.3
0 1 1  0.3
1 0 2  0.4

r2(ABC):
A B C  p(R2)
0 0 0  0.1
0 0 1  0.2
0 1 0  0.3
1 0 0  0.4

r3(ACE):
A C E  p(R3)
0 0 0  0.2
0 0 1  0.2
0 1 1  0.2
1 0 2  0.4

r4(BCF):
B C F  p(R4)
0 0 0  0.5
0 1 1  0.2
1 0 2  0.3
Fig. 3. The marginals r1 (ABD), r2 (ABC), r3 (ACE), and r4 (BCF ) of relation r.
We may omit the parentheses for simplified notation. For example, the Markov join r1 (ABD) ⊗ r2 (ABC) ⊗ r3 (ACE) is shown in Fig. 4. The main result in [1] was that the nest operator can be applied locally to one marginal distribution in a Markov network with the same effect as if applied directly to the joint distribution itself. For example, φG={E} (r) is the same nested distribution as r1 (ABD) ⊗ r2 (ABC) ⊗ φG={E} (r3 (ACE)) ⊗ r4 (BCF ). We next study how to coarsen variables spread throughout a network.
3 A Non-local Method for Nesting
We call Y ⊆ R a nestable set with respect to an MN on R = {R1, . . . , Rn} if Y does not intersect any separating set [2] of R. Since the nest operator is unary, the first task is to combine the nestable set Y of variables into a single table. The well-known relational database selective reduction algorithm (SRA) is applied for this purpose. The nest operator can then be applied to coarsen Y as attribute A. We now present the formal algorithm Non-local Nest (NLN) to coarsen a nestable set Y as attribute A in a given MN on an acyclic hypergraph R.
r1(ABD) ⊗ r2(ABC) ⊗ r3(ACE) =
A B C D E  p(ABCDE)
0 0 0 0 0  0.05
0 0 0 0 1  0.05
0 0 1 0 1  0.20
0 1 0 1 0  0.15
0 1 0 1 1  0.15
1 0 0 2 2  0.40
Fig. 4. The Markov join r1 (ABD) ⊗ r2 (ABC) ⊗ r3 (ACE).
Algorithm 1 NLN(Y, A, R)
1. Let Rj, . . . , Rn be those elements of R not deleted by the call SRA(Y, R).
2. Return φA=Y(rjn), where rjn = rj(Rj) ⊗ . . . ⊗ rn(Rn).

Theorem 1. Let r(R) = r1(R1) ⊗ . . . ⊗ ri(Ri) ⊗ rj(Rj) ⊗ . . . ⊗ rn(Rn) be represented as an MN, and let Y be a nestable set. Then φA=Y(r) is the same as r1(R1) ⊗ . . . ⊗ ri(Ri) ⊗ r′, where r′ is the nested relation returned by the call NLN(Y, A, {R1, R2, . . . , Rn}).

Example 2. DE is a nestable set with respect to R = {R1, R2, R3, R4} as used in our running example. Suppose we wish to compute NLN(DE, G, R). Then SRA(DE, R) = {R1, R2, R3}. Again, the Markov join r1 ⊗ r2 ⊗ r3 is shown in Fig. 4. The reader can verify that the nested relation φG={D,E}(r) in Fig. 2 is the same as r4(BCF) ⊗ φG={D,E}(r1(ABD) ⊗ r2(ABC) ⊗ r3(ACE)).
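Given primitives like those sketched earlier, Algorithm 1 itself is only a few lines; SRA is left abstract below (any selective reduction implementation could be plugged in), so this is a structural sketch rather than a full implementation.

def nln(Y, hyperedges, relations, sra, markov_join, nest):
    # Non-local Nest: join the marginals of the hyperedges kept by
    # SRA(Y, R), then nest Y as the single coarse attribute.
    kept = sra(Y, hyperedges)            # hyperedges not deleted by SRA
    attrs = set(kept[0])
    r = relations[kept[0]]
    for e in kept[1:]:                   # r_j ⊗ ... ⊗ r_n
        r = markov_join(r, relations[e], attrs & set(e))
        attrs |= set(e)
    return nest(r, Y)                    # step 2 of Algorithm 1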
4 Conclusion
Xiang [4] explicitly states that more work needs to be done on our granular probabilistic networks [3]. In this paper, we have extended the work in [1] by coarsening a non-local set of variables (i.e., variables spread throughout a network). Theorem 1 establishes the correctness of our approach.
References
1. Butz, C.J., Wong, S.K.M.: A local nest property in granular probabilistic networks. Proc. of the Fifth Joint Conf. on Information Sciences, 1 (2000) 158–161.
2. Wong, S.K.M., Butz, C.J.: Constructing the dependency structure of a multi-agent probabilistic network. IEEE Trans. Knowl. Data Eng. 13(3) (2001) 395–415.
3. Wong, S.K.M., Butz, C.J.: Contextual weak independence in Bayesian networks. Proc. of the Fifteenth Conf. on Uncertainty in Artificial Intelligence (1999) 670–679.
4. Xiang, Y.: Probabilistic Reasoning in Multiagent Systems: A Graphical Models Approach. Cambridge University Press (2002).
Probabilistic Inference on Three-Valued Logic

Guilin Qi

Mathematics Department, Yichun College, Yichun, Jiangxi Province, 336000
[email protected]
Abstract. In this paper, we extend Nilsson's probabilistic logic [1] so that each sentence S has three sets of possible worlds. We adopt the ideas of consistency and possible worlds introduced by Nilsson in [1], but propose a new method, called the linear equation method, to deal with the problems of probabilistic inference; the results of our method are consistent with those of Yao's interval-set model method.
1 Introduction
Nilsson in [1] presented a method to combine logic with probability theory. In his probabilistic logic, the problems of probabilistic entailment were solved by a geometric analysis method. Later, Yao in [3] gave a modified modus ponens rule which may be used in incidence calculus, and then carried out probabilistic inference based on modus ponens. Furthermore, Yao discussed the relationship between three-valued logic and the interval-set model, and gave a modus ponens rule for the interval-set model. He thereby obtained an extension of the previous probabilistic inference. Naturally, we may wonder whether there exists a corresponding extension of Nilsson's probabilistic logic. This problem is discussed here.
2 Probabilistic Inference on Three-Valued Logic
As in probabilistic logic, we will relate each sentence S to some sets of possible worlds, but this time the number of sets of possible worlds is three: two of them, say W1 and W2, contain worlds in which S is true and false respectively, and the third, say W3, contains worlds in which S is neither true nor false. Clearly, the consistent sets of truth values for the sentences φ, φ→ψ, ψ are given by the columns in the following table:

    φ      T  T  T  I  I  I  F  F  F
    φ→ψ    T  I  F  T  I  I  T  T  T
    ψ      T  I  F  T  I  F  T  I  F

where T denotes the truth value "true", F denotes the truth value "false", and I denotes the truth value "unknown".
Since the sentence S possesses a third truth value, which is different from T and F, we can extend the probability of S to a probability interval [P_*(S), P^*(S)], where P_*(S) is taken to be the sum of the probabilities of all the sets of worlds in which S is true, while P^*(S) is taken to be the sum of the probabilities of all the sets of worlds in which S is either true or unknown. Next, we define a matrix equation

    Π = V P,   (1)
where Π, V, P are defined as follows. Suppose there are K sets of possible worlds for our L sentences in B (B is a set of sentences); then P is the K-dimensional column vector that represents the probabilities of the sets of possible worlds. V = (V1, V2, ..., VK), where each Vi has components equal to 0, 1, or 1/2. The component vji = 1 if Sj has the value true in the worlds in Wi, vji = 0 if Sj has the value false in the worlds in Wi, and vji = 1/2 if Sj has the value unknown in the worlds in Wi; here "1/2" is not just a number but a symbol representing the unknown state. Π is an L-dimensional column vector whose component πi denotes the "probability" of each sentence Si in B. We obtain the best possible bounds for the probability interval [P_*(S), P^*(S)] by the following steps. First we define two index sets (Ii)_* and (Ii)^* as

    (Ii)_* = {j | vij = 1},    (Ii)^* = {j | vij = 1 ∨ vij = 1/2}.
Since πi = Σ_{j=1}^{K} vij pj, we have

    P_*(Si) = Σ_{j∈(Ii)_*} pj,    P^*(Si) = Σ_{j∈(Ii)^*} pj.   (2)
The rule of modus ponens allows us to infer ψ from φ and φ→ψ; but if the probability intervals of φ and φ→ψ are given, what can we infer for [P_*(ψ), P^*(ψ)]? To solve these kinds of problems, let us consider the three sentences φ, φ→ψ, ψ. The consistent truth-value assignments are given by the columns of the matrix V as follows:

        [ 1   1    1   1/2  1/2  1/2  0   0    0 ]
    V = [ 1   1/2  0   1    1/2  1/2  1   1    1 ]
        [ 1   1/2  0   1    1/2  0    1   1/2  0 ]

The first row of the matrix gives truth values for φ in the nine sets of possible worlds. The second row gives truth values for φ→ψ, and the third row gives truth values for ψ. Probability values for these sentences are constrained by the matrix equation (1), and by the rules of probability, Σi pi = 1 and 0 ≤ pi ≤ 1 for all i. Now suppose we are given the probability intervals of the sentences φ and φ→ψ. The probability interval of φ is denoted by [P_*(φ), P^*(φ)]; the probability interval of φ→ψ is denoted by [P_*(φ→ψ), P^*(φ→ψ)]. By the definition of P_*(S) and P^*(S) (where S is an arbitrary sentence) and equation (2) we have

    P_*(φ) = p1 + p2 + p3,                 P^*(φ) = p1 + p2 + · · · + p6,
    P_*(φ→ψ) = p1 + p4 + p7 + p8 + p9,     P^*(φ→ψ) = p1 + p2 + p4 + · · · + p9,
    P_*(ψ) = p1 + p4 + p7,                 P^*(ψ) = p1 + p2 + p4 + p5 + p7 + p8.
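As a concrete check of these definitions, the short Python sketch below encodes the matrix V, computes the lower and upper probabilities of equation (2) for a hypothetical probability vector over the nine worlds, and verifies the bounds of Propositions 1 and 2 stated later in this section; the variable names and the sample vector p are our own.

```python
import numpy as np

# Truth values of phi, phi->psi, psi in the nine consistent worlds
# (1 = true, 0.5 = the symbolic "unknown" marker, 0 = false), as in matrix V.
V = np.array([
    [1, 1,   1, 0.5, 0.5, 0.5, 0, 0,   0],   # phi
    [1, 0.5, 0, 1,   0.5, 0.5, 1, 1,   1],   # phi -> psi
    [1, 0.5, 0, 1,   0.5, 0,   1, 0.5, 0],   # psi
])

def bounds(row, p):
    """P_* sums the worlds where the sentence is true; P^* additionally
    sums the worlds where it is unknown (equation (2))."""
    return p[row == 1].sum(), p[(row == 1) | (row == 0.5)].sum()

p = np.array([0.2, 0.1, 0.05, 0.15, 0.1, 0.05, 0.15, 0.1, 0.1])  # hypothetical
(lo_phi, up_phi), (lo_imp, up_imp), (lo_psi, up_psi) = (bounds(r, p) for r in V)

# Propositions 1 and 2 below: P^*(phi) + P_*(phi->psi) - 1 <= P_*(psi),
# and P^*(psi) <= P^*(phi->psi).
assert up_phi + lo_imp - 1 <= lo_psi + 1e-12
assert up_psi <= up_imp + 1e-12
```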
Now we will try to obtain the best possible bounds for [P_*(ψ), P^*(ψ)] using a new method called the linear equation method. First, we will find the best lower bound for P_*(ψ); here we only consider the best linear lower bound¹ of P_*(ψ) and we assume it is the best lower bound. The reason why other combinations are not taken will be discussed elsewhere. By the constraints on the pi, we know that the linear lower bounds of P_*(ψ) are l1 p1 + l2 p4 + l3 p7, 0 ≤ li ≤ 1, and the best linear lower bound must be one of them. But which one is the best? We claim that it must satisfy the following two conditions:

Linear Condition: It should be a linear combination of P_*(φ), P^*(φ), P_*(φ→ψ), P^*(φ→ψ), and 1.
Maximum Condition: It should be maximal among those satisfying the linear condition.

Proposition 1. The best lower bound for P_*(ψ) is P^*(φ) + P_*(φ→ψ) − 1, if we are given the probability intervals for φ and φ→ψ.

Proof. The linear condition requires that the lower bound l1 p1 + l2 p4 + l3 p7 satisfy the linear equation

    l1 p1 + l2 p4 + l3 p7 = x1 P_*(φ) + x2 P^*(φ) + x3 P_*(φ→ψ) + x4 P^*(φ→ψ) + y,   (3)

where the variables x1, x2, x3, x4 are the coefficients of P_*(φ), P^*(φ), P_*(φ→ψ), P^*(φ→ψ) respectively and y is the coefficient of 1. By the discussion above and Σi pi = 1, we know equation (3) is equivalent to

    l1 p1 + l2 p4 + l3 p7 = a1 p1 + a2 p2 + ... + a9 p9,   (4)
where a1 = x1 + x2 + x3 + x4 + y, a2 = x1 + x2 + x4 + y, a3 = x1 + x2 + y, a4 = x2 + x3 + x4 + y, a5 = x2 + x4 + y, a6 = x2 + x4 + y, a7 = x3 + x4 + y, a8 = x3 + x4 + y, a9 = x3 + x4 + y. Before continuing with our discussion of solving equation (4), let us consider a lemma:

Lemma 1. Suppose p1, p2, ..., p9 are the probabilities of the nine consistent possible worlds, satisfying Σi pi = 1 and 0 ≤ pi ≤ 1; then p1, ..., p9 are linearly independent.

Proof. The proof of Lemma 1 is clear.

By Lemma 1 we know equation (4) is equivalent to the following system of equations:

    x1 + x2 + x3 + x4 + y = l1,
    x1 + x2 + x4 + y = x1 + x2 + y = x2 + x4 + y = x3 + x4 + y = 0,
    x2 + x3 + x4 + y = l2,
    x3 + x4 + y = l3.   (5)

It is clear that the solution of the system of equations (5) is

    x1 = 0, x2 = x3 = l1 = l2 = −y, x4 = 0, l3 = 0;
¹ A lower bound is linear if it can be expressed as a linear combination of the pi, i = 1, 2, ..., 9, that is, if it is equal to x1 p1 + ... + x9 p9 for some coefficients xi.
Next, by the maximum condition we must have l1 = l2 = 1; therefore the best possible solution of equation (3) is x1 = 0, x2 = x3 = 1, x4 = 0, y = −1. In this way, we know that p1 + p4 = P^*(φ) + P_*(φ→ψ) − 1 is the best possible lower bound of P_*(ψ).

Next, we will decide the best upper bound for P^*(ψ). As for the best lower bound, here we only consider the linear upper bounds. The linear upper bounds of P^*(ψ) are l1 p1 + l2 p2 + ... + l9 p9, where li ≥ 1 for i = 1, 2, 4, 5, 7, 8, while li ≥ 0 for the other i. The best upper bound must be one of them. We claim that this best upper bound should satisfy the linear condition and a minimum condition, that is, it should be minimal among those satisfying the linear condition. For the best upper bound, we have the following proposition:

Proposition 2. The best upper bound for P^*(ψ) is P^*(φ→ψ), if we are given the probability intervals for φ and φ→ψ.

Proof. The proof of Proposition 2 is similar to that of Proposition 1.

Thus, we have obtained the best possible interval including [P_*(ψ), P^*(ψ)]: it is the interval [P^*(φ) + P_*(φ→ψ) − 1, P^*(φ→ψ)], which gives the bounds obtained by Yao [3].
References
1. Nilsson, N.J.: Probabilistic Logic. Artificial Intelligence 28 (1986) 71–87.
2. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976).
3. Yao, Y.Y., Li, X.: Comparison of Rough-set and Interval-set Models for Uncertainty Reasoning. Fundamenta Informaticae 27 (1996) 289–298.
4. Rescher, N.: Many-valued Logic. McGraw-Hill, New York (1969).
5. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11 (1982) 341–356.
Multi-dimensional Observer-Centred Qualitative Spatial-temporal Reasoning*

Yi-nan Lu¹, Sheng-sheng Wang¹, and Sheng-xian Sha²

¹ Jilin University, Changchun, China 130012
² Changchun Institute of Technology, Changchun, China 130012
Abstract. The multi-dimensional spatial occlusion relation (MSO) is an observer-centred model which can express the relation between the images of two bodies x and y from a viewpoint v. We study the basic reasoning methods of MSO, then extend MSO to the spatial-temporal field by adding a time feature to it, and give the related definitions.
1 Introduction

Most spatial representation models have used dyadic relations, which are not observer-centred. But observer-centred relations are really useful in the physical world. The spatial occlusion relation is an observer-centred spatial representation which studies the qualitative relation of two objects from a viewpoint. It is mainly used in computer vision and intelligent robotics. In recent years, with the development of spatial reasoning research, occlusion has also been widely investigated in Qualitative Spatial Reasoning (QSR) [1][2]. There are two important spatial occlusion models in QSR. The first one is Lines of Sight (LOS), proposed by Galton in 1994; LOS includes 14 relations of convex bodies [3]. The second is Randell's ROC-20 [4], which can handle both convex and concave objects. These two models are both based on RCC (Region Connection Calculus) theory, which expresses the topology of regions. Most acquired spatial data is abstract data, such as points standing for cities, so the dimensions of spatial data vary, and multi-dimensional data processing has become more and more important in the spatial information field. Since RCC requires the objects to be of the same dimension [6][7], LOS and ROC are not suitable for multi-dimensional objects. To deal with this, we extended RCC to MRCC, which can express multi-dimensional topology, and based on MRCC we proposed a multi-dimensional spatial occlusion relation, MSO, in 2002 [8].
2 MRCC and MSO

RCC-8, the boundary-sensitive RCC model, is widely used to express spatial relations in GIS, CAD, and other spatial information software. RCC-8 has eight JEPD basic
* This paper was supported by the Technological Development Plan of Jilin Province (Grant No. 20010588).
relations [7]. Multi-dimensional RCC (MRCC) is an extended RCC-8 model for multi-dimensional objects. The function DIM(x) indicates the dimension of an object x; its value is 0, 1, or 2 for point, line, and area objects, which we call 0-D, 1-D, and 2-D objects respectively. The MRCC relation of x and y is defined by a triple:

    (DIM(x), DIM(y), ψ), where ψ ∈ RCC8.

The available relations for ψ depend on the combination of DIM(x) and DIM(y). Considering all possible combinations of DIM(x) and DIM(y), only 38 MRCC basic relations are available. That x occludes y from viewpoint v is formally defined as:

    Occlude(x,y) ≡def ∃a∈i(x) ∃b∈i(y) [Line(v,a,b)]

The function Line(a,b,c) means that the points a, b, c fall on a straight line with b strictly between a and c. The JEPD occlusion relation set OP = {NO, O, OB, MO} is defined from Occlude(x,y):

    NO(x,y) = ¬Occlude(x,y) ∧ ¬Occlude(y,x)
    O(x,y)  = Occlude(x,y) ∧ ¬Occlude(y,x)
    OB(x,y) = ¬Occlude(x,y) ∧ Occlude(y,x)
    MO(x,y) = Occlude(x,y) ∧ Occlude(y,x)

The definition of the Multi-dimensional Spatial Occlusion (MSO) relation is:

    mso(x,y,v) ≡def (DIM(img(x,v)), DIM(img(y,v)), Ψ, φ),
    where (DIM(img(x,v)), DIM(img(y,v)), Ψ) ∈ MRCC and φ ∈ OP.

There are 79 reasonable MSO relations. Detailed information on MRCC and MSO is in [8].
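As a small illustration of how the four JEPD occlusion relations partition the possibilities, the following Python sketch (our own, not from [8]) classifies a pair of objects once the two Occlude facts are known:

```python
def op_relation(x_occludes_y: bool, y_occludes_x: bool) -> str:
    """Map the two Occlude facts onto the JEPD set OP = {NO, O, OB, MO}."""
    return {(False, False): 'NO',   # neither body occludes the other
            (True,  False): 'O',    # x occludes y only
            (False, True):  'OB',   # x is occluded by y
            (True,  True):  'MO'}[(x_occludes_y, y_occludes_x)]

# Example: if x occludes y but not vice versa, the relation is O(x, y).
assert op_relation(True, False) == 'O'
```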
3 Conceptual Neighborhood Graph

The predicate nbr(R1,R2) means that R1 and R2 are conceptual neighbors.

Definition 1: Given two MSO relations mso1(x,y,v) = (d1, d2, rcc, op) and mso2(x,y,v′) = (d1′, d2′, rcc′, op′),

    nbr(mso1, mso2) ≡def (d1 = d1′) and (d2 = d2′) and [(rcc = rcc′) or nbr(rcc, rcc′)] and [(op = op′) or nbr(op, op′)].

nbr is available for MRCC, OP and MSO.
4 Relation Composition

Definition 2: Given two MRCC relations mrcc1 = (d1, d2, r1) and mrcc2 = (d2, d3, r2),

    mrcc1 ∘ mrcc2 ≡def (d1, d3, r1 ∘ r2) ∩ {all the MRCC basic relations}.

But as for the occlusion relations OP, we cannot treat them independently, because the occlusion relation is not transitive [2]. We define the fully-occlusion relation set FOP = {FF, FFI, NF} to settle this problem:

    FF  = {(a,b,R,O)  | 0 ≤ a, b ≤ 2, R ∈ {TPP, NTPP, EQ}}
    FFI = {(a,b,R,OB) | 0 ≤ a, b ≤ 2, R ∈ {TPPI, NTPPI, EQ}}
    NF  = ¬(FF ∨ FFI)
Definition 3: Given two MSO relations mso(x,y,v) = (d1, d2, Ψ1, φ1) and mso(y,z,v) = (d2, d3, Ψ2, φ2), with RF1, RF2 ∈ FOP such that mso(x,y,v) ∈ RF1 and mso(y,z,v) ∈ RF2,

    mso(x,y,v) ∘ mso(y,z,v) ≡def [(d1,d2,Ψ1) ∘ (d2,d3,Ψ2), {NO,O,OB,MO}] ∩ (RF1 ∘ RF2).
5 Integrating Time Information into MSO

Definition 4: A time feature is defined as (t1, c, t2)(A,B), t1 < t2, meaning that the relation between A and B changes state c times from time t1 to time t2.
Definition 5: The composition of time features is defined as

    (t1, c1, t2)(A,B) ∘ (t2, c2, t3)(A,B) = (t1, c1 + c2, t3)(A,B).

The key problem is what relations can be obtained after a particular number of state changes. The function RC(R,c) defines the relation set which can be obtained after the state started from R and changed exactly c times:

    RC(R, c) = R                  if c = 0
             = nbr(R)             if c = 1
             = RC(nbr(R), c − 1)  if c > 1
The function RR(R1,R2) gives the possible numbers of state changes when the state starts from R1 and ends at R2:

    RR(R1, R2) = {n | n ∈ N, R2 ∈ RC(R1, n)}.
Considering the time feature, the mso relation at a time point t is expressed as:

    mso(x,y,v)_t = (DIM(img(x,v)), DIM(img(y,v)), Ψ, φ)_t.

Definition 6: The compositions ⊕RC and ⊕RR of time and MSO are defined as follows. Given mso(x,y,v)_t1 = (dx, dy, R1, P1)_t1 and mso(x,y,v)_t2 = (dx, dy, R2, P2)_t2,

    (dx, dy, R1, P1)_t1 ⊕RC (t1, c, t2)(x,y) = (dx, dy, RC(R1,c), RC(P1,c))_t2
    (dx, dy, R1, P1)_t1 ⊕RR (dx, dy, R2, P2)_t2 = (t1, RR(R1,R2) ∩ RR(P1,P2), t2)(x,y)
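A minimal Python sketch of RC and RR over a conceptual neighborhood graph follows; the neighborhood edges over OP used in the example are our own hypothetical choice (the paper does not list them here), so only the RC/RR mechanics should be read from it.

```python
def rc(nbr, r, c):
    """Relations reachable from r after exactly c state changes;
    nbr maps each relation to the set of its conceptual neighbors."""
    frontier = {r}
    for _ in range(c):
        frontier = set().union(*(nbr[x] for x in frontier))
    return frontier

def rr(nbr, r1, r2, max_changes=10):
    """Possible numbers of state changes from r1 to r2 (bounded search)."""
    return {n for n in range(max_changes + 1) if r2 in rc(nbr, r1, n)}

# Hypothetical neighborhood edges over OP = {NO, O, OB, MO}.
nbr = {'NO': {'O', 'OB'}, 'O': {'NO', 'MO'},
       'OB': {'NO', 'MO'}, 'MO': {'O', 'OB'}}
print(rc(nbr, 'NO', 2))        # {'NO', 'MO'}
print(rr(nbr, 'NO', 'MO', 4))  # {2, 4}
```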
References
1. Renz, J., Nebel, B.: Efficient Methods for Qualitative Spatial Reasoning. ECAI-98 (1998) 562–566.
2. Petrov, A.P., Kuzmin, L.V.: Visual Space Geometry Derived from Occlusion Axioms. Journal of Mathematical Imaging and Vision 6, 291–308.
3. Galton, A.: Lines of Sight. AISB Workshop on Spatial and Spatio-temporal Reasoning (1994) 1–15.
4. Randell, D., et al.: From Images to Bodies: Modeling and Exploiting Spatial Occlusion and Motion Parallax. IJCAI (2001) 57–63.
5. Papadias, D., et al.: Multidimensional Range Query Processing with Spatial Relations. Geographical Systems 4(4) (1997) 343–365.
6. Cohn, A.G., Hazarika, S.M.: Qualitative Spatial Representation and Reasoning: An Overview. Fundamenta Informaticae 46(1-2) (2001) 1–29.
7. Escrig, M.T., Toledo, F.: Qualitative Spatial Reasoning: Theory and Practice. Ohmsha (1999) 17–43.
8. Wang, S., Liu, D.: Multi-dimensional Spatial Occlusion Relation. International Conference on Intelligent Information Technology (ICIIT 2002), Beijing, 199–204.
Architecture Specification for Design of Agent-Based System in Domain View

S.K. Lee and Taiyun Kim

Department of Computer Science, Korea University, Anam-Dong Sungbuk-Ku, Seoul, 136-701, Korea
[email protected]
Abstract. How to engineer an agent-based system is an essential factor in developing agents or agent-based systems. Existing methodologies for agent-based system development focus on the development phases and the activities in each phase, and do not sufficiently consider system organization and performance aspects. It is important to provide designers with detailed guidance about how an agent-based system can be organized and specified. This paper defines the computing environment from the domain view, proposes methods to identify domains, deploy agents, and organize the system, and analyzes the characteristics of the established system organization.
1 Introduction

Recently, agent technology has been emerging, and some researchers consider it a new software engineering paradigm. Although agent technology is not mature and its usefulness is not yet sufficiently verified, interest in agent technology will increase gradually, and it will be developed and used in many fields. According to the number, intelligence, mobility, or capability of agents, types of agents can be classified into multi-agent, intelligent agent, informative agent, mobile agent, etc. Whatever the type of agent, an engineering methodology to develop agents or agent-based systems is required. The existing methodologies focus on the development phases and the activities in each phase, and do not sufficiently consider system organization and performance aspects. It is meaningful to establish guidance about how an agent-based system can be constructed efficiently [2]. This paper aims to provide designers with guidance about how an agent-based system can be organized. For this, we survey the existing methodologies for agent-based system development in Section 2. In Section 3, this paper defines the agents' society from the domain view and describes the organization method of an agent-based system. Future work is described in Section 4.
2 Related Work

Many methodologies for agent-based system development have been proposed. We survey GAIA [5], MaSE [4], BDI [1] and SODA [3]. The existing methodologies provide useful engineering concepts and methods for agent-based system development, but there are some weaknesses to be improved.
- Consideration of agents' society. It is necessary to model the computing environment from an agent-oriented view.
- Consideration of the deployment of agents and system organization. Deployment of agents in an agent-based system is not sufficiently considered in the engineering steps.
- Organization methods are not described clearly. This may be because agents act autonomously. Based on the relationships among agents, we can reason about organization.
- Consideration of the performance of agent-based systems. The existing methodologies do not provide a method to check whether a system organization is efficient.
3 Architecture Design of Agent-Based System

3.1 Domain Oriented Computing Environment Model

From an agent-oriented view, the computing environment can be recognized as a set of agent blocks. This paper recognizes a block as a domain. That is, the computing environment is composed of domains, and agents are placed in domains. Each domain has a manager to manage the domain internally and to support interaction among domains externally. Figure 1 models the computing environment from the domain view.

Fig. 1. Computing environment model in view of domain (four domains, each containing agents and a manager)
To construct an agent-based computing environment, domain partition and agent deployment are required.

3.2 Agent-Based System Organization

We can identify and define the necessary agents from the application description using the steps and activities proposed in existing methodologies: goal-role-agent model, relationship model, agent model, interaction model, etc. Additionally, deployment of agents, system organization, and performance of the system must be considered. This paper devises a method that solves these issues in the domain-oriented view. In order to achieve its roles, each agent interacts with other agents placed locally or remotely. Examining the relationships among agents from the application description, we can define the interaction type between two agents as follows:

- l: Local type. Two agents interact on the same infrastructure or organization.
- r: Remote type. Two agents interact at remote locations and work on different infrastructures and organizations.

If the interaction type of two
agents Ai and Aj is 'l', then it would be desirable to place both Ai and Aj in the same area. On the other hand, if the interaction type between agents Ai and Aj is 'r', then it would be efficient to separate Ai from Aj into different areas. We can define the domain identification procedure as follows: traversing all agents that have interaction type 'l' with agent Ai (i = 1..n, the number of agents) in sequence until there is no 'l' interaction type, and classifying each such group of agents into a domain, we can extract the domains. When five agents are identified and the interaction types are as in Figure 2, two domains can be extracted: A1, A2 and A4 are deployed in Domain1, and A3, A5 in Domain2.

Fig. 2. Interaction type between agents (five agents A1–A5 connected by links labeled 'l' or 'r')
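The domain identification procedure amounts to computing the connected components of the 'l'-links. A small Python sketch follows, with a hypothetical link set chosen to be consistent with the numbers in Table 1 below (the exact topology of Figure 2 is not recoverable here):

```python
def identify_domains(agents, links):
    """Group agents into domains: connected components of the 'l' links."""
    parent = {a: a for a in agents}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for a, b, kind in links:
        if kind == 'l':
            parent[find(a)] = find(b)
    domains = {}
    for a in agents:
        domains.setdefault(find(a), set()).add(a)
    return list(domains.values())

agents = ['A1', 'A2', 'A3', 'A4', 'A5']
links = [('A1', 'A2', 'l'), ('A2', 'A4', 'l'), ('A3', 'A5', 'l'),   # hypothetical
         ('A2', 'A3', 'r'), ('A4', 'A5', 'r'), ('A1', 'A5', 'r')]
print(identify_domains(agents, links))  # [{'A1', 'A2', 'A4'}, {'A3', 'A5'}]
```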
Based on the result of domain partition and agent deployment, the system organization can be established. The relation among domains is basically remote. One domain can be mapped onto one local agent-based system. Therefore the entire system is organized such that agent-based systems interact with each other remotely. Applying the domain identification procedure to Figure 2, the system is organized as in Figure 3.

Fig. 3. System organization in view of domain (Domain1 with agents A1, A2, A4 and manager M1; Domain2 with agents A3, A5 and manager M2)

Table 1. Features of system organization of Figure 3

    Feature                      Domain1              Domain2
    Deployed agents              A1, A2, A4           A3, A5
    Locality of domain           4/7 (0.5714)         2/5 (0.4)
    Domain load                  7/3 agents (2.33)    5/2 agents (2.5)
    Average load of all agents   12/5 agents (2.4)
    Decision of domain load      low                  high
    Overloaded agent             A2                   A5
It is necessary to examine the features of the established system organization:

(1) Locality. This feature is the rate of local behaviors over total behaviors in a certain domain. A low rate implies that the domain interacts heavily with remote domains.

    (Number of 'l' links of all agents deployed in each domain) / (Number of total links in each domain)
(2) Domain load. This feature checks whether a certain domain is overloaded. It can be measured as below.

    (Average number of behaviors that agents should provide in the domain) / (Average load of entire agents)
(3) Agent load. This feature identifies overloaded agents. When the number of links a certain agent has is more than the average load of the entire set of agents, the agent is overloaded.

    (Number of links of each agent) / (Average load of entire agents)
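The three metrics can be computed directly from the link list used in the earlier sketch; the Python sketch below (our own illustration) reproduces the Domain1/Domain2 figures of Table 1 when run on that hypothetical link set:

```python
from collections import defaultdict

def organization_features(links, domains):
    """Compute locality, domain load, and overloaded agents per domain."""
    load, local = defaultdict(int), defaultdict(int)
    for a, b, kind in links:
        for agent in (a, b):
            load[agent] += 1
            if kind == 'l':
                local[agent] += 1
    avg = sum(load.values()) / len(load)          # 12/5 = 2.4 for Table 1
    report = {}
    for name, members in domains.items():
        total = sum(load[a] for a in members)
        report[name] = {
            'locality': sum(local[a] for a in members) / total,
            'domain_load': total / len(members),
            'overloaded': sorted(a for a in members if load[a] > avg),
        }
    return report

domains = {'Domain1': {'A1', 'A2', 'A4'}, 'Domain2': {'A3', 'A5'}}
print(organization_features(links, domains))
# Domain1: locality 4/7, load 7/3, overloaded ['A2']; Domain2: 2/5, 5/2, ['A5']
```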
As an example, applying the metrics to Figure 3, we can analyze the features of the system as in Table 1. When the initial design result does not have good features, the engineer can devise alternative system organizations. Finally, we can specify the agent-based system as in Figure 4.

    Computing Environment = < E-manager, E-platform, Domain1, Domain2 >
    E-manager = < E-DS, E-coordination, E-Interoperation >
    E-coordination = { Contract-net }
    E-interoperation = { Translator }
    E-platform = < E-communication, E-language >
    E-communication = { Internet }
    E-language = { ACL }
    Domain1 = < I-manager, I-platform, A1, A2, A4 >
    I-manager = < I-DS, I-coordination, I-Interoperation >
    I-coordination = { Contract-net }
    I-interoperation = { Translator }
    I-platform = < I-communication, I-language >
    I-communication = { Ethernet }
    I-language = { ACL }
    A1 = < Name, Address, State, Capability >
    …
Fig. 4. Specification example to design agent-based system architecture
4 Conclusion and Future Work

The existing development methodologies for agent-based systems must be improved. In particular, methods for agent deployment and for performance prediction based on the system organization are needed. This paper proposes methods to complement existing methodologies. Modeling a real, complex computing environment as domains, with domain partition and agent deployment by the l-r type, lets the engineer organize an agent-based system. The methods can be used to establish the architecture of an agent-based system and to specify the architecture before detailed design. In the future, we will improve the methods proposed in this paper and combine them with existing methodologies.
References
[1] Huhns, M., et al.: Interaction-oriented Software Development. IJSEKE, Vol. 11, No. 3, World Scientific Pub. (2001) 259–279.
[2] Hogg, L.M.J., et al.: Socially Intelligent Reasoning for Autonomous Agents. IEEE Trans. on Systems, Man, and Cybernetics - Part A, Vol. 31, No. 5 (Sep. 2001) 381–393.
[3] Omicini, A.: SODA: Societies and Infrastructures in the Analysis and Design of Agent-based Systems. LNCS 1957, Springer (2000) 185–193.
[4] DeLoach, S.A., et al.: Multiagent Systems Engineering. IJSEKE, Vol. 11, No. 3, World Scientific Pub. (2001) 231–258.
[5] Zambonelli, F., et al.: Organizational Rules as an Abstraction for the Analysis and Design of Multi-agent Systems. IJSEKE, Vol. 11, No. 3, World Scientific Pub. (2001) 303–328.
Adapting Granular Rough Theory to Multi-agent Context

Bo Chen and Mingtian Zhou

Microcomputer Institute, School of Computer Science & Engineering, University of Electronic Science & Technology of China, Chengdu, 610054
[email protected], [email protected]
Abstract. The present paper focuses on adapting Granular Rough Theory to a multi-agent system. By transforming the original triple-form atomic granule into a quadruple, we encapsulate the agent-specific viewpoint into information granules to mean "an agent knows/believes that a given entity has the attribute type with the specific value". Then a quasi-Cartesian qualitative coordinate system named Granule Space is defined to visualize information granules by their agent views, entity identities and attribute types. We extend Granular Rough Theory into new versions applicable to the 3-D information cube based M-Information System. Then the challenges the MAS context poses to rough approaches are analyzed, in the form of an obvious puzzle. Though leaving systematic solutions as open issues, we suggest auxiliary measurements to alleviate, or at least to evaluate, the invalidity of the rough approach in MAS.

Keywords. Granule Space, Information Cube, M-Information System, Granular Rough Theory
1 Introduction

Efforts to encapsulate all relevant elements in individual granules and to develop rough theory over granules rest on the expectation that the rough approach could be more applicable for knowledge representation and approximation. In [1], we represent the data cells of an information table as Atomic Granules, i.e. triples of the form (ui, aj, vk), with the sentential meaning that "Entity ui has the attribute aj with the value vk". Regarding the atomic granule as the primitive construct, we define a granular calculus for the Information System, with the facilities of which a granular approach to roughness is built up based on pure mereological relations over granules, referred to here as Granular Rough Theory. Shifting the context of information granules to a Multi-Agent System, each agent can have her own knowledge/belief/etc. of the outer world, which means there would be multiple information tables in the entire system. Hence, our new approach sets out by incorporating the viewpoint of the agent into the atomic granule, bringing in a quadruple (agt, ui, aj, vk), with the complete semantics a data cell could indicate: "Agent agt knows/believes/etc. that entity ui has the attribute aj with the value vk".
2 Granule Space

For a given quadruple-form atomic granule (agt, ui, aj, vk), it is easy to suppose that there exists a functional relation F : Ag × U × A → V, which states that the specific attribute value of a granule is determined by the agent viewpoint as well as the entity identity and the attribute type, viz. vk = F(agt, ui, aj). Such an equation can be represented as a valued point in a 3-D coordinate system, as shown in Fig. 1.
Fig. 1. Coordinate Representation of Quadruples. A quadruple-form granule is a visualized point Pt(agt, ui, aj) of three coordinate arguments, whereas the value vk is the point-specific functional value. Each coordinate is a discrete set, sorted only by the given index, which distinguishes the coordinate system from common mathematical plot systems.
Given the point representation above, we define Granule Space as a hypothetical qualitative quasi-Cartesian coordinate system, with three non-negative axes standing for arbitrarily-ordered discrete agent viewpoints, entity identities in the universe of discourse, and attribute types for entities, holding 3-D points qualified by these coordinates, each of which is valued as the specific attribute value of an entity in the viewpoint of the associated agent. There are some properties of Granule Space:

- Its coordinate system is similar to a restricted or pruned form of the common Cartesian coordinate system in the convention of coordinate representation and interpretation for points.
- It is not a quantitative but a qualitative system, which distinguishes its coordinate system from its counterpart in ordinary Cartesian form. The differences include the non-numerical semantics of coordinates and the incomparability of values associated with different points for distinct attributes.
- It is intrinsically discrete, in two senses: the axes stand for discrete notions such as agent views, entities and attribute types; and the contents in it are individual information granules dispersed in such a conceptual space.

Then the point form of granules can be substituted by a sphere centered on the point Pt(agt, ui, aj), as indicated by the dashed circles in Fig. 1. An Information Cube is the special case in which, among all information granules contained inside the Granule Space, each agent has an information table about the same set of entities describing an identical set of attribute types. Such a situation is common when there are several agents with similar reasoning functions over the same set of entities, giving their respective decision rules. Fig. 2 gives an illustration.
Fig. 2. An information cube in Granule Space. A 3-D information cube can be a standard extension of the original 2-D information table structure for multi-agent systems, encoding the knowledge/belief of each agent to provide additional functions, viz. the ability to evaluate, synthesize and harmonize the distributed viewpoints. Associated values of granules are omitted.
The three planes parallel to the axis planes have their respective implicit connotations. The entity-attribute plane cutting a specific agent graduation agt takes the complete viewpoint of agent agt on the universe of discourse, referred to as the Agent Sight. The agent-attribute plane intersecting a specific entity identity ui on the entity axis stands for all agent views about the entity ui, referred to as the Entity Perspective. The agent-entity plane meeting the attribute axis at the graduation aj indicates the extensional values of the specific attribute aj over each entity from all the agents' viewpoints, referred to as the Attribute Extension.
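To fix ideas, a minimal Python sketch of an information cube as a mapping (agent, entity, attribute) → value follows, with the three slicing operations just described; all names and values are hypothetical:

```python
# Hypothetical information cube: (agent, entity, attribute) -> value.
cube = {
    ('ag1', 'u1', 'readability'): 3, ('ag1', 'u1', 'innovation'): 4,
    ('ag2', 'u1', 'readability'): 2, ('ag2', 'u1', 'innovation'): 4,
}

def agent_sight(cube, agt):
    """Entity-attribute plane: agent agt's complete view of the universe."""
    return {(u, a): v for (g, u, a), v in cube.items() if g == agt}

def entity_perspective(cube, ent):
    """Agent-attribute plane: all agents' views of one entity."""
    return {(g, a): v for (g, u, a), v in cube.items() if u == ent}

def attribute_extension(cube, attr):
    """Agent-entity plane: one attribute's values across agents and entities."""
    return {(g, u): v for (g, u, a), v in cube.items() if a == attr}
```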
3 Adapting Granular Rough Theory to MAS

We define a Multi-Agent oriented M-Information System IM = (Ag, U, A), given by an information cube for which Ag is the set of agent viewpoints, U is the universe of discourse and A is the set of attribute types. Substituting the triple-form atomic granule ξ(ui, cj, vk) with the quadruple ξ(agt, ui, cj, vk), the underlying granular calculus is modified into an extended version, the M-Granular Calculus CM. In CM, most of the basic operations are consistent with the original system, while the internal structures of the compound granules generated incorporate additional agent information. The collection of agent views of a compound granule is critical: it decides whether the granule at hand is in the local scope of one agent view or spans
multiple views; accordingly, the corresponding compound granules are named agt-Local Granules and Global Granules, respectively. Such a classification is applicable to the original definitions of Cluster Granule, Aspect Granule, Aspect Cluster Granule and so on. It is natural for each participating agent to have the right to analyze her own Agent Sight of the universe to achieve a local perception of roughness. That is, by slicing the given information cube into layers parallel to the entity-attribute axis plane, the methods in 2-D Granular Rough Theory can be applied to local compound granules in each layer, so as to roughly approximate local decisional granules with local conditional aspect cluster granules. For the M-Information System itself, the extended Granular Rough Theory should take account of attaining not only the local roughness for each agent but also the global roughness. It is straightforward to apply a process similar to that used when we define granular roughness from an information table, viz. to classify all the global aspect cluster granules into Regular, Irregular and Irrelevant Granules with respect to a given global decisional aspect cluster granule, and moreover to define the global Kernel, Hull and Corpus Granules with respect to it. It should be noticed that the Shift operation is based on intensional connections amongst information granules, interpreted as aspects belonging to the same entity in the information table, whereas in an information cube it is confined to aspects of the same entity in the same agent's view. By nature, the above two approaches both try to reduce the 3-D information structure given by an information cube to a 2-D structure. The former applies the Slicing method to cut the information cube into layers, and then confines each pass of investigation to a single information table; the latter implicitly utilizes the Flattening method to merge the layers of the information cube into one large global information table, in which the original universe of entities is enlarged by re-labeling each entity with a new identity incorporating the hint of its agent view. Then the extended version of Granular Rough Theory is consistent with its triple version.
4 Challenges of the Multi-agent Context

From the motivation of rough theory, roughness is developed to discover the decision rules of an Information System, so that we can base our reasoning on these rules to infer the decision attribute value of an entity from some of its conditional attribute values. On the other hand, the most important contribution of rough theory is approximating a set of entities from inside and outside the set. In a non-agent-oriented system, there is no great conflict between these two methodological connotations. Nevertheless, in the M-Information System, since different agents might have diverse knowledge/beliefs of the outer world, drastic inconsistency arises. For instance, suppose there are two agents ag1 and ag2 in the system; both agents may infer the decision rule "each paper that has a readability of 3 points and an innovation of 4 is accepted" from their own information tables. Such a rule is then translated into the representation form "the class of papers that will be accepted can be approximated by the class of papers that have the attribute readability with value 3 and the attribute innovation with value 4". Since the attributes "readability" and "innovation" are both somewhat subjective, the local information tables of the two agents may be quite different in
the real data distribution, but they happen to reach common rules describing only the inference relation between attribute values. Then for a concrete paper, it may be hard to decide whether it is qualified or not without further efforts to coordinate the contradictions between the agents' views. But if the information tables of all agents were identical, it would not make any sense to expend effort on this. The rough approach to information analysis is thus challenged in the context of MAS. This puzzle lies in the multiple factors that affect the agents' views of the outer world, including the epistemic characteristics of each agent, the system deployment and other concrete environments, and so on. It is beyond the reach of the present paper to establish a systematic methodology to resolve the puzzle, and it is left as an open issue for future research. We can, however, establish measurements helpful in alleviating the invalidity of the rough approach in some cases, or at least in evaluating its current applicability. By evaluating the similarity among the rows of a specific Entity Perspective Granule, the degree of inconsistency of an entity's perspective across different agent views can be calculated; whereas by assessing the similarity among the rows of a given Attribute Extension Granule, the subjectivity, viz. the degree of dependence on the agents' personality, of an attribute can be found. If an attribute type is too subjective, that is, if its value depends too much on the arbitrary epistemic state of an agent, this attribute is bound to differ drastically in value for most of the entities, aggravating the degree of inconsistency of entity perspectives. In such a case, we can reconsider new attribute types that better objectively characterize the entities, in order to obtain a more rational decision system. On the other hand, if the subjectivity is not serious and there are some entities that have a much higher degree of inconsistency in their perspectives than the average case, we can try to find out what implicit reasons lead to these special cases, so that we can adjust the system or these special entities accordingly. A well-founded methodology for investigating similarity among constructs is part of the Rough Mereology established by A. Skowron and L. Polkowski, as stated in [2, 3, 4], in which the quantitative measure of similarity is given by the value of Rough Inclusion. Based on it, deliberate system parameters can be defined to convey specific semantics in future work.
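One plausible instantiation of the inconsistency and subjectivity measurements, with a cube as in the earlier sketch, is given below; this is our own sketch (the paper leaves the concrete measure open), and a rough-inclusion-based similarity could replace the exact-equality test used here:

```python
from itertools import combinations

def perspective_inconsistency(cube, ent, agents, attrs):
    """Fraction of (agent-pair, attribute) combinations on which two agents
    disagree about entity ent -- a crude Entity Perspective measure."""
    pairs = list(combinations(agents, 2))
    diffs = sum(cube[(g1, ent, a)] != cube[(g2, ent, a)]
                for g1, g2 in pairs for a in attrs)
    return diffs / (len(pairs) * len(attrs))

def attribute_subjectivity(cube, attr, agents, entities):
    """Fraction of (agent-pair, entity) combinations on which two agents
    disagree about attribute attr -- a crude Attribute Extension measure."""
    pairs = list(combinations(agents, 2))
    diffs = sum(cube[(g1, u, attr)] != cube[(g2, u, attr)]
                for g1, g2 in pairs for u in entities)
    return diffs / (len(pairs) * len(entities))
```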
References
1. Chen, B., Zhou, M.: A Pure Mereological Approach to Roughness. Accepted by RSFDGrC 2003.
2. Polkowski, L., Skowron, A.: Approximate Reasoning about Complex Objects in Distributed Systems: Rough Mereological Formalization. Extended version of a lecture delivered at the Workshop on Logic, Algebra and Computer Science (LACS), Banach International Mathematical Center, Warsaw, December 16, 1996.
3. Skowron, A., Polkowski, L.: Rough Mereological Foundations for Design, Analysis, Synthesis, and Control in Distributed Systems. Information Sciences, Elsevier Science Inc. (1998).
4. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate Reasoning. International Journal of Approximate Reasoning 15(4) (1996) 333–365.
How to Choose the Optimal Policy in Multi-agent Belief Revision?

Yang Gao, Zhaochun Sun, and Ning Li

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, P.R. China
{gaoy, szc, ln}@nju.edu.cn
Abstract. In a multi-agent system, it is not enough to maintain the coherence of a single agent's belief set only. Through communication, one agent removing or adding a belief sentence may influence the certainty of other agents' belief sets. In this paper, we carefully investigate some features of the belief revision process in different multi-agent systems and bring forward a uniform framework, Bereframe. In Bereframe, we hold that there are objectives beyond maintaining the coherence or maximizing the certainty of a single belief revision process, and an agent must choose the optimal revision policy to realize them. A credibility measure is brought forward, which is used to compute the maximal credibility degree of the knowledge background. In a cooperative multi-agent system, agents always choose the combinative policy to maximize the certainty of the whole system's belief set according to the welfare principle. In a semi-competitive multi-agent system, agents choose the revision policy according to the Pareto efficiency principle. In a competitive multi-agent system, the Nash equilibrium criterion is applied to the agents' belief revision process.
1 Introduction

Belief revision is the process of incorporating new information into the current belief system. An agent's belief revision function is a mapping brf : ℘(Bel) × P → ℘(Bel), which, on the basis of the current percept P and the current beliefs Bel, determines a new set of beliefs. The most important principle of the belief revision process is that consistency should be maintained, but some extra-logical criteria are needed as well, for example, the minimal change principle in the AGM theory [1,2]. Another fundamental reason for belief revision is inherent uncertainty. How uncertainty is represented in a belief system also fundamentally affects how belief is revised in the light of new information. There are two approaches. The first comprises numerical approaches, which use numbers to summarize the uncertainty, e.g., probability theory. The second comprises non-numerical approaches, in which there are no numbers in the representation of uncertainty; they logically deal with the reasons for believing and disbelieving a hypothesis [3].
Most belief revision research has been developed with a single agent in mind, and in multi-agent systems agents have acted independently, as single agents without awareness of others' existence. However, since agents communicate, cooperate, coordinate and negotiate to reach common goals, they are not independent. So it is not enough to maintain the coherence of a single agent's belief set only. Current research on multi-agent belief revision does not differentiate clearly between cooperative, semi-competitive and competitive multi-agent systems. Wei Liu et al. divided multi-agent belief revision into MSBR and MABR [4,5]: MSBR studies individual agents' revision behaviors, i.e., when an agent receives information from multiple agents towards whom it has social opinions, while MABR investigates the overall BR behavior of agent teams or a society. There is other research in this field, for example Benedita Malheiro's use of an ATMS in MABR [6,7], the work of Kfir-Dahav and Tennenholtz on MABR in the context of heterogeneous systems [8], and the mutual belief revision of van der Meyden [9]. But these approaches cannot be applied to all kinds of multi-agent systems. In this paper, by analyzing the various communication forms, we give some postulates about possible belief revision objectives. Because there are objectives beyond maintaining the coherence or maximizing the certainty of a single belief revision process, an agent must choose the optimal revision policy to realize them. The problem of multi-agent belief revision thus becomes the problem of how to choose the optimal revision policy in different multi-agent systems. First, a certainty or credibility measure is brought forward, which is used to compute the maximal certainty degree of the knowledge background. In a cooperative multi-agent system, agents always choose the combinative policy that maximizes the certainty measure of the whole system's belief set, according to the welfare principle. In a semi-competitive multi-agent system, agents choose the revision policy according to the Pareto efficiency principle. Lastly, in a competitive multi-agent system, the Nash equilibrium criterion is applied to the agents' belief revision process.
2 A General Framework of Multi-agent Belief Revision

In order to implement cooperative, semi-competitive and competitive MABR, we must describe the objectives of an agent's BR in every situation clearly. After BR, the belief set of each individual agent remains consistent, while inter-agent inconsistency is allowed. In a cooperative MAS, we try to maximize the consistency among agent belief sets; in the ideal case there is no inter-agent inconsistency, but inconsistency still exists when the strategies of each agent are limited. In a semi-competitive MAS, each agent maximizes its own preference and then tries to maximize the consistency of the whole system. In a competitive MAS, an agent only tries to maximize its own preference. In the following sub-sections, we describe a uniform model of MABR, namely Bereframe. In our framework, we assume that the new information and initial belief set of each agent can differ, and that the new information for each agent arrives simultaneously. Agents carry out the BR process in parallel. During the process, each agent has
a number of strategies. In fact, from a belief set B to the belief revision strategies there is a mapping B → 2^B; that is, on belief set B there are 2^|B| strategies in the ideal case. But when implementing BR, we can only choose some of them.

2.1 The Uniform Model of Bereframe

The MABR model of Bereframe is a six-tuple: <A, B, EE, N, S, P>. In this framework, A is the set of agents, B is the set of the agents' belief sets, EE is the set of the agents' belief Epistemic Entrenchments, N is the set of new information, S is the set of the agents' strategies, and P is the utility function of an agent. P(Sj(i)) is the utility value that agent i gets when it uses revision strategy j. In Bereframe, every agent has a utility measurement, which refers to the amount of satisfaction an agent derives from an object or an event. Since in BR an agent prefers to keep beliefs with higher EE and to give up those with lower EE, the utility of an individual agent can be measured by the ultimate EE of the agent's belief set, which is P(Sj(i)) in our framework. We use the arithmetic average over all beliefs to express P(Sj(i)), because it is simple to calculate and easy to understand. The utility function of an agent is described by equation (1):

    P(Sj(i)) = (1 / |Bi|) Σ_{belief_k ∈ Bi} EE(belief_k)   (1)
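A one-function Python sketch of equation (1), with an illustrative entrenchment table of our own, is:

```python
def utility(revised_belief_set, ee):
    """P(S_j(i)): arithmetic mean of epistemic entrenchment over the belief
    set B_i that strategy S_j(i) leaves agent i with (equation (1))."""
    return sum(ee[b] for b in revised_belief_set) / len(revised_belief_set)

ee = {'p': 0.9, 'q': 0.6, 'p->q': 0.4}   # hypothetical entrenchment degrees
print(utility({'p', 'q'}, ee))           # 0.75
```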
2.2 BR in Cooperative, Semi-competitive, and Competitive MAS

We implement MABR with three different evaluation criteria on the basis of Bereframe. Let us take only two agents into account to explain our idea: A = {A1, A2}. The environment of A1, A2 can be regarded as a big set E, with Cn(B1) ⊆ E and Cn(B2) ⊆ E. If we define increasing the profit of an individual agent as expanding its closure, then this problem can be dealt with through the closure operation. Process:
1. If the intersection of the two agents' belief set closures is empty, then they have no conflict and are cooperative.
2. If the intersection is not empty, then, under the assumption that each agent maximizes its own profit, whether cooperation is possible or not depends on whether it increases or reduces each agent's profit.
   2.1 If the increase of the intersection is larger than the increase of each belief set after belief revision, then the basis of cooperation exists and the requirement of maximizing the whole profit can be conceived.
   2.2 Otherwise, the agents are competitive.

2.2.1 Cooperative BR

Cooperative BR uses the Social Welfare evaluation criterion. It considers global profit, and this kind of BR maximizes the utility of the whole system. The optimum solution
How to Choose the Optimal Policy in Multi-agent Belief Revision?
709
in our model is the strategy pair that maximizes the consistency of MAS. In our view, an account of consistency of belief set should consider two knowledge bases: 1. The knowledge background KB, which is the set of all the beliefs available. Here KB={B1 B2 Bn}. * 2. The knowledge base KB ⊆KB, which is the maximal consistent subset of KB. The measurement of consistency ρ is defined as follows: ρ = KB KB , that is the portion of the number of maximal consistent beliefs to that of the all beliefs of agents. 2.2.2 Semi-competitive BR In the semi-competitive MAS, it is not necessary to compute the global consistency, but to compute the EE of each belief after revision. EE measures the priority of selecting the belief and is restricted by three assumptions. The first is subjective factor. System or designer gives EE to the beliefs in the initial belief set. The EE of new information is also given by system. The second is logic factor. For example, if p•q, then EE of q is not smaller than that of p. If p q•r, then EE of r is not smaller than the minimum of that of p and q. And if S T•r, then EE of r is not smaller than the maximum of that of p and q. The third is experiential factor. When an agent discovers that other agents add or remove some beliefs, it should revise the EE of these beliefs of its own accordingly. For example, when A1 discovers that A2 reduces belief α and α is enclosed in its own belief set, it will subtract a very little amount ξ from EE of α. The semi-competitive MAS use Pareto Efficiency evaluation criteria. Definition (Pareto Efficiency): A pair of strategy (Sa(i), Sb(j)) is Pareto-optimal, if there is no other strategy which can improve P(Sb(j) without reducing P(Sa(i)). 2.2.3 Competitive BR In competitive MAS, each agent cares only about ist profit and chooses the strategy that maximizes the profit. If we don’t take the impact from other agents into account, the BR process is completely equal to individual BR and the computation of EE is independent from other agents’ experiential knowledge. If agents interact with each other, we should assume that, agent knows other agents’ strategies, other agents’ current or initial belief sets, the new information accepted by other agents and the strategies used by others are observable after BR operation. With these assumptions, we can revise beliefs in competitive MAS using max-min method. When A1 receives new information, it estimates which strategy A2 employing will minimize ist average EE, and then it chooses the best strategy it can under this situation. A2 also estimates A1 and employ a certain strategy accordingly. During this process, these two agents don’t exchange their information, because they are opposed. This method will result the Nash equilibrium. Definition (Nash Equilibrium): A pair of strategy (Sa(i), Sb(j)) is the Nash equilibrium, if and only if, for every agent, e.g. agent a: Sa(i) is the best strategy for agent a if all the others agents, here agent b, choose (Sb(j)).
3 Conclusion and Future Work

In MASs, agents complete tasks by cooperating, negotiating and coordinating. They consider not only their own profit but also the profit of the whole system. A multi-agent belief revision framework has been developed here: Bereframe provides a model supporting several kinds of MABR, e.g. cooperative MABR, semi-competitive MABR and competitive MABR, in order to unite all MASs in one framework. Many other features of this framework are still under investigation and need future work. For example, when the inconsistency of a belief set is reasonable [10] or there are malicious agents in the MAS, the BR becomes very complex and needs to be investigated deeply in our future work.

Acknowledgements. This paper is supported by the National Natural Science Foundation of China under Grant No. 60103012 and the National Grand Fundamental Research 973 Program of China under Grant No. 2002CB312002.
References
1. Alchourrón, C., Gärdenfors, P., Makinson, D.: On the Logic of Theory Change: Partial Meet Contraction Functions and Their Associated Revision Functions. Journal of Symbolic Logic 50 (1985) 510–530.
2. Gärdenfors, P.: The Dynamics of Belief Systems: Foundations vs. Coherence Theories. Revue Internationale de Philosophie 171 (1990) 24–46.
3. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ (1976).
4. Liu, W., Williams, M.A.: A Framework for Multi-agent Belief Revision (Part I: The Role of Ontology). In: Foo, N. (ed.) 12th Australian Joint Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, Springer-Verlag, Sydney, Australia (1999) 168–179.
5. Liu, W., Williams, M.A.: A Framework for Multi-agent Belief Revision. Studia Logica 67 (2001) 219–312.
6. Malheiro, B., Jennings, N.R., Oliveira, E.: Belief Revision in Multi-agent Systems. In: Proceedings of the 11th European Conf. on Artificial Intelligence (ECAI-94) (1994) 294–298.
7. Dragoni, A.F., Giorgini, P., Baffetti, M.: Distributed Belief Revision vs. Belief Revision in a Multi-Agent Environment: First Results of a Simulation Experiment. In: Boman, M., Van de Velde, W. (eds.) Multi-agent Rationality, LNCS 1237, Springer-Verlag (1997).
8. Kfir-Dahav, N.E., Tennenholtz, M.: Multi-agent Belief Revision. In: 6th Conference on Theoretical Aspects of Rationality and Knowledge (1996) 175–194.
9. van der Meyden, R.: Mutual Belief Revision. In: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Bonn (1994) 595–606.
10. Friedman, N., Halpern, J.Y.: Belief Revision: A Critique. In: Aiello, L.C., Doyle, J., Shapiro, S.C. (eds.) Principles of Knowledge Representation and Reasoning: Proceedings of the Fifth International Conference (KR'96), Morgan Kaufmann, San Francisco (1996) 421–431.
Research of Atomic and Anonymous Electronic Commerce Protocol

Jie Tang, Juan-Zi Li, Ke-Hong Wang, and Yue-Ru Cai

Department of Computer Science and Technology, Tsinghua University, P.R. China, 100084
[email protected], [email protected]
Abstract. Atomicity and anonymity are two important requirements for applications in electronic commerce. However, absolute anonymity may lead to conflicts with law enforcement, e.g. blackmailing or money laundering. Therefore, it is important to design systems satisfying both atomicity and revocable anonymity. Based on the concept of two-phase commitment, we realize atomicity in electronic transactions with the Trusted Third Party as coordinator. We also develop Brands' fair signature model and propose a method that enables not only anonymity but also owner tracing and money tracing.

Keywords: fair blind signature, atomicity, fair anonymity, payment system
1 Introduction

Recently, e-commerce (electronic commerce) has been among the most important and exciting areas of research and development in information technology. Several e-commerce protocols have been proposed, such as SET [1], NetBill [2], DigiCash [3], etc. However, its application has not grown as expected; the main reasons restricting its wide and quick development are listed below.

(1) Lack of fair transactions. A fair transaction is a transaction satisfying atomicity, which means that both sides agree on the goods and money before the transaction and receive the correct goods (or money) after the transaction; otherwise the transaction terminates with reconversion for both sides. Unfortunately, most existing e-transaction protocols do not enable, or only partly enable, atomicity.

(2) Lack of privacy protection. Most of the current systems do not protect users' privacy. In these systems, the seller not only needs to cater for the existing clients on the Internet, but also wants to mine potential users, so all clients' activities on the web are logged involuntarily, which leads to the potential misuse of clients' privacy. Another type of system is the anonymous one [3], in which a client can execute payments anonymously, but new problems then emerge, e.g. cheating, money laundering, blackmailing, etc.

To deal with these problems, we propose in this paper a new electronic transaction protocol that realizes atomicity and fair anonymity at the same time. This paper is structured as follows: Section 2 introduces related work; Section 3 presents the new electronic transaction protocol; Section 4 analyzes the atomicity and fair anonymity of the protocol; finally, Section 5 summarizes the paper.
2 Related Work
Atomicity and anonymity are two important characteristics of e-commerce. David Chaum put forth the concept of anonymity in electronic commerce and realized privacy in his DigiCash [3]. But absolute anonymity enables criminals to commit untraceable e-crimes, such as corruption, unlawful purchases and blackmail. In 1992, van Solms and Naccache [4] discovered a serious attack on Chaum's payment system. Therefore, Chaum and Brands proposed protocols to implement controllable anonymity [5,6], and afterward Stadler brought forward fair anonymity [7].
Tygar argued that atomicity should be guaranteed in e-transactions [8,9] and divided atomicity into three levels: money atomicity, goods atomicity and certified delivery.
(1) Money atomicity. This is the basic form of atomicity: a money transfer is atomic if, in a transaction, the money either transfers successfully or is restored.
(2) Goods atomicity. Goods atomicity presupposes money atomicity. It demands that if the money is paid then the goods are guaranteed to be received, and if the goods are received then the money must be paid as well.
(3) Certified delivery. Certified delivery builds on money atomicity and goods atomicity and additionally provides the capability to certify sellers and customers.
Jean Camp presented a protocol realizing atomicity and anonymity in [10], but in that system fair anonymity is unavailable and goods are limited to electronic ones.
3 Atomic and Fair Anonymous Electronic Transaction Protocol
We propose a new protocol in which atomicity and anonymity are both satisfied and tracing of illegal transactions is available as well. We call the protocol AFAP (Atomic and Fair Anonymous Protocol). The system model is a 5-tuple AFAPM := (TTP, BANK, POSTER, SELLER, CUSTOMER), where TTP is a trusted third party and POSTER is a delivery system; POSTER and BANK are both second-grade trusted organizations authenticated by the TTP. AFAP comprises five sub-protocols: TTP Setup, BANK Initialization, Registration, Payment and Trace. In this section we focus on payment and trace; a detailed description can be found in [11].
3.1 Payment Sub-protocol
The payment sub-protocol is the most important sub-protocol in AFAP. The flow and relations of the participants in the payment sub-protocol are shown in Fig. 1.
Step 1. Preparing to buy goods, CUSTOMER sends a TransReq to TTP.
Step 2. TTP generates a TID and an Expiration_Date for the transaction.
Step 3. CUSTOMER computes the drawn token Token_w and the payment token Token_p. Then he sends the blinded Token_p to BANK.
Fig. 1. The flow and relations of the participants in the payment sub-protocol
Step 4. CUSTOMER executes a blind signature protocol with BANK to obtain BANK's signature on the blinded Token_p, i.e., $\mathrm{Sign}_{SK\_BANK}(A_{ct}, A_{ot}, B_t)$. Then BANK subtracts the corresponding value from CUSTOMER's account. In this way, the signature is regarded as a coin of corresponding value in the later transaction.
Step 5. CUSTOMER selects the Pickup_fashion and sets the Pickup_Response, which is kept secret between CUSTOMER and POSTER. CUSTOMER then generates Trans_detail and sends $\mathrm{Sign}_{SK\_BANK}(A_{ct}, A_{ot}, B_t)$ and Trans_detail to SELLER.
Step 6. SELLER starts a local transaction clock. Afterward, SELLER sends $\mathrm{Sign}_{SK\_BANK}(A_{ct}, A_{ot}, B_t)$ and the value to BANK.
Step 7. BANK validates the blind signature $\mathrm{Sign}_{SK\_BANK}(A_{ct}, A_{ot}, B_t)$, and then queries the signature-value and signature-payment databases to judge the validity of the signature and whether the payment is an illegal double payment. Once all these checks pass, BANK adds the payment signature to SELLER's account and generates a Trans_guarantee for SELLER.
Step 8. SELLER starts the dispatch process: he transfers the Pickup_fashion to POSTER and notifies POSTER to prepare for delivering the goods.
Step 9. POSTER checks whether the Pickup_fashion is permitted; if so, POSTER generates a Pickup_guarantee and sends it to TTP.
Step 10. Given that the Pickup_guarantee is received before the Expiration_Date: if there exist one or more Rollback requests, TTP sends a Rollback command to all participants; otherwise TTP sends Trans_Commit. If TTP does not receive the Pickup_guarantee before the Expiration_Date, it sends Rollback to all participants to roll back the transaction.
Step 11. Having received Trans_Commit, BANK begins the real transfer process, SELLER dispatches the goods to POSTER, and POSTER delivers the goods according to the Pickup_fashion.
3.2 Trace Sub-protocol
When an illegal transaction occurs, the trace sub-protocol is activated. It includes three aspects: owner trace, payment trace and drawn trace.
(1) Owner Trace. This aims to discover the account behind a known illegal payment. BANK queries the payment token and sends it to TTP; TTP computes $I = t_o^{(1/x_T)} / t_c$ to trace the owner of the illegal payment.
(2) Payment Trace and Drawn Trace. Payment trace discovers the payment token from a known drawn token: BANK sends $t_o'$ to TTP, and TTP computes $t_c' = t_o'^{(1/x_T)}$ to get the payment token. Drawn trace is the inverse process of payment trace.
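To make the commit/rollback decision of Step 10 concrete, the following minimal Python sketch models the TTP as a two-phase-commit-style coordinator. The class layout and method names are illustrative assumptions, not the authors' implementation; only the message names (Pickup_guarantee, Expiration_Date, Trans_Commit, Rollback) come from the protocol above.

import time

class TTP:
    """Illustrative coordinator for the AFAP payment sub-protocol (Step 10)."""

    def __init__(self, expiration_date):
        self.expiration_date = expiration_date   # deadline fixed in Step 2
        self.pickup_guarantee = False            # set by POSTER's report (Step 9)
        self.rollback_requests = 0               # any participant may object

    def receive_pickup_guarantee(self):
        if time.time() <= self.expiration_date:  # late guarantees are ignored
            self.pickup_guarantee = True

    def request_rollback(self):
        self.rollback_requests += 1

    def decide(self):
        """Commit only if the guarantee arrived in time and nobody objected."""
        if self.pickup_guarantee and self.rollback_requests == 0:
            return "Trans_Commit"   # Step 11: BANK transfers, SELLER dispatches
        return "Rollback"           # otherwise every participant rolls back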
4 Conclusion
With the rapid development of the Internet, commerce on the Internet has broad application prospects. The key to improving e-commerce is to provide secure, privacy-preserving transactions. In this paper we analyze atomicity and fair anonymity and propose a new protocol that realizes them together. Building on this work, we plan improvements in two directions: (1) electronic payment based on group blind signatures; (2) more efficient atomicity.
References
1. Loeb, L.: Secure Electronic Transactions: Introduction and Technical Reference. Artech House, Inc. (1998)
2. Cox, B., Tygar, J.D., Sirbu, M.: NetBill Security and Transaction Protocol. Proceedings of the 1st USENIX Workshop on Electronic Commerce (1995) 77-88
3. Chaum, D., Fiat, A., Naor, M.: Untraceable Electronic Cash. Advances in Cryptology: CRYPTO'88 Proceedings, Springer-Verlag (1990) 200-212
4. von Solms, S., Naccache, D.: On Blind Signatures and Perfect Crimes. Computers and Security 11(6) (October 1992) 581-583
5. Chaum, D., Evertse, J.H., van de Graaf, J., Peralta, R.: Demonstrating Possession of a Discrete Logarithm without Revealing It. Advances in Cryptology: CRYPTO'86 Proceedings, Springer-Verlag (1987) 200-212
6. Brands, S.: Untraceable Off-line Cash in Wallets with Observers. Advances in Cryptology: Proceedings of CRYPTO'93, Lecture Notes in Computer Science 773, Springer-Verlag (1993) 302-318
7. Stadler, M., Piveteau, M.M., Camenisch, J.: Fair Blind Signatures. In: Guillou, L.C., Quisquater, J.J. (eds.) Advances in Cryptology: EUROCRYPT'95, Lecture Notes in Computer Science 921, Springer-Verlag (1995) 209-219
8. Tygar, J.D.: Atomicity versus Anonymity: Distributed Transactions for Electronic Commerce. Proceedings of the 24th VLDB Conference, New York, USA (1998) 1-12
9. Tygar, J.D.: Atomicity in Electronic Commerce. Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing (1996) 8-26
10. Camp, L., Harkavy, M., Tygar, J.D., Yee, B.: Anonymous Atomic Transactions. Proceedings of the 2nd USENIX Workshop on Electronic Commerce (1996) 123-133
11. Tang, J.: The Research of Atomic and Fair Anonymous Electronic Transaction Protocol. YanShan University (2002) 35-59
Colored Petri Net Based Attack Modeling Shijie Zhou, Zhiguang Qin, Feng Zhang, Xianfeng Zhang, Wei Chen, and Jinde Liu College of Computer Science and Engineering University of Electronic Science and Technology of China Sichuan, Chengdu 610054, P.R.China {sjzhou, qinzg, ibmcenter}@uestc.edu.cn
Abstract. A Colored Petri Net (CPN) based attack modeling approach is addressed. The CPN based attack model is flexible enough to model Internet intrusions, including both the static and dynamic features of an intrusion. The processes and rules for building a CPN based attack model from an attack tree are also presented. In order to evaluate the risk of intrusion, cost elements are added to the CPN based attack model. Experience also shows that it is easy to exploit the CPN based attack modeling approach to provide controlling functions.
1 Introduction
Attack modeling is an approach that pictures the processes of attacks and depicts them semantically. An attack modeling approach must not only characterize the whole sequence of attack steps accurately, but also point out how the attacking process continues [2]. Additionally, reasonable response measures should be indicated in a practical intrusion detection and response system. In this paper, we present an approach based on colored Petri nets (CPN) [1] for attack modeling, which derives from other attack modeling methods [2][3][4][5][6][7].
2 From Attack Tree to Colored Petri Net Based Attack Models
The CPN based attack model can be defined from an attack tree to reduce the cost of modeling, because some attack models have already been built with attack trees [2][3]. To build a CPN based attack model from an attack tree, the mapping rules between them must be determined.
Root Nodes Mapping. In attack trees, the root node is the goal of an attack; it is also the result of the attack. In CPN attack models, the root node maps to a place: the node maps to a place, and the node's inputs map to arcs of the place. This kind of place is called a Root Place. The OR gate and AND gate map to event relationships of the CPN, as shown in Fig. 1. A node with an OR gate maps to the conflict relation of events in CPN: if any one of the events occurs, the attack takes place. A node with an AND gate maps to the sequential relation of events: the attack takes place only when all the events occur.
Fig. 1. Root node and its mapping in CPN: (a) OR gate of a root node; (b) AND gate of a root node
Leaf Nodes Mapping. Leaf nodes in attack trees are the hacker's actions for breaking into the victim's system. Clearly, leaf nodes map to transitions of the CPN. But in an attack tree all leaf nodes are connected directly, so a straightforward mapping is difficult: some intermediate states must be defined so that the mapping can be performed (Fig. 2).
Fig. 2. Leaf node and its mapping in CPN
Ruiu's analysis of intrusion [8] divided attacks into seven stages:
Reconnaissance, Vulnerability Identification, Penetration, Control, Embedding, Data Extraction & Modification, and Attack Relay. Each stage can itself be divided into several sub-stages, so we can model attack stages and sub-stages as intermediate states when translating attack trees into CPN based attack models. Fig. 2 shows how to deal with such a translation. These newly added places (including the places added during the translation of intermediate nodes) are called Added Places. In Fig. 2, the value of place p can be derived from a function f(t), where $t \in T$, and the output arc of place p is the input arc of the next transition. The PS place in Fig. 2 is an Initialization Place whose meaning depends on the transition.
Intermediate Nodes Mapping. Intermediate nodes of attack trees are sub-actions or sub-goals of hackers. It is more difficult to translate these nodes into CPN models because intermediate nodes have not only input arc(s), output arc(s) and OR/AND gate logics, but also the same problems confronted in leaf node translation. The mapping rules of intermediate nodes are as follows (see the sketch after this list):
− The intermediate node itself maps to a transition, t, of the CPN.
− The input arc of an intermediate node maps to the input arc of t; OR and AND gates are translated to the conflict relation and the sequence relation, respectively.
− Intermediate places are added in the same way as in the translation of leaf nodes.
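As a rough illustration of these mapping rules, the sketch below translates OR/AND attack tree nodes into CPN places and transitions. For brevity it models leaves as places, whereas the paper maps them to transitions with added intermediate places, so treat it as a simplified assumption rather than the authors' construction.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    name: str
    gate: str = "LEAF"                # "OR", "AND", or "LEAF"
    children: list = field(default_factory=list)

def to_cpn(root):
    """Translate an attack tree into (places, transitions).
    OR gate: one transition per child feeding the node's place (conflict
    relation). AND gate: one transition consuming all child places
    (sequential relation)."""
    places, transitions = [], []

    def visit(node):
        place = f"p_{node.name}"
        places.append(place)
        if node.gate == "OR":         # any single child enables the goal
            for child in node.children:
                transitions.append((f"t_{child.name}", [visit(child)], place))
        elif node.gate == "AND":      # all children are required
            transitions.append(
                (f"t_{node.name}", [visit(c) for c in node.children], place))
        return place

    visit(root)
    return places, transitions

# e.g. an OR of two leaf actions reaching the attack goal:
places, transitions = to_cpn(
    TreeNode("goal", "OR", [TreeNode("exploit"), TreeNode("phish")]))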
Temporal Logic Mapping. In building a template CPN from an attack tree, an important issue is how to deal with the temporal sequence of attacks. From the above discussion, we know that an attack comprises many stages and sub-stages, all of which have temporal logical relations. One occurrence sequence of stages and sub-stages constitutes an intrusion, while all occurrence sequences together comprise the attack tree. The event relations of CPN can depict the temporal logics in an attack tree.
3 Extended CPN Model for Intrusion and Response
An attack tree only depicts the process of an attack. No mechanism in an attack tree allows active response and defense, partly due to the limits of the tree model. CPN based attack modeling, however, can give administrators the means to control the hacker's actions or carry out effective responses. Based on the definition of CPN, a transition can fire only when all its bindings occur. So we can model the defense and response actions as follows: for each transition, an input arc is added to allow control, and an output arc is added to allow response. If there are many control and response actions, additional arcs can likewise be added. Readers should be aware that this model is not derived from the attack tree but extends directly from the CPN attack model, as the snippet below illustrates.
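Continuing the sketch above, the per-transition control and response arcs of this section can be expressed as a small illustrative helper (note that the single output place becomes a list of output places):

def add_control_and_response(transition, control_place, response_place):
    """Add a control input arc (an administrator token is required before the
    transition may fire) and a response output arc (firing emits a response
    action). Works on the (name, inputs, output) tuples produced by to_cpn;
    purely illustrative."""
    name, inputs, output = transition
    return (name, inputs + [control_place], [output, response_place])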
4 Conclusions
The CPN based attack modeling approach can be used to model attacks, and it is easy to map an attack tree into a CPN attack model. After other features are added to this model, it can be used to model intrusion detection and response. Another important feature of this model is that intrusions can be quantified, so the most effective controlling actions can be determined.
References
1. Jensen, K.: Coloured Petri Nets: Basic Concepts, Analysis Methods and Practical Use, Volume 1. Springer-Verlag, Berlin/Heidelberg/London (1992)
2. Tidwell, T., Larson, R., Fitch, K., Hale, J.: Modeling Internet Attacks. Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001, 54-59
3. Cunningham, W.: The WikiWikiWeb. http://c2.com/cgi-bin/wiki
4. Schneier, B.: Attack Trees. Dr. Dobb's Journal of Software Tools 24, 12 (Dec. 1999) 21-29
5. Helmer, G., Wong, J., Slagell, M., Honavar, V., Miller, L., Lutz, R.: A Software Fault Tree Approach to Requirements Analysis of an Intrusion Detection System. Proceedings, Symposium on Requirements Engineering for Information Security, Center for Education and Research in Information Assurance and Security, Purdue University, March 2001
6. Steffan, J., Schumacher, M.: Collaborative Attack Modeling. SAC 2002, Madrid, Spain
7. McDermott, J.: Attack Net Penetration Testing. The 2000 New Security Paradigms Workshop (Ballycotton, County Cork, Ireland, Sept. 2000), ACM SIGSAC, ACM Press, 15-22
8. Ruiu, D.: Cautionary Tales: Stealth Coordinated Attack How To, July 1999. http://www.nswc.navy.mil/ISSEC/CID/Stealth_Coordinated_Attack.html
Intelligent Real-Time Traffic Signal Control Based on a Paraconsistent Logic Program EVALPSN
Kazumi Nakamatsu 1, Toshiaki Seno 2, Jair Minoro Abe 3, and Atsuyuki Suzuki 2
1 School of H.E.P.T., Himeji Inst. Tech., Himeji 670-0092, Japan, [email protected]
2 Dept. Information, Shizuoka University, Hamamatsu 432-8011, Japan, {cs9051,suzuki}@cs.inf.shizuoka.ac.jp
3 Dept. Informatics, Paulista University, Sao Paulo 121204026-002, Brazil, [email protected]
Abstract. In this paper, we introduce an intelligent real-time traffic signal control system based on a paraconsistent logic program called EVALPSN (Extended Vector Annotated Logic Program with Strong Negation), which can deal with contradiction and defeasible deontic reasoning. We show how the traffic signal control is implemented in EVALPSN, taking a simple intersection in Japan as an example. Simulation results comparing EVALPSN traffic signal control with fixed-time traffic signal control are also provided. Keywords: traffic signal control, paraconsistent logic program, intelligent control, defeasible deontic reasoning.
1 Introduction
We have already proposed a paraconsistent logic program called EVALPSN (Extended Vector Annotated Logic Program with Strong Negation) that can deal with contradiction and defeasible deontic reasoning [2,4]. Some applications based on EVALPSN, such as robot action control, automatic safety verification for railway interlocking and air traffic control, have already been introduced in [5,6]. Traffic jams caused by inappropriate traffic signal control are a serious problem that we have to resolve. In this paper, we introduce an intelligent real-time traffic signal control system based on EVALPSN as one proposal to resolve this problem. Suppose that you are waiting at an intersection for the front traffic signal to change from red to green; in your mind you demand the change, and this demand can be regarded as permission for the change. On the other hand, if you are going through the intersection with a green signal, you demand that it stay green, and this demand can be regarded as forbiddance of the change. There is thus a conflict between this permission and forbiddance. The basic idea of the traffic signal control is that the conflict can be managed by the defeasible deontic reasoning of EVALPSN. We show how to formalize the traffic signal control by defeasible deontic reasoning in EVALPSN.
2 Traffic Signal Control in EVALPSN
We take as an example an intersection in which two roads cross, as described in Fig. 1. We suppose an intersection in Japan, which means "cars have to keep left". The intersection has four traffic signals $T_{1,2,3,4}$, which have four kinds of lights: green, yellow, red and right-turn arrow. Each lane connected to the intersection has a sensor to detect the traffic amount; each sensor is denoted $S_i$ ($1 \le i \le 8$) in Fig. 1. For example, sensor $S_6$ detects the right-turn traffic amount confronting traffic signal $T_1$. Basically, the traffic signal control is performed based on the traffic amount sensor values.
Fig. 1. Intersection
The chain of signaling is as follows: → red → green → yellow → right-turn arrow → all red →. For simplicity, we assume that the lengths of the yellow and all-red signaling times are constant; therefore, the signaling times of yellow and all red are supposed to be included in those of green and right-turn arrow, respectively, as follows: $T_{1,2}$: → red → red → green → arrow → red →, and $T_{3,4}$: → green → arrow → red → red → green →. Mainly, only the changes from green to arrow and from arrow to red are controlled. The change from red to green of the front traffic signal follows the change from right-turn arrow to red of the neighboring one. Moreover, the signaling is controlled at each unit time $t \in \{0, 1, 2, \ldots, n\}$.
The traffic amount of each lane is regarded as permission for, or forbiddance from, a signaling change such as green to right-turn arrow. For example, if there are many cars waiting for the signaling change from green to right-turn arrow, this can be regarded as permission for the change; on the other hand, if there are many cars moving through the intersection with green, this can be regarded as forbiddance from the change. We formalize such contradictions and their resolution by defeasible deontic reasoning in EVALPSN. We assume that minimum and maximum signaling times are given in advance for each traffic signal, and each signaling time must be controlled between the minimum and maximum. We consider four states of the traffic signals: state 1 ($T_{1,2}$ red, $T_{3,4}$ green), state 2 ($T_{1,2}$ red, $T_{3,4}$ right-turn arrow), state 3 ($T_{1,2}$ green, $T_{3,4}$ red), and state 4 ($T_{1,2}$ right-turn arrow, $T_{3,4}$ red). Due to space restrictions, we take only state 1 into account to introduce the traffic signal control in EVALPSN.
$S^{rb}_1(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (1)
$S^{rb}_3(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (2)
$S^{rb}_2(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (3)
$S^{rb}_4(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (4)
$S^{rb}_6(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land S'^{rb}_7(t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (5)
$S^{rb}_8(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land S'^{rb}_5(t):[(2,0),\alpha] \land {\sim}MIN_{3,4}(b,t):[(2,0),\alpha] \land {\sim}S^{rb}_5(t):[(2,0),\alpha] \land {\sim}S^{rb}_7(t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\gamma]$, (6)
$S^{rb}_5(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MAX_{3,4}(b,t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\beta]$, (7)
$S^{rb}_7(t):[(2,0),\alpha] \land T_{1,2}(r,t):[(2,0),\alpha] \land T_{3,4}(b,t):[(2,0),\alpha] \land {\sim}MAX_{3,4}(b,t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,1),\beta]$, (8)
$MIN_{3,4}(g,t):[(2,0),\alpha] \land T_{3,4}(g,t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,2),\beta]$, (9)
$MAX_{3,4}(g,t):[(2,0),\alpha] \land T_{3,4}(g,t):[(2,0),\alpha] \to T_{3,4}(a,t):[(0,2),\gamma]$, (10)
$T_{3,4}(g,t):[(2,0),\alpha] \land T_{3,4}(a,t):[(0,1),\gamma] \to T_{3,4}(a,t+1):[(2,0),\beta]$, (11)
$T_{3,4}(g,t):[(2,0),\alpha] \land T_{3,4}(a,t):[(0,1),\beta] \to T_{3,4}(g,t+1):[(2,0),\beta]$. (12)
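As a reading aid, the following minimal Python sketch shows how one control cycle over clauses of this kind could be evaluated for state 1. The boolean sensor encoding, the resolution order and the annotation tuples used here are our illustrative assumptions, not the EVALPSN inference engine itself.

def control_step(sensors, min_green_passed, max_green_passed):
    """sensors: dict like {"S1": True, ...}; returns the next T3,4 state."""
    waiting = any(sensors.get(s) for s in ("S1", "S2", "S3", "S4", "S6", "S8"))
    through = sensors.get("S5") or sensors.get("S7")
    if through and not max_green_passed:
        annotation = ((0, 1), "beta")     # forbiddance, clauses (7)/(8)
    elif waiting and min_green_passed:
        annotation = ((0, 1), "gamma")    # permission, clauses (1)-(6)
    else:
        annotation = None
    if annotation == ((0, 1), "gamma"):
        return "arrow"    # clause (11): obligation T3,4(a,t+1):[(2,0),beta]
    return "green"        # clause (12): obligation T3,4(g,t+1):[(2,0),beta]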
3 Simulation
Suppose that the traffic signals $T_{1,2}$ are red, the traffic signals $T_{3,4}$ are green, and the minimum signaling time of green has already passed.
• If the sensors $S_{1,3,5}$ detect more traffic than the criteria and the sensors $S_{2,4,6,7,8}$ do not at time t, then EVALPSN clause (7) is fired and the forbiddance $T_{3,4}(a,t):[(0,1),\beta]$ is derived; furthermore, clause (12) is also fired and the obligatory result $T_{3,4}(g,t+1):[(2,0),\beta]$ is obtained.
• If the sensors $S_{1,3}$ detect more traffic than the criteria and the sensors $S_{2,4,5,6,7,8}$ do not at time t, then EVALPSN clause (8) is fired and the permission $T_{3,4}(a,t):[(0,1),\gamma]$ is derived; furthermore, clause (11) is also fired and the obligatory result $T_{3,4}(a,t+1):[(2,0),\beta]$ is obtained.
We used a cellular automaton model for traffic flow and compared the EVALPSN traffic signal control with fixed-time traffic signal control in terms of the numbers of cars stopped and moved under the following conditions.
[Condition 1] We suppose that cars flow into the intersection with the following
probabilities from all four directions: right-turn 5%, left-turn 5% and straight 20%; the fixed-time traffic signal control uses green 30, yellow 3, right-arrow 4 and red 40 unit times; the length of green signaling is between 3 and 14 unit times, and the length of right-arrow signaling between 1 and 4 unit times.
[Condition 2] We suppose that cars flow into the intersection with the following probabilities: from the South, right-turn 5%, left-turn 15% and straight 10%; from the North, right-turn 15%, left-turn 5% and straight 10%; from the West, right-turn, left-turn and straight 5% each; from the East, right-turn and left-turn 5% each, and straight 15%. The other conditions are the same as in [Condition 1].
We measured the sums of cars stopped and moved over 1000 unit times and repeated the measurement 10 times under each condition. The average numbers of cars stopped and moved are shown in Table 1. These simulation results show that the number of cars moved under the EVALPSN control is larger than under the fixed-time control, and the number of cars stopped under the EVALPSN control is smaller than under the fixed-time control. Taking only this simulation into account, it is concluded that the EVALPSN control is more efficient than the fixed-time one.

Table 1. Simulation Results

              fixed-time control          EVALPSN control
              cars stopped  cars moved    cars stopped  cars moved
Condition 1   17690         19641         16285         23151
Condition 2   16764         18664         12738         20121
4 Conclusion
In this paper, we have proposed an EVALPSN based real-time traffic signal control system. The practical implementation we are planning relies on the fact that EVALPSN can easily be implemented in microchip hardware, although we have not addressed this in the paper. As future work, we are considering multi-agent intelligent traffic signal control based on EVALPSN.
References
1. Nakamatsu, K., Abe, J.M., Suzuki, A.: Defeasible Reasoning Between Conflicting Agents Based on VALPSN. Proc. AAAI Workshop Agents' Conflicts, AAAI Press (1999) 20-27
2. Nakamatsu, K., Abe, J.M., Suzuki, A.: Annotated Semantics for Defeasible Deontic Reasoning. Proc. the Second International Conference on Rough Sets and Current Trends in Computing, LNAI 2005, Springer-Verlag (2001) 470-478
3. Nakamatsu, K.: On the Relation Between Vector Annotated Logic Programs and Defeasible Theories. Logic and Logical Philosophy, Vol. 8, UMK Press, Poland (2001) 181-205
4. Nakamatsu, K., Abe, J.M., Suzuki, A.: A Defeasible Deontic Reasoning System Based on Annotated Logic Programming. Computing Anticipatory Systems, CASYS2000, AIP Conference Proceedings Vol. 573, AIP Press (2001) 609-620
5. Nakamatsu, K., Abe, J.M., Suzuki, A.: Applications of EVALP Based Reasoning. Logic, Artificial Intelligence and Robotics, Frontiers in Artificial Intelligence and Applications Vol. 71, IOS Press (2001) 174-185
6. Nakamatsu, K., Suito, H., Abe, J.M., Suzuki, A.: Paraconsistent Logic Program Based Safety Verification for Air Traffic Control. Proc. 2002 IEEE Int'l Conf. Systems, Man and Cybernetics (CD-ROM), IEEE (2002)
Transporting CAN Messages over WATM Ismail Erturk Kocaeli University, Faculty of Technical Education, Izmit, 41300 Kocaeli, Turkey [email protected]
Abstract. A new method for fixed wireless CAN networking is presented in this paper, exploiting the advantages of WATM as an over-the-air protocol, which include a fixed, small cell size and connection negotiation. CAN over WATM mapping issues using the encapsulation technique are also explained. The performance analysis of the proposed scheme is presented through computer simulation results obtained with OPNET Modeler.
1 Introduction
CAN (Controller Area Network) has become one of the most advanced and important bus protocols in the communications industry over the last decade. It is now prominently used in many industrial applications, as well as in automotive applications, due to its high performance and superior characteristics [1], [2]. Research into cost-effective solutions to the threatening complexity of, for example, a car's or a factory's wiring harness is a key point in such applications, and for this reason CAN receives many researchers' attention in ongoing industrial projects.
The extensive use of CAN in automotive and other control applications also creates a need for internetworking between CAN and other major public/private networks. To enable CAN nodes to be programmed and/or managed from any terminal controller at any time and anywhere, the requirements of wireless transfer of CAN messages must inevitably be met as well [3], [4]. This idea establishes the basis for interconnecting CAN and WATM (Wireless Asynchronous Transfer Mode), in view of ATM being a universally accepted broadband access technology for transporting real-time multimedia traffic. WATM has appeal as an extension of the local area network for mobile users, for simplifying wiring, or for simplifying reconfiguration, as stated in [5]. A further rationale for this work is that mobile/wireless networking will soon be as common and as broadband as traditional networking.
2 The CAN, ATM, and WATM
Allowing the implementation of peer-to-peer and broadcast or multicast communication functions with lean bus bandwidth use, CAN applications in vehicles are
gradually extended to the machine and automation markets. As CAN semiconductors produced by many different manufacturers are very inexpensive, CAN has found widespread use in such diverse areas as agricultural machinery, medical instrumentation, elevator controls, public transportation systems and industrial automation control components; [1] and [3] supply a detailed overview of the CAN features. CAN uses CSMA/CD as the arbitration mechanism to give its attached nodes access to the bus; consequently, the maximum achievable data rate depends on the bus length. The CAN message format includes 0-8 bytes of variable data and 47 bits of protocol control information (i.e., the identifier, CRC data, acknowledgement and synchronization bits; Fig. 1) [2]. The identifier field serves two purposes: assigning a priority to the transmission and allowing message filtering upon reception.
The evolution of communication has, in its final phase, led digital data transfer technologies to ATM as the basis of the B-ISDN concept. Satisfying the most demanding quality of service (QoS) requirements (e.g., of real-time applications), ATM has become internationally recognized as a way to fully integrate networking systems, referring to the interconnection of all information technologies and telecommunications. ATM networks are inherently connection-oriented (though complex) and allow QoS guarantees. ATM provides superior features including flexibility, scalability, fast switching, and the use of statistical multiplexing to utilize network resources efficiently. In ATM, information is sent in short fixed-length blocks called cells; the flexibility needed to support variable transmission rates is provided by transmitting the necessary number of cells per unit time. An ATM cell is 53 bytes, consisting of a 48-byte information field and a 5-byte header, as presented in Fig. 2. The cell header contains a label (VPI/VCI) that is used in multiplexing and denotes the routing address; it also embodies four other fields, i.e., GFC, PTI, CLP and HEC [2], [4].
Fig. 1. CAN message format (SOF; arbitration field with 11-bit identifier; control field with DLC; data field of 0-8 bytes; CRC field; ACK; EOF)
Fig. 2. ATM cell format (GFC, VPI/VCI, PTI and CLP header fields, HEC header checksum, 48-byte payload)
WATM technology provides wireless broadband access to a fixed ATM network. As a result, it introduces new functions/changes on the physical, data link, and access layers of the ATM protocol stack. The physical layer realizes the actual transmission of data over the physical medium by means of a radio or an optical transmitter/receiver pair. Therefore, the most challenging problem is to overcome multipath reception from stationary and moving objects, resulting in a space- and time-varying
dispersive channel [2], [5]. The data link layer for WATM focuses on encapsulation, header compression, QoS, and ARQ (Automatic Repeat Request) and FEC (Forward Error Correction) techniques. The data link layer scheme proposed in this work aims to provide reliable transfer of cells over wireless point-to-point links, where a large number of errors are likely to occur, using a combination of sliding-window, selective-repeat ARQ and FEC techniques [2], [5]. To provide these new functions, the cell structure in the WATM part of the proposed network differs slightly from that of a regular ATM network, so a simple translation suffices to transfer ATM cells between the wireless and regular ATM networks. This paper presents how CAN messages are transported between fixed wireless CAN nodes using WATM; although there are various competing wireless technologies, only WATM has the advantage of offering end-to-end multimedia capabilities with guaranteed QoS [5].
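The sliding-window, selective-repeat ARQ just mentioned can be sketched as follows. The window size, the channel.transmit interface and the timer handling are illustrative assumptions rather than the paper's DLC implementation, and FEC is omitted.

class SelectiveRepeatSender:
    """Illustrative selective-repeat sender window for a wireless DLC."""

    def __init__(self, window=8):
        self.window = window
        self.base = 0            # oldest unacknowledged cell
        self.next_seq = 0
        self.unacked = {}        # seq -> cell awaiting acknowledgement

    def can_send(self):
        return self.next_seq < self.base + self.window

    def send(self, cell, channel):
        assert self.can_send()
        self.unacked[self.next_seq] = cell
        channel.transmit(self.next_seq, cell)   # hypothetical radio interface
        self.next_seq += 1

    def on_ack(self, seq):
        self.unacked.pop(seq, None)              # only that cell is confirmed
        while self.base not in self.unacked and self.base < self.next_seq:
            self.base += 1                       # slide past confirmed cells

    def on_timeout(self, seq, channel):
        if seq in self.unacked:
            channel.transmit(seq, self.unacked[seq])  # retransmit just one cell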
3 Interconnection of Fixed Wireless CAN Nodes Using WATM
Considering WATM as a matter of access to an ATM network, in this work WATM is regarded as an extension of the ATM LAN for remote CAN users/nodes. The aim is to allow wireless service between two end points (i.e., WATM-enabled CAN nodes) that do not move relative to each other during the lifetime of the connection [2]. The proposed CAN over ATM protocol stack for wireless ATM is shown in Fig. 3. In the CAN-WATM mapping mechanism, it is proposed that at the WATM-enabled CAN nodes the Protocol Data Units (PDUs) of the CAN protocol are encapsulated within those of ATM to be carried over wireless ATM channels. Since a CAN message is 108 bits (0-64 bits of which are variable data), it fits easily into one ATM cell payload; thus neither segmentation/reassembly of CAN messages nor data compression is necessary to carry a CAN message in one ATM cell. At the destination WATM-enabled CAN node, the headers of the ATM cells are stripped off, and the CAN messages extracted from the cell payloads can be processed or passed on to the CAN bus. As indicated in the arbitration fields of the CAN messages, different kinds of multimedia application traffic can take advantage of ATM QoS support: before the actual data transmission takes place through the ATM network, ABR (Available Bit Rate) traffic is multiplexed into AAL3/4 connections, for example, while CBR (Constant Bit Rate) traffic is multiplexed into AAL2 connections.
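As a concrete illustration of this one-message-per-cell encapsulation, the sketch below packs a CAN identifier and data into a single 48-byte payload and unpacks it again. The exact byte layout (2-byte identifier field plus 1-byte DLC) is our assumption for the example; the paper does not specify one.

import struct

ATM_PAYLOAD = 48  # bytes of information field in one ATM cell

def encapsulate(can_id, data):
    """Place one CAN message (11-bit id, 0-8 data bytes) in a cell payload."""
    assert 0 <= len(data) <= 8
    header = struct.pack(">HB", can_id & 0x7FF, len(data))  # id + DLC
    return (header + data).ljust(ATM_PAYLOAD, b"\x00")      # pad to 48 bytes

def decapsulate(payload):
    """Recover the CAN message from a 48-byte cell payload."""
    can_id, dlc = struct.unpack(">HB", payload[:3])
    return can_id, payload[3:3 + dlc]

cell = encapsulate(0x123, b"\x01\x02\x03")
assert decapsulate(cell) == (0x123, b"\x01\x02\x03")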
4 Modeling of the Proposed Scheme and Simulation Results
Fig. 4 shows the simulation model created to evaluate the proposed fixed wireless access of CAN nodes over WATM channels, using the commercially available simulation package OPNET 9.0. Standard OPNET 9.0 modules such as terminals, servers, radio transceivers and ATM switches are used to interconnect WATM-
enabled CAN nodes to ATM terminals over the radio medium [2]. As demonstrated in Fig. 4, the scenario contains eight CAN nodes and two WATM-enabled CAN nodes (these are wireless ATM terminals, i.e., base stations, BS), each with a radio transceiver having access to the access point (AP), which is connected to an ATM switch and an ATM terminal. In this topology, a remote WATM-enabled CAN node gains access to a wired ATM network at the access point, through which it can be connected to the other WATM-enabled CAN node. The BS receives CAN frames containing information from the application traffic sources (i.e., video (V) and data (D)) that are destined for the other BS over the radio medium via the ATM terminal in the wired ATM network. It not only places these frames in ATM cell payloads but also multiplexes the ABR (i.e., D) and VBR (i.e., V) traffic into AAL3/4 and AAL2 connections, respectively. Finally, having inserted into the ATM cells the wireless ATM DLC header, which includes a Non-Control Message bit, an Acknowledgement bit and FEC data, it transmits them from the BS radio transmitter to the point-to-point receiver (the two constitute a radio transceiver) at the AP.
Fig. 3. The proposed CAN over WATM layered system architecture
Fig. 4. Simulation model of the proposed scheme
Preliminary simulation results of the proposed model under varying load characteristics are presented. The simulation parameters are given in Table 1. For the WATM-enabled CAN node application sources D and V, Figures 5 and 6 show the average cell delay (ACD) and cell delay variation (CDV) results as a function of simulation run time, and the ACD results as a function of each CAN node's traffic load, respectively. Since the ATM network offers each application traffic a different, negotiated service, the ACD and CDV results for the two applications differ noticeably. As can be seen from Figure 5, the ACD results vary between 6 ms and 22 ms and between 2 ms and 7 ms for applications D and V, respectively, and the application V traffic experiences almost four times lower CDV
compared to the other. Figure 6 clearly shows that the same increase in the D and V traffic loads results in a better ACD for the latter, as a consequence of WATM's QoS support.
5 Summary
A new scheme for carrying CAN application traffic over WATM is proposed. Considering its wide use in industrial automation control, the CAN protocol inevitably needs wireless internetworking so that its nodes can be controlled and operated remotely with greater flexibility. In this study, two different types of data traffic produced by a WATM-enabled CAN node are transferred to another one over the radio medium using an ATM terminal. CAN over WATM mapping using the encapsulation technique is also presented as an important part of the WATM-enabled CAN and ATM internetworking system. Simulation results show that not only can different CAN application traffic be transmitted over WATM, but the required QoS support can also be provided to the users.

Table 1. Simulation Parameters

Application Source D      5,000* Kbytes/hour
ABR ATM Connection        Peak Cell Rate = 100 Kbps, Minimum Cell Rate = 0.5 Kbps, Initial Cell Rate = 0.5 Kbps
Application Source V      10,000* Kbytes/hour
VBR-nrt ATM Connection    Peak Cell Rate = 100 Kbps, Sustainable Cell Rate = 50 Kbps, Minimum Cell Rate = 20 Kbps
Uplink Bit Rate           50 Kbps
*Produced using Exponential Distribution Function
Fig. 5. Cell delays vs. simulation time (average cell delay and cell delay variation for D and V, in ms)
Fig. 6. Average cell delays vs. load (messages/sec) for D and V
References
1. Lawrenz, W.: CAN System Engineering: From Theory to Practical Applications. Springer-Verlag, New York (1997)
2. Erturk, I.: Remote Access of CAN Nodes Used in a Traffic Control System to a CMU over Wireless ATM. IEEE 4th MWCN, Sweden (Sep. 2002) 626-630
3. Farsi, M., Ratckiff, K., Babosa, M.: An Overview of Controller Area Network. Computing and Control Engineering Journal, Vol. 10, No. 3 (June 1999) 113-120
4. Erturk, I., Stipidis, E.: A New Approach for Real-time QoS Support in IP over ATM Networks. IEICE Trans. on Coms., Vol. E85-B, No. 10 (October 2002) 2311-2318
5. Ayanoglu, E.: Wireless Broadband and ATM Systems. Computer Networks, Vol. 31, Elsevier Science (1999) 395-409
A Hybrid Intrusion Detection Strategy Used for Web Security
Bo Yang 1, Han Li 1, Yi Li 1, and Shaojun Yang 2
1 School of Information Science and Engineering, Jinan University, Jinan 250022, P.R. China, {yangbo,lihan,liyi}@ujn.edu.cn, http://www.ujn.edu.cn
2 Department of Information Industry, Shandong Provincial Government, Jinan 250011, P.R. China, [email protected]
Abstract. This paper describes a novel framework for intrusion detection systems used for Web security. A hierarchical structure is proposed to gain both server-based detection and network-based detection. The system consists of three major components. First, there is a host detection module (HDM) in each web server, consisting of a collection of detection units (DU) running in the background on the host. Second, each subnet has a network detection module (NDM), which operates just like an HDM except that it analyzes network traffic. Finally, there is a central control detection module (CCDM), which serves as a high-level administrative center. The CCDM receives reports from the various HDM and NDM modules and detects intrusions by processing and correlating these reports. Detection rules are inductively learned from audit records and distributed to the detection modules by the CCDM.
1 Introduction
As web-based computer systems play increasingly vital roles in modern society, they have become targets for enemies and criminals. While most web systems attempt to prevent unauthorized use through access control mechanisms such as passwords, encryption and digital signatures, several factors still make it very difficult to keep crackers from eventually gaining entry into a system [1][2]. Since an attack should be considered inevitable, there is an obvious need for mechanisms that can detect crackers attempting to gain entry into a computer system, detect users misusing their system privileges, and monitor the networks connecting all of these systems together.
Intrusion Detection Systems (IDS) are based on the principle that an attack on a web system will be noticeably different from normal system activity. An intruder (possibly masquerading as a legitimate user) is very likely to exhibit a pattern of behavior different from that of a legitimate user. The job of the IDS is to detect these abnormal patterns by analyzing the numerous sources of information provided by the existing systems [3].
Network intrusions such as eavesdropping, illegal remote access, remote break-ins, inserting bogus information into data streams and flooding the network to reduce the effective channel capacity are becoming more common, while monitoring network activity is difficult because of the great proliferation of (un)trusted open networks. We therefore propose a network intrusion detection model based on a hierarchical structure and cooperative agents to detect intrusive behavior on both the web servers and the network, and thus achieve Web security.
2 Model Architecture
The proposed architecture for this intrusion detection system consists of three major components. First, there is a host detection module (HDM) in each web server; this module is a collection of detection units (DU) running in the background on the host. Second, each monitored subnet has a network detection module (NDM), which operates just like an HDM except that it analyzes LAN traffic. Finally, there is a central control detection module (CCDM), which is placed at a single secure location and serves as an administrative center. The CCDM receives reports from the various HDM and NDM modules and, by processing and correlating these reports, is expected to detect intrusions. Fig. 1 shows the structure of the system.
Fig. 1. Architecture of the hybrid detection model
3 Components
3.1 Host Detection Module (HDM)
Each web server (and certain special hosts) contains a collection of detection units (DU) and a host manager (HM). Each DU is responsible for detecting attacks against one of the system's resources; the HM deals with the reports submitted by the DUs and manages and configures them. If a DU detects an attack, it alerts the network administrator and records a system log entry. Otherwise, it assigns a hypothesis value (HV) to the access in question and reports information such as {Process Symbol, User, CPU Occupancy, Memory Spending, HVi} to the HM. The HM broadcasts this information to the other DUs. If another DU also detects this kind of access, it increases its HVi; when HVi > Vlimit, that DU raises an alarm and records a log entry, and otherwise it reports to the HM.
3.2 Network Detection Module (NDM)
The NDM is responsible for monitoring network access, detecting possible attacks and cooperating with other agents to detect intrusive behavior. Its implementation is similar to the HDM's. First, the local detection module examines the data (mostly datagram information) collected from the network. If it detects an attack, it raises an alarm; otherwise it assigns the access an HVi and refers it to the cooperative agent module. The cooperative agent module registers the network information {Source Address, Destination Address, Protocol Type, User Information (for instance, the Telnet user name), Databytes, Packet Header} in an access list and transports it to the relevant cooperative agent modules. If the accumulated HV received from the local agent and other agents exceeds the HV limit, the module regards the access as intrusive behavior, alerts the local agent and transmits the information to the other agents according to the addresses in the access list. Each cooperative agent that receives this information continues to transmit it to further agents besides alerting its local agent. The agents thus form a transmission chain, and every agent in the chain can receive the intrusion information and respond to the intrusion.
3.3 Central Control Detection Module (CCDM)
The CCDM is located in the network center and is monitored by the administrator. By monitoring the data passing through the network access point, it can inspect data coming from and going to the exterior. Since it sits at the top level of all the ID subsystems, it can detect intrusive behavior that the lower-level subsystems cannot, and thus performs intrusion detection more comprehensively and exactly than the other modules.
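The HV accumulation and alarm scheme of Sections 3.1 and 3.2 can be sketched minimally as follows. V_LIMIT and the bookkeeping details are illustrative assumptions, not the authors' implementation.

V_LIMIT = 10.0  # invented threshold standing in for Vlimit

class CooperativeAgent:
    """Illustrative DU/agent that pools hypothesis values with its peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []                 # other cooperating agents
        self.hv = {}                    # access key -> accumulated HV

    def local_detect(self, access_key, local_hv):
        """A detection unit scored a suspicious access on this host."""
        if self._accumulate(access_key, local_hv):
            return                      # alarm already raised locally
        for peer in self.peers:         # let cooperating agents add their HV
            peer.receive_report(access_key, local_hv)

    def receive_report(self, access_key, reported_hv):
        self._accumulate(access_key, reported_hv)

    def _accumulate(self, access_key, hv):
        self.hv[access_key] = self.hv.get(access_key, 0.0) + hv
        if self.hv[access_key] > V_LIMIT:
            print(f"[{self.name}] ALARM on {access_key}")  # alarm + log entry
            return True
        return False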
3.4 Communications Agent
In our model there is a communication agent (TransAgent) on each detection host. The TransAgent is the communication bridge between the local modules and the cooperative modules on other hosts. It records information about the ID modules on the local host and the cooperating hosts, and can also offer routing service for datagrams. When an ID module wants to transmit a datagram, it names the destination ID module and hands the datagram to the TransAgent, which transmits it to the destination agent on the remote host. The communication content has the following format: ::=<Sender><Time>
Because the communication between ID modules is brief, the source ID module can encapsulate the information in a UDP packet and send it to the TransAgent; TCP packets are used for bulk datagram transmission. In addition, XML is used to improve the functionality of the web servers by providing more flexible and adaptable information identification; transmitting datagrams as XML is not affected by differences between operating systems.
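The following sketch illustrates a TransAgent sending one brief report as XML inside a UDP datagram, as suggested above. The element names (report, sender, time, content) and the address are hypothetical, since the paper's full message grammar is not preserved in the source text.

import socket
from xml.etree import ElementTree as ET

def send_report(sender, timestamp, content, addr=("127.0.0.1", 9999)):
    """Encode a brief ID-module report as XML and send it in one UDP packet."""
    msg = ET.Element("report")
    ET.SubElement(msg, "sender").text = sender
    ET.SubElement(msg, "time").text = str(timestamp)
    ET.SubElement(msg, "content").text = content
    data = ET.tostring(msg)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(data, addr)          # brief info fits a single datagram
    finally:
        sock.close()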
4 Conclusion
This paper presented an architecture for a hybrid intrusion detection strategy based on a hierarchical structure and cooperative agents. The target environment is a heterogeneous network that may consist of different hosts, servers and subnets. This strategy is able to: (1) detect attacks on the network itself; (2) detect attacks involving multiple hosts; (3) track tagged objects, including users and sensitive files, as they move around the network; (4) detect, via erroneous or misleading reports, situations where a host might have been taken over by an attacker; and (5) monitor the activity of any networked system that lacks a host monitor yet generates LAN activity, such as a PC. In future work, we will analyze a number of different low-level IP attacks and the vulnerabilities they exploit, and new data mining algorithms will be used to train the detection models automatically. By adding these new features to an IDS implementation, host-based intrusion detection systems can more easily detect and react to low-level network attacks.
References
1. Denning, D.E.: An Intrusion-Detection Model. IEEE Computer Society Symposium on Research in Security and Privacy, Oakland, USA (1987) 118-131
2. Smaha, S.E.: Haystack: An Intrusion Detection System. Proc. of the IEEE Fourth Aerospace Computer Security Application Conference, Orlando, FL, IEEE (1988) 37-44
3. Warrender, C., Forrest, S., Pearlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. Proc. of the 1999 IEEE Symposium on Security and Privacy, IEEE (1999) 133-145
Mining Sequence Pattern from Time Series Based on Inter-relevant Successive Trees Model Haiquan Zeng, Zhan Shen, and Yunfa Hu Computer Science Dept., Fudan University, Shanghai 200433, P.R.China [email protected]
Abstract. In this paper, a novel method is proposed to discover frequent patterns in time series. It first segments the time series based on perceptually important points, then converts the segments into meaningful symbol sequences using the relative slope, and finally uses a new mining model to find frequent patterns. Compared with previous methods, the method is simpler and more efficient.
1 Introduction
Time series are an important kind of data arising in many fields. By analyzing time series we can discover how things change and provide help for decision-making. Mining patterns from time series is an important analysis method that discovers useful patterns appearing frequently (i.e., sequence patterns). Research in this field is attracting more and more attention and has great theoretical and practical value. However, in many existing methods in the literature [1-5], the mined patterns are described by shape, which is difficult to understand and use. Moreover, these methods are generally based on the Apriori algorithm, which has to generate many candidate patterns, degrading mining efficiency. To remedy these problems, we propose a sequence mining method based on the Inter-Relevant Successive Trees model. Experiments show the method is simpler and more efficient.
2 Time Series Segmentation Description
Suppose a time series $S = \langle x_1=(v_1,t_1), \ldots, x_n=(v_n,t_n) \rangle$, where $v_i$ is the value observed at time $t_i$ and $t_{i+1}-t_i = \Delta = 1$ $(i=1,\ldots,n-1)$. To facilitate mining, a time series is often described segment-wise. Although some approaches have been proposed in [3,4,5], they all suffer from heavy computation and fail to grasp the main changing characteristics. In this paper, we segment on the basis of comparatively important points. Important points are observation points that visually dominate the changes of the series; in fact, they are the local minimum or maximum points of the series. Segmentation through important points can effectively eliminate the impact of noise and preserve the main features of the time series' changes.
Definition 1. Given a constant R (R > 1) and a time series $\langle x_1=(v_1,t_1), \ldots, x_n=(v_n,t_n) \rangle$, a data point $x_m$ $(1 \le m \le n)$ is called an important minimum point if it satisfies one of the following conditions: 1. if $1 < m$, …
The shape feature of a linear segment is determined by its slope ratio $\rho_i$ and its length $l_i$ along the time axis. Because the sample intervals of the time series are equal, $l_i$ is a multiple of the sample interval. As for the slope ratio, since the application background of this paper is stock market mining, we adopt the relative slope, i.e., $\rho_i = ((v_{i+1}-v_i)/v_i)/(t_{i+1}-t_i)$, which expresses the rising or falling range per unit interval. This not only removes the inconsistent impact of absolute price ranges but also helps categorize and symbolize the slope effectively using domain knowledge. More generally, suppose $A_j$ $(j=1,\ldots,k)$ are all the change types defined in the practical application; each $A_j$ stands for a certain change type with a real meaning. For the slope $\rho_i$ of a linear segment, if $\rho_i$ lies in the range covered by $A_j$, the segment can be described by the symbol $A_j$. In this way, we can convert any time series C into an ordered symbolic linear-segment series $S^C$.
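To illustrate the symbolization just described, here is a minimal sketch that bins relative slopes into change-type symbols. The three categories and the 1% per-unit threshold are invented for the example; they stand in for the domain-defined types $A_j$.

def relative_slope(v_start, v_end, length):
    """rho_i = ((v_end - v_start) / v_start) / length."""
    return ((v_end - v_start) / v_start) / length

def symbolize(segments, threshold=0.01):
    """Map (v_start, v_end, length) segments to (symbol, length) pairs."""
    symbols = []
    for (v_start, v_end, length) in segments:
        rho = relative_slope(v_start, v_end, length)
        if rho > threshold:
            symbols.append(("U", length))   # rising change type
        elif rho < -threshold:
            symbols.append(("D", length))   # falling change type
        else:
            symbols.append(("F", length))   # flat change type
    return symbols

# e.g. symbolize([(100, 105, 3), (105, 104, 2)]) -> [("U", 3), ("F", 2)]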
3 Mining Based on Inter-Relevant Successive Trees
IRST Model and SIRST Model. The Inter-Relevant Successive Trees (IRST) model is a novel index and storage model for massive full texts, proposed in [6] and organized by symbol order and redundancy. The following are its basic concepts.
Definition 2. In a full-text sequence $S = a_1 a_2 \ldots a_{n-1} a_n$, suppose that $a_i$ occurs at the $i_1$-th, $i_2$-th, …, $i_m$-th positions of S. We call $[a_{i_1+1}, a_{i_2+1}, \ldots, a_{i_m+1}]$ the Successive Expression (SE) of $a_i$, which can be described by a tree; we call this tree the Successive Tree (ST) of symbol $a_i$. Here $a_i$ is the root of the tree, and the pairs $(a_{i_j+1}, tag_j)$ $(j=1,\ldots,m)$ constitute the branch leaf nodes. The symbol $tag_j$ means that the successor of $a_{i_j+1}$ is situated at position $tag_j$ among the branch leaf nodes of the successive tree with root $a_{i_j+1}$.
Definition 3. For a symbol sequence, different symbols correspond to different STs. Every leaf node in these STs corresponds to a root node of another ST; in other words, all the STs corresponding to a symbol sequence are relevant. The
relevancy reflects the relation of the positions in which symbols appear in the sequence. We call all the STs of a symbol sequence its Inter-Relevant Successive Trees (IRST).
Example 1. Consider the symbol sequence "abcabaabc", where we mark the end with "*". Three different symbols appear in the sequence: a, b and c. Their SEs are, respectively, (b,b,a,b), (c,a,c) and (a,*). Figure 1 shows the successive trees; these three STs compose the IRST of "abcabaabc".
Theorem 1. IRST is an equivalent description model of symbol sequences, i.e., for any symbol sequence we can create its IRST, and from any IRST we can restore the original symbol sequence (see [6]).
Fig. 1. IRST of Example 1
Fig. 2. Structure of SIRST: root $Cid_i$ with branch leaf nodes $(length_{i_1}, Cid_{i_1}, tag_1), (length_{i_2}, Cid_{i_2}, tag_2), \ldots, (length_{i_m}, Cid_{i_m}, tag_m)$
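The SE construction of Definition 2 can be checked mechanically; the following minimal sketch reproduces the SEs of Example 1 (the dictionary representation is an illustrative simplification of the tree structure).

def successive_expressions(s, end="*"):
    """Build the SE of every symbol in s, with the end marked by '*'."""
    text = s + end
    se = {}
    for pos, symbol in enumerate(text[:-1]):
        se.setdefault(symbol, []).append(text[pos + 1])
    return se

assert successive_expressions("abcabaabc") == {
    "a": ["b", "b", "a", "b"],
    "b": ["c", "a", "c"],
    "c": ["a", "*"],
}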
Because symbolized time series and symbol sequences have similar properties, we can describe symbolized time series with the IRST model, although such an IRST differs in some details from that of a sequence containing only symbols. We call the IRST of a time series a SIRST (Sequence IRST; see Fig. 2). In Fig. 2, $Cid_i$ and $Cid_{i_j}$ are the change-type symbols of the linear segment slopes, and $length_{i_j}$ $(j=1,\ldots,m)$ is the length of the corresponding linear segment.
Theorem 2. SIRST is an equivalent description model of symbolized time series.
From Theorem 2, it is not difficult to deduce the following two properties.
Property 1. If a linear segment $l_i A_{j_i}$ occurs f times in $S^C$, there are f branch leaf nodes with length $l_i$ in the ST with root $A_{j_i}$, and vice versa.
Property 2. If two neighboring linear segments $(l_i A_{j_i}, l_{i+1} A_{j_{i+1}})$ occur f times in $S^C$, there are at least f leaf nodes containing $A_{j_{i+1}}$ in the ST with root $A_{j_i}$. Suppose $tag_t$ $(t=1,\ldots,s,\ s \ge f)$ are the corresponding successive position symbols of these nodes; then the branches $tag_t$ have f branch leaf nodes with length $l_{i+1}$ in the ST with root $A_{j_{i+1}}$, and vice versa.
Discovering Algorithm Based on Inter-Relevant Successive Trees. The aim of mining frequent patterns is to find change patterns of different lengths whose frequencies are larger than $f_{min}$. Properties 1 and 2 help in discovering frequent patterns: 1) Property 1 discovers frequent patterns that occur at least $f_{min}$ times; 2) Property 2 discovers longer frequent patterns from the tags and Cid entries in the leaf nodes of already discovered frequent patterns. The following are the two kernel steps for mining frequent patterns based on a SIRST:
1. Scan all the branch leaf nodes from the root of each ST in the SIRST and count the occurrences of each segment length. If a length's frequency is larger than $f_{min}$, the linear segment pattern with that length and the root slope is frequent;
737
2. In all the frequent pattern leaf nodes discovered in the above step, select every successive slope type symbol Cid in turns, and all corresponding successive position symbol set tags, then discover longer frequent pattern in STs with root Cid and its corresponding branch nodes set tags. The main feature of the algorithm is that it doesn’t need to generate candidate patterns, it turns the mining process into searching, greatly improves the efficiency.
4 Experiments and Conclusion

Our experimental data was the Dow Jones index price series (Jan 1, 1961 – Dec 31, 1999). As shown in Figure 3, the upper plot illustrates part of the original price series, and the lower plot shows its linear segmentation based on the important points. Figure 4 gives the mining time of the two methods (SIRST and Apriori) under different frequency thresholds; it shows that our method clearly outperforms Apriori.
Fig. 3. Time series and its linear segmentation
Fig. 4. Running time comparison of two methods
The contributions of this paper are: (1) presenting a method of linear segmentation of time series through selecting important points; (2) adopting a description of the linear segments that uses both relative slope and domain knowledge; and (3) proposing, for the first time, a mining method based on Inter-Relevant Successive Trees.
References

1. H. Mannila, H. Toivonen: Discovering Generalized Episodes Using Minimal Occurrences. In: Proceedings of KDD'96, Portland, Oregon. AAAI Press (1996) 146–151
2. G. Das, K. Lin: Rule Discovery from Time Series. In: Proceedings of KDD'98 (1998) 34–43
3. F. Hoppner: Learning Temporal Rules from State Sequences. In: IJCAI Workshop on Learning from Temporal and Spatial Data (2001) 145–153
4. E. J. Keogh, P. Smyth: A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. In: Proceedings of KDD'97 (1997) 24–30
5. S. Park, S. Kim, W. W. Chu: Segment-Based Approach for Subsequence Searches in Sequence Databases. In: Proceedings of the 16th ACM Symposium on Applied Computing (2001) 248–252
6. Yunfa Hu: Inter-Relevant Successive Trees: A Novel Mathematical Model of Full Text Databases. Tech. Rep. CIT-02-3, Computer Science Dept., Fudan University (2002)
Author Index
Abe, Jair Minoro 719 Ahn, Tae-Chon 437 Ahsan, Kamran 471 An, Qiusheng 342 Apostoli, Peter 205
Babri, H.A. 471 Baek, Joong-Hwan 557 Bazan, Jan 181 Bell, David 274, 507 Beynon, Malcolm J. 287, 354 Butz, Cory J. 682, 686 Cai, Yue-Ru 711 Cao, Cungen 329 Chakrabarty, Kankana 374 Chen, Bo 425, 701 Chen, Ke-Jia 640 Chen, Shifu 589 Chen, Wei 715 Chen, Y.Q. 471 Chen, Zhaojiong 269 Chin, K.S. 264 Chun, Seok-Ju 627
Dang, Chuangyin 264 Deng, Dayong 413 Deogun, Jitender 573 Ding, Xiangqian 339 Doherty, Patrick 405 Dong, Cun-xi 652 Dong, Yi-hong 619 Dunphy, Julia R. 453 El-Sayed, Mazen 368 Erturk, Ismail 724 Fan, Hongjian 515 Fan, Ming 515 Fang, Tingjian 334, 363 Feng, H. 213 Gan, Wenyan 603 Gao, Yang 706 Geng, Hongyan 295 Gong, ZhengYu 594 Greco, Salvatore 156 Guan, J.W. 507 Hamilton, Howard J. 686 Han, Jianchao 114 Harms, Sherri 573 Hassan, Yasser 245 Hou, Lishan 138 Hu, Qiang 682 Hu, Xiaohua Tony 114 Hu, Yunfa 734 Hu, Zheng-guo 598 Huang, Houkuan 394, 499, 631, 664 Huang, Jiajing 491 Huang, Jinjie 325 Intan, Rolly 279 Inuiguchi, Masahiro 156
Jee, Dong-kun 462 Jensen, Richard 250 Jiang, Feng 413 Jiang, Ju 466 Jiang, Yuan 640 Jin, Weidong 449 Kamel, Mohamed 466 Kanda, Akira 205 Kang, Seung-Shik 623 Kim, Taiyun 697 Kuo, Tien-Fang 291 Lee, Ju-Hong 462, 627 Lee, Jung-sik 462 Lee, S.K. 697 Lee, Seok-Lyong 627 Li, Ai-Ping 259 Li, Dan 573 Li, Deyi 603 Li, Guozheng 549 Li, Han 730 Li, Hongsong 499, 664 Li, Hongxing 89 Li, Juan-Zi 711 Li, Kai 394 Li, Kenli 484 Li, Liansheng 533 Li, Minqiang 660 Li, Na 449
Li, Ning 706 Li, Qinghua 484 Li, Shiyong 325 Li, Yi 730 Li, Yuancheng 334, 363 Li, Yuefeng 524 Liang, Jiye 264 Liao, Gui-Ping 259 Liau, Churn-Jung 668 Lin, Tsau Young 16, 114, 358, 403 Lin, Youfang 499, 664 Lingras, Pawan 130 Liu, Chunnian 491 Liu, Fuyan 346 Liu, Jianguo 378 Liu, Jinde 715 Liu, Qing 413 Liu, Xiangdong 589 Liu, Yuhai 339 Liu, Yunxiang 382 Liu, Zongtian 533 Lu, Shaoyi 346 Lu, Yi-nan 694 Łukaszewicz, Witold 405 Ma, Chuanxiang 484 Maguire, R. Brien 165, 295 Małuszyński, Jan 197 Man, Chuntao 325 Matei, Razvan 458 Mazlack, Lawrence J. 581 Mi, Ju-Sheng 283 Miao, Duoqian 138 Miao, Zhihong 89 Ming, Liang 386 Mitra, Pabitra 104 Moshkov, Mikhail J. 611 Mukaidono, Masao 279 Murai, T. 421 Murphy, Keri S. 453 Nah, Won 557 Nakamatsu, Kazumi 719 Nakata, M. 421 Nastac, Dumitru-Iulian 458 Nguyen, Ngoc Thanh 565 Nguyen, Tuan Trung 221 Nie, Zan-Kan 398 Oh, Sung-Kwun 437 Ou, Chuangxin 491
Pacholczyk, Daniel 368 Pagliani, Piero 146 Pal, Sankar K. 104 Pan, Quan 350 Pancerz, Krzysztof 299 Pawlak, Zdzislaw 1 Pedrycz, Witold 437 Peters, James F. 25, 213, 303, 437 Polkowski, Lech 70, 106, 255 Qi, Guilin 690 Qin, Zhiguang 715 Qiu, HangPing 594, 644 Quan, Yong 607, 648 Raghavan, Vijay V. 541 Ramanna, Sheela 213, 303 Rao, Xian 652 Resconi, G. 421 Salcedo, Jose J. 453 Sato, Y. 421 Semeniuk-Polkowska, Maria 255 Seno, Toshiaki 719 Sha, Sheng-xian 694 Shen, Junyi 342 Shen, Qiang 250 Shen, Zhan 734 Shi, Fang 644 Shi, Hongbo 631 Shi, Jing 594, 644 Skowron, Andrzej 25, 181, 221, 229 Ślęzak, Dominik 308, 312 Słowiński, Roman 156 Sobecki, Janusz 565 Son, Nguyen Hung 181 Sui, Yuefei 320, 329 Sun, Jigui 382 Sun, Zhaochun 706 Suraj, Zbigniew 299 Suzuki, Atsuyuki 719 Synak, Piotr 229 Szałas, Andrzej 405 Szczuka, Marcin S. 181, 303 Tanaka, Hideo 52 Tang, Jie 711 Tang, Wei 476 Tazaki, Eiichiro 245 Tsumoto, Shusaku 78, 237, 316 Valdés, Julio J. 615 Viegas Damásio, Carlos 197 Vitória, Aida 197
Wang, Bingzheng 515 Wang, Fang 358 Wang, Guoyin 122, 342 Wang, Hui 484 Wang, Jiayin 89 Wang, Ju 320 Wang, Ke-Hong 711 Wang, Sheng-sheng 382, 694 Wang, Shoujue 35 Wang, Xinfeng 378 Wang, Yinlong 386 Wang, Yue 589 Wang, Zhenxiao 549 Wang, Zhihai 631 Wei, Li-Li 173 Wen, Lei 660 West, Chad 130 Wong, S.K.M. 99, 676 Woo, Chong-Woo 623 Wróblewski, Jakub 308 Wu, D. 99, 676 Wu, QingXiang 274 Wu, Quan-Yuan 259 Wu, Wei-Zhi 283 Wu, Yu 122 Xia, Youming 320 Xie, Gang 358 Xie, Guihai 386 Xie, Keming 358 Xie, Ying 541 Xing, Liang-liang 598 Xiong, Fenglan 339 Xu, Jiucheng 342 Yajima, Yasutoshi 291 Yan, Rui 130 Yan, Zhong 445
Yang, Bo 730 Yang, Jie 549, 607, 648 Yang, Miin-shen 390 Yang, Shao-quan 652 Yang, Shaojun 730 Yang, Xue Dong 682 Yao, Hong 686 Yao, J.T. 430 Yao, Y.Y. 44, 165, 676 Ye, Chenzhou 648 Ye, Dongyi 269 Yi, Yao 635 Yu, Jian 390, 394 Yuan, Chun-Wie 445 Zeng, Haiquan 734 Zeng, Huanglin 635 Zhang, Bo 11 Zhang, Chunlin 466 Zhang, Feng 715 Zhang, Gexiang 449 Zhang, Hongcai 350 Zhang, Ling 11 Zhang, Qing 533 Zhang, Wei-Guo 398 Zhang, Wen-Xiu 173, 283 Zhang, Xianfeng 715 Zhang, Zaiyue 329 Zhao, Deyong 378 Zhao, Y. 165 Zhao, Yongqiang 350 Zheng, Ji-chuan 598 Zheng, Zheng 122 Zhong, Ning 491 Zhong, Y.X. 60 Zhou, Mingtian 425, 701 Zhou, Shijie 715 Zhou, Zhi-Hua 476, 640 Zhu, Meilin 589 Ziarko, Wojciech 189, 312