Asli Celikyilmaz and I. Burhan Türksen Modeling Uncertainty with Fuzzy Logic
Studies in Fuzziness and Soft Computing, Volume 240

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw, Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 224. Rafael Bello, Rafael Falcón, Witold Pedrycz, Janusz Kacprzyk (Eds.): Contributions to Fuzzy and Rough Sets Theories and Their Applications, 2008. ISBN 978-3-540-76972-9
Vol. 225. Terry D. Clark, Jennifer M. Larson, John N. Mordeson, Joshua D. Potter, Mark J. Wierman: Applying Fuzzy Mathematics to Formal Models in Comparative Politics, 2008. ISBN 978-3-540-77460-0
Vol. 226. Bhanu Prasad (Ed.): Soft Computing Applications in Industry, 2008. ISBN 978-3-540-77464-8
Vol. 227. Eugene Roventa, Tiberiu Spircu: Management of Knowledge Imperfection in Building Intelligent Systems, 2008. ISBN 978-3-540-77462-4
Vol. 228. Adam Kasperski: Discrete Optimization with Interval Data, 2008. ISBN 978-3-540-78483-8
Vol. 229. Sadaaki Miyamoto, Hidetomo Ichihashi, Katsuhiro Honda: Algorithms for Fuzzy Clustering, 2008. ISBN 978-3-540-78736-5
Vol. 230. Bhanu Prasad (Ed.): Soft Computing Applications in Business, 2008. ISBN 978-3-540-79004-4
Vol. 231. Michal Baczynski, Balasubramaniam Jayaram: Fuzzy Implications, 2008. ISBN 978-3-540-69080-1
Vol. 232. Eduardo Massad, Neli Regina Siqueira Ortega, Laécio Carvalho de Barros, Claudio José Struchiner: Fuzzy Logic in Action: Applications in Epidemiology and Beyond, 2008. ISBN 978-3-540-69092-4
Vol. 233. Cengiz Kahraman (Ed.): Fuzzy Engineering Economics with Applications, 2008. ISBN 978-3-540-70809-4
Vol. 234. Eyal Kolman, Michael Margaliot: Knowledge-Based Neurocomputing: A Fuzzy Logic Approach, 2009. ISBN 978-3-540-88076-9
Vol. 235. Kofi Kissi Dompere: Fuzzy Rationality, 2009. ISBN 978-3-540-88082-0
Vol. 236. Kofi Kissi Dompere: Epistemic Foundations of Fuzziness, 2009. ISBN 978-3-540-88084-4
Vol. 237. Kofi Kissi Dompere: Fuzziness and Approximate Reasoning, 2009. ISBN 978-3-540-88086-8
Vol. 238. Atanu Sengupta, Tapan Kumar Pal: Fuzzy Preference Ordering of Interval Numbers in Decision Problems, 2009. ISBN 978-3-540-89914-3
Vol. 239. Baoding Liu: Theory and Practice of Uncertain Programming, 2009. ISBN 978-3-540-89483-4
Vol. 240. Asli Celikyilmaz, I. Burhan Türksen: Modeling Uncertainty with Fuzzy Logic, 2009. ISBN 978-3-540-89923-5
Asli Celikyilmaz and I. Burhan Türksen
Modeling Uncertainty with Fuzzy Logic
With Recent Theory and Applications
Authors

Prof. I. Burhan Türksen
University of Toronto
Mechanical & Industrial Engineering
170 College St., Haultain Building
Toronto, Ontario M5S 3G8, Canada
E-mail: [email protected]

and

TOBB Economy and Technology University
Mühendislik Fakültesi
Endüstri Mühendisligi Bölümü
Sögütözü Caddesi No. 43
06560 Ankara, Turkey
E-mail: [email protected]

Dr. Asli Celikyilmaz
University of California, Berkeley
BISC – The Berkeley Initiative in Soft Computing
Electrical Eng. and Computer Sciences Department
415 Soda Hall
Berkeley, CA 94709-7886, USA
E-mail: [email protected], [email protected], [email protected]
ISBN 978-3-540-89923-5
e-ISBN 978-3-540-89924-2
DOI 10.1007/978-3-540-89924-2
Studies in Fuzziness and Soft Computing ISSN 1434-9922
Library of Congress Control Number: 2008941674

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com
To:
Fuzzy Logic Research
Fuzzy Data Mining Society
and Prof. L.A. Zadeh
Foreword
The world we live in is pervaded with uncertainty and imprecision. Is it likely to rain this afternoon? Should I take an umbrella with me? Will I be able to find parking near the campus? Should I go by bus? Such simple questions are a common occurrence in our daily lives. Less simple examples: What is the probability that the price of oil will rise sharply in the near future? Should I buy Chevron stock? What are the chances that a bailout of GM, Ford and Chrysler will not succeed? What will be the consequences? Note that the examples in question involve both uncertainty and imprecision. In the real world, this is the norm rather than the exception.

There is a deep-seated tradition in science of employing probability theory, and only probability theory, to deal with uncertainty and imprecision. The monopoly of probability theory came to an end when fuzzy logic made its debut. However, this is by no means a widely accepted view. The belief persists, especially within the probability community, that probability theory is all that is needed to deal with uncertainty. To quote a prominent Bayesian, Professor Dennis Lindley, "The only satisfactory description of uncertainty is probability. By this I mean that every uncertainty statement must be in the form of a probability; that several uncertainties must be combined using the rules of probability; and that the calculus of probabilities is adequate to handle all situations involving uncertainty…probability is the only sensible description of uncertainty and is adequate for all problems involving uncertainty. All other methods are inadequate…anything that can be done with fuzzy logic, belief functions, upper and lower probabilities, or any other alternative to probability can better be done with probability." What can be said about such views is that they reflect unfamiliarity with fuzzy logic. The book "Modeling Uncertainty with Fuzzy Logic," co-authored by Dr. A. Celikyilmaz and Professor I.B. Turksen, may be viewed as a convincing argument to the contrary. In effect, what this book documents is that in the realm of uncertainty and imprecision fuzzy logic has much to offer.

There are many misconceptions about fuzzy logic. Fuzzy logic is not fuzzy. Like traditional logical systems and probability theory, fuzzy logic is precise. However, there is an important difference. In fuzzy logic, the objects of discourse are allowed to be much more general and much more complex than the objects of discourse in traditional logical systems and probability theory. In particular, fuzzy logic provides many more tools for dealing with second-order uncertainty, that is, uncertainty about uncertainty, than those provided by probability theory. Imprecise probabilities, fuzzy sets of Type 2 and vagueness are instances of second-order uncertainty. In many real-world settings, and especially in the context of decision analysis, the complex issue of second-order uncertainty has to be addressed. At this juncture, decision-making under second-order uncertainty is far from being well understood.

"Modeling Uncertainty with Fuzzy Logic" begins with an exposition of the basics of fuzzy set theory and fuzzy logic. In this part of the book, as well as in other parts, there is much that is new and unconventional. Particularly worthy of note is the authors' extensive use of the formalism of so-called fuzzy functions as an alternative to the familiar formalism of fuzzy if-then rules. The formalism of fuzzy functions was introduced by Professor M. Demirci about a decade ago, and in recent years was substantially extended by the authors. The authors employ their version of the formalism to deal with fuzzy sets of Type 2, that is, fuzzy sets with fuzzy grades of membership.

To understand the authors' approach, it is helpful to introduce what I call the concept of cointension. Informally, cointension is a measure of the closeness of fit of a model to the object of modeling. A model is cointensive if its degree of cointension is high. In large measure, scientific progress is driven by a quest for cointensive models of reality. In the context of modeling with fuzzy logic, the use of fuzzy sets of Type 2 makes it possible to achieve higher levels of cointension. The price is higher computational complexity. On balance, the advantages of using fuzzy sets of Type 2 outweigh the disadvantages. For this reason, modeling with fuzzy sets of Type 2 is growing in visibility and importance.

A key problem in applications of fuzzy logic is that of construction of the membership function of a fuzzy set. There are three principal approaches. In the declarative approach, membership functions are specified by the designer of a system. This is the approach that is employed in most of the applications of fuzzy logic in the realms of industrial control and consumer products. In the computational approach, the membership function of a fuzzy set is expressed as a function of the membership functions of one or more fuzzy sets with specified membership functions. In the modelization/elicitation approach, membership functions are computed through the use of cointension-enhancement techniques. In using such techniques, successive outputs of a model are compared with desired outputs, and parameters in membership functions are adjusted to maximize cointension. For this purpose, the authors make skillful use of a wide variety of techniques centering on cluster analysis, pattern classification and evolutionary algorithms. They employ simulation to validate their results. In sum, the authors develop an effective approach to modeling of uncertainty using fuzzy sets of Type 2 employing various parameter-identification formalisms.

"Modeling Uncertainty with Fuzzy Logic" is an important contribution to the development of a better understanding of how to deal with second-order uncertainty.
The issue of second-order uncertainty has received relatively little attention so far, but its intrinsic importance is certain to make it an object of growing attention in coming years. Through their work, the authors have opened a door to wide-ranging applications. They deserve our compliments and congratulations.
Berkeley, CA November 30, 2008
Lotfi A. Zadeh
Preface
A representation of a system with a model, and an implementation of that model to reason with and provide solutions, are central to many disciplines of science. An essential concern of system models is to establish the relationships between system input variables and output variables. For complex systems, conventional modeling approaches usually do not perform well, because it is difficult to find a global function or analytic structure for such systems. In this regard, fuzzy system modeling (FSM) – the construction of representations of fuzzy system models with fuzzy techniques and theories – provides an efficient alternative that has proven quite successful.

In one of his latest works¹, Professor Lotfi A. Zadeh describes the remarkable capabilities of fuzzy logic as follows: "...Fuzzy logic may be viewed as an attempt at formalization/mechanization of two remarkable human capabilities. First, the capability to converse, reason and make rational decisions in an environment of imprecision, uncertainty, incompleteness of information, conflicting information, partiality of truth and partiality of possibility – in short, in an environment of imperfect information. And second, the capability to perform a wide variety of physical and mental tasks without any measurements and any computations."

¹ L.A. Zadeh, "Is there a need for fuzzy logic?", Information Sciences, vol. 178, issue 13, July 2008.

These capabilities of fuzzy logic open possibilities for a wide range of theoretical and applied problem domains. In spite of this success, implementing fuzzy techniques and theories to represent complex systems and build fuzzy system models has been a difficult task, since it requires the identification of many parameters. For instance, an important problem in the development of fuzzy system models is the generation of fuzzy if-then rules. Such rules may be constructed by extracting knowledge from human experts and building suitable membership functions. However, information supplied by humans suffers from certain serious problems. First, human knowledge is usually incomplete or poorly organized, since different experts usually make different decisions; even the same expert may give different interpretations of the same observation at different times. Furthermore, knowledge acquisition from experts is neither systematic nor efficient. These problems have led researchers to build automated algorithms for modeling systems with fuzzy theories via machine learning and data mining techniques.

With these problems in hand, in this book we propose novel fuzzy-modeling approaches to remedy such deficiencies. We mainly focus on algorithms based on the novel fuzzy functions method. The new fuzzy functions approach employs membership values differently from any other fuzzy system model implemented to date. The membership values can be thought of as 'atoms' that hold potential information about the system behaviour, waiting to be activated so that their power is released. The potential information carried by the membership values is captured in local fuzzy functions, which serve as predictors of the system behaviour. Instead of, and in place of, fuzzy if-then rule base structures, fuzzy functions are implemented to build models of a given system. This book presents the essential alternatives of the fuzzy functions approaches and their advancements. The aim is to build more powerful fuzzy models via autonomous approaches that can identify hidden structures through the optimization of their system parameters.

The term "fuzzy functions" has been used by researchers to mean different things. The building blocks of fuzzy set theory were proposed by Professor L.A. Zadeh in 1965, especially the fuzzy extensions of classical basic notions such as logical connectives, quantifiers, inference rules, relations, and arithmetic operations; these constitute the initial definitions of fuzzy functions. Later, different forms of fuzzy functions were presented. In 1969, Marinos introduced well-known conventional switching theory techniques into the design of fuzzy logic systems, based on Professor L.A. Zadeh's fuzzy set theory and fuzzy operations. In their 1972 paper, Siy and Chen explored the simplification of fuzzy functions, defining them as the polynomials formed by possible combinations of operations on fuzzy sets; hence, fuzzy functions were defined as relations between fuzzy variables. Other researchers have defined fuzzy functions as a special case of fuzzy relations: Sasaki in 1993, and later Demirci in 1999, defined fuzzy functions as special fuzzy relations and explored their mathematical representations. The fuzzy functions we introduce in this book are thus different from the latter fuzzy functions. The idea of the fuzzy functions of this study emerged from the idea of representing each unique rule of a fuzzy rule base in terms of a function.

The main structure identification tools of this book capture the hidden structures via pattern recognition methods. We also present a new improved fuzzy clustering algorithm that helps identify the relationships between the input variables and the output variable in local models, mainly through regression-type system development techniques. In addition, we address classification problem domains by building multiple fuzzy classifiers for each hidden pattern identified with the improved fuzzy clustering method, and we present a new fuzzy cluster validity method to demonstrate how well the presented methodologies fit the estimation approaches.
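To make the fuzzy functions idea above concrete, the following minimal sketch illustrates, under simplifying assumptions, how membership values can act as extra predictors of local models. It is not the exact algorithm developed in this book: it uses a plain fuzzy c-means step, appends each cluster's membership values to a per-cluster least-squares fit, and blends the local predictions by membership. The function names, the toy data, and the membership-based sample weighting are illustrative choices.

    # Illustrative sketch only (Python/NumPy), not the book's exact method:
    # membership values from fuzzy c-means act as extra predictors of
    # per-cluster "fuzzy functions"; predictions are membership-weighted.
    import numpy as np

    def fcm(X, c=2, m=2.0, iters=100, seed=0):
        """Plain fuzzy c-means; returns cluster centers and memberships U."""
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(iters):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / d ** (2.0 / (m - 1.0))
            U /= U.sum(axis=1, keepdims=True)
        return centers, U

    # Toy data: the output depends on x differently in two regions.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)
    y = np.where(x < 5, 2.0 * x, 20.0 - x) + rng.normal(0, 0.3, 200)
    X = x[:, None]

    _, U = fcm(X, c=2)
    coefs = []
    for i in range(U.shape[1]):
        # One local "fuzzy function" per cluster, fit on [1, x, u_i]:
        # the membership value u_i itself is an additional predictor.
        A = np.column_stack([np.ones_like(x), x, U[:, i]])
        w = np.sqrt(U[:, i])                # weight samples by membership
        coefs.append(np.linalg.lstsq(A * w[:, None], y * w, rcond=None)[0])

    # Inference: membership-weighted combination of the local predictions.
    y_hat = sum(U[:, i] * (np.column_stack([np.ones_like(x), x, U[:, i]]) @ coefs[i])
                for i in range(U.shape[1]))
    print("RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))

Even in this toy form the essential design choice is visible: the clustering supplies soft regime labels, and those labels enter the local regressions as data rather than as antecedents of if-then rules.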
Later in the book, we incorporate one of the soft computing tools to optimize the parameters of the fuzzy function models. We focus on a novel evolutionary fuzzy functions approach: the design of "improved fuzzy functions" system models with the use of evolutionary algorithms. Evolutionary algorithms are a broad class of stochastic optimization tools inspired by biology. An evolutionary algorithm maintains a population of candidate solutions for the problem at hand and makes it evolve by iteratively applying a set of stochastic operators, known as mutation, recombination, and selection. Given enough time, the resulting process tends to find globally optimal solutions to the problem, much in the same way as populations of organisms in nature tend to adapt to their surrounding environment. Hence, using evolutionary algorithms as the optimization tool, the local structures of a given database are identified with the new improved fuzzy clustering method and represented with novel "fuzzy functions".

Although it has been investigated for many years now, the problem of uncertainty modeling is yet to be satisfactorily resolved in the system modeling community. In engineering problems, building reliable models depends on the identification of important values of the variables of model equations. In real-life cases, however, these important values may not be obtainable, due to the imprecise, noisy, vague, or incomplete nature of the available information. The goal of this book is to build an uncertainty modeling architecture to represent and handle the uncertainty in the parameters and structure of the fuzzy functions, so as to capture the most available information.

The uncertainty in systems can be captured with higher-order fuzzy sets, viz. interval valued type-2 fuzzy sets, which were first introduced by Professor Lotfi A. Zadeh. Type-2 fuzzy systems implement type-2 fuzzy sets to capture the higher-order imprecision inherent in systems. In particular, this book introduces the formation of type-2 fuzzy functions to capture uncertainties associated with system behaviour.

The central contribution of this work is to expose the fuzzy functions approach and enhance it to capture imprecision in setting system model parameters by constructing a new uncertainty modeling tool. Replacing standard fuzzy rule bases with the new improved fuzzy functions succeeds in capturing essential relationships during structure identification, and overcomes limitations exhibited by earlier fuzzy inference systems based on if-then rule base methods, which face an abundance of fuzzy operations and hence the difficulty of choosing among t-norms and t-conorms and among methods of fuzzification and defuzzification. Designing an autonomous and robust fuzzy system model, and reasoning with it, is the prime goal of this approach. This new fuzzy system modeling (FSM) approach implements higher-level fuzzy sets to identify the uncertainties in (1) the system parameters and (2) the structure of the fuzzy functions. With the identification of these parameters, interval valued fuzzy sets and fuzzy functions are identified. Finally, an evolutionary computing approach is combined with the proposed uncertainty identification strategy to build fuzzy system models that can automatically identify these uncertainty intervals.
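As a rough, self-contained illustration of the evolutionary loop described earlier in this preface, the sketch below evolves a small population of candidate parameter vectors by recombination, mutation, and selection. The fitness function here is a toy stand-in; in the book's setting it would be replaced by a model-evaluation step, for example the validation error of an improved fuzzy functions model for a candidate fuzziness degree m and regularization constant C, both of which are named here only for illustration.

    # A generic evolutionary loop (sketch only): candidate solutions are
    # parameter vectors evolved by recombination, mutation, and selection.
    import random

    def fitness(params):
        # Toy objective with a single peak; a real application would train
        # and validate a model for these parameters and return its score.
        m, c = params
        return -((m - 1.8) ** 2 + (c - 10.0) ** 2)

    def mutate(p, scale=0.3):
        # Gaussian perturbation, clipped to keep parameters in valid ranges.
        return [max(1.01, p[0] + random.gauss(0, scale)),
                max(0.0, p[1] + random.gauss(0, scale * 10))]

    def recombine(a, b):
        # Uniform crossover: each gene is taken from one of the two parents.
        return [random.choice(pair) for pair in zip(a, b)]

    random.seed(0)
    pop = [[random.uniform(1.1, 3.0), random.uniform(0.1, 100.0)] for _ in range(20)]
    for _ in range(50):
        pop.sort(key=fitness, reverse=True)   # selection: keep the fittest half
        parents = pop[:10]
        children = [mutate(recombine(random.choice(parents), random.choice(parents)))
                    for _ in range(10)]
        pop = parents + children
    best = max(pop, key=fitness)
    print("best (m, C):", [round(v, 2) for v in best])

Given enough generations, the population concentrates near the optimum of the fitness landscape; this is the behaviour exploited when tuning the parameters of the fuzzy function models.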
After testing the new fuzzy functions tools on various benchmark problems, the algorithms are successfully applied to model decision processes in two real problem domains: the desulphurization process in steel making and stock price prediction. For both of these problems, the proposed methods produce robust, high-performance results that are comparable to, if not better than, the best system modeling approaches known in the current literature. Several aspects of the proposed methodologies are thoroughly analyzed to provide a deeper understanding; these analyses show the consistency of the results.

As a final note, the fuzzy modeling approaches demonstrated in this book may supply a suitable framework for engineers and practitioners for the design of complex systems with fuzzy functions instead of crisp functions, that is to say, for the design of artificial systems based on well-known function types, viz. regression, classification, etc. Although this book shows examples only from economics and the production industry, we believe that fuzzy modeling can and should be utilized in many fields of science, including biology, economics, psychology, sociology, history and more. In such fields, many models exist in the current research literature, and they can be directly transformed into mathematical models using the methodologies presented herein.
September 2008
Asli Celikyilmaz
University of California, Berkeley, USA

I. Burhan Turksen
University of Toronto, Canada
TOBB Economy and Technology University, Turkey
Contents

1 Introduction
   1.1 Motivation
   1.2 Contents of the Book
   1.3 Outline of the Book

2 Fuzzy Sets and Systems
   2.1 Introduction
   2.2 Type-1 Fuzzy Sets and Fuzzy Logic
      2.2.1 Characteristics of Fuzzy Sets
      2.2.2 Operations on Fuzzy Sets
   2.3 Fuzzy Logic
      2.3.1 Structure of Classical Logic Theory
      2.3.2 Relation of Set and Logic Theory
      2.3.3 Structure of Fuzzy Logic
      2.3.4 Approximate Reasoning
   2.4 Fuzzy Relations
      2.4.1 Operations on Fuzzy Relations
      2.4.2 Extension Principle
   2.5 Type-2 Fuzzy Sets
      2.5.1 Type-2 Fuzzy Sets
      2.5.2 Interval Valued Type-2 Fuzzy Sets
      2.5.3 Type-2 Fuzzy Set Operations
   2.6 Fuzzy Functions
   2.7 Fuzzy Systems
   2.8 Extensions of Takagi-Sugeno Fuzzy Inference Systems
      2.8.1 Adaptive-Network-Based Fuzzy Inference System (ANFIS)
      2.8.2 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS)
      2.8.3 Genetic Fuzzy Systems (GFS)
   2.9 Summary

3 Improved Fuzzy Clustering
   3.1 Introduction
   3.2 Fuzzy Clustering Algorithms
      3.2.1 Fuzzy C-Means Clustering Algorithm
      3.2.2 Classification of Objective Based Fuzzy Clustering Algorithms
      3.2.3 Fuzzy C-Regression Model (FCRM) Clustering Algorithm
      3.2.4 Variations of Combined Fuzzy Clustering Algorithms
   3.3 Improved Fuzzy Clustering Algorithm (IFC)
      3.3.1 Motivation
      3.3.2 Improved Fuzzy Clustering Algorithm for Regression Models (IFC)
      3.3.3 Improved Fuzzy Clustering Algorithm for Classification Models (IFC-C)
      3.3.4 Justification of Membership Values of the IFC Algorithm
   3.4 Two New Cluster Validity Indices for IFC and IFC-C
      3.4.1 Overview of Well-Known Cluster Validity Indices
      3.4.2 The New Cluster Validity Indices
      3.4.3 Simulation Experiments [Celikyilmaz and Turksen, 2007i; 2008c]
      3.4.4 Discussions on Performances of New Cluster Validity Indices Using Simulation Experiments
   3.5 Summary

4 Fuzzy Functions Approach
   4.1 Introduction
   4.2 Motivation
   4.3 Proposed Type-1 Fuzzy Functions Approach Using FCM – T1FF
      4.3.1 Structure Identification of FF for Regression Models (T1FF)
      4.3.2 Structure Identification of the Fuzzy Functions for Classification Models (T1FF-C)
      4.3.3 Inference Mechanism of T1FF for Regression Models
      4.3.4 Inference Mechanism of T1FF for Classification Models
   4.4 Proposed Type-1 Improved Fuzzy Functions with IFC – T1IFF
      4.4.1 Structure Identification of T1IFF for Regression Models
      4.4.2 Structure Identification of T1IFF-C for Classification Models
      4.4.3 Inference Mechanism of T1IFF for Regression Problems
      4.4.4 Inference with T1IFF-C for Classification Problems
   4.5 Proposed Evolutionary Type-1 Improved Fuzzy Function Systems
      4.5.1 Genetic Learning Process: Genetic Tuning of Improved Membership Functions and Improved Fuzzy Functions
      4.5.2 Inference Method for ET1IFF and ET1IFF-C
      4.5.3 Reduction of Structure Identification Steps of T1IFF Using the Proposed ET1IFF Method
   4.6 Summary

5 Modeling Uncertainty with Improved Fuzzy Functions
   5.1 Motivation
   5.2 Uncertainty
   5.3 Conventional Type-2 Fuzzy Systems
      5.3.1 Generalized Type-2 Fuzzy Rule Bases Systems (GT2FRB)
      5.3.2 Interval Valued Type-2 Fuzzy Rule Bases Systems (IT2FRB)
      5.3.3 Most Common Type-Reduction Methods
      5.3.4 Discrete Interval Type-2 Fuzzy Rule Bases (DIT2FRB)
   5.4 Discrete Interval Type-2 Improved Fuzzy Functions
      5.4.1 Background of Type-2 Improved Fuzzy Functions Approaches
      5.4.2 Discrete Interval Type-2 Improved Fuzzy Functions System (DIT2IFF)
   5.5 The Advantages of Uncertainty Modeling
   5.6 Discrete Interval Type-2 Improved Fuzzy Functions with Evolutionary Algorithms
      5.6.1 Motivation
      5.6.2 Architecture of the Evolutionary Type-2 Improved Fuzzy Functions
      5.6.3 Reduction of Structure Identification Steps of DIT2IFF Using the New EDIT2IFF Method
   5.7 Summary

6 Experiments
   6.1 Experimental Setup
      6.1.1 Overview of Experiments
      6.1.2 Three-Way Sub-sampling Cross Validation Method
      6.1.3 Measuring Models' Prediction Performance
         6.1.3.1 Performance Evaluations of Regression Experiments
         6.1.3.2 Performance Evaluations of Classification Experiments
   6.2 Parameters of Benchmark Algorithms
      6.2.1 Support Vector Machines (SVM)
      6.2.2 Artificial Neural Networks (NN)
      6.2.3 Adaptive-Network-Based Fuzzy Inference System (ANFIS)
      6.2.4 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS)
      6.2.5 Discrete Interval Valued Type-2 Fuzzy Rule Base (DIT2FRB)
      6.2.6 Genetic Fuzzy System (GFS)
      6.2.7 Logistic Regression (LR), Fuzzy K-Nearest Neighbor (FKNN)
   6.3 Parameters of Proposed Fuzzy Functions Algorithms
      6.3.1 Fuzzy Functions Methods
      6.3.2 Improved Fuzzy Functions Methods
   6.4 Analysis of Experiments – Regression Domain
      6.4.1 Friedman's Artificial Domain
      6.4.2 Auto-mileage Dataset
      6.4.3 Desulphurization Process Dataset
      6.4.4 Stock Price Analysis
      6.4.5 Proposed Fuzzy Cluster Validity Index Analysis for Regression
   6.5 Analysis of Experiments – Classification (Pattern Recognition) Domains
      6.5.1 Classification Datasets from UCI Repository
      6.5.2 Classification Dataset from StatLib
      6.5.3 Results from Classification Datasets
      6.5.4 Proposed Fuzzy Cluster Validity Index Analysis for Classification
      6.5.5 Performance Comparison Based on Elapsed Times
   6.6 Overall Discussions on Experiments
      6.6.1 Overall Comparison of System Modeling Methods on Regression Datasets
      6.6.2 Overall Comparison of System Modeling Methods on Classification Datasets
   6.7 Summary of Results and Discussions

7 Conclusions and Future Work
   7.1 General Conclusions
   7.2 Future Work

References

Appendix
   A.1 Set and Logic Theory – Additional Information
   A.2 Fuzzy Relations (Composition) – An Example
   B.1 Proof of Fuzzy c-Means Clustering Algorithm
   B.2 Proof of Improved Fuzzy Clustering Algorithm
   C.1 Artificial Neural Networks (ANNs)
   C.2 Support Vector Machines
   C.3 Genetic Algorithms
   C.4 Multiple Linear Regression Algorithms with Least Squares Estimation
   C.5 Logistic Regression
   C.6 Fuzzy K-Nearest Neighbor Approach
   D.1 T-Test Formula
   D.2 Friedman's Artificial Dataset: Summary of Results
   D.3 Auto-mileage Dataset: Summary of Results
   D.4 Desulphurization Dataset: Summary of Results
   D.5 Stock Price Datasets: Summary of Results
   D.6 Classification Datasets: Summary of Results
   D.7 Cluster Validity Index Graphs
   D.8 Classification Datasets – ROC Graphs
List of Tables

Table 2.1  Some well known t-norms and t-conorms
Table 2.2  The AGE and SALARY attributes of Employees
Table 3.1  Distance Measures
Table 3.2  Membership values as input variables in Fuzzy Function parameter estimations
Table 3.3  Correlation Analysis of FCM clustering and IFC membership values with the output variable
Table 3.4  Significance Test Results of fuzzy functions using membership values obtained from FCM clustering and IFC fuzzy functions
Table 3.5  Functions used to generate Artificial Datasets
Table 3.6  Optimum number of clusters of artificial datasets for m ∈ {1.3, 2.0}
Table 3.7  Optimum Number of Clusters of IFC models of stock price dataset identified by different validity indices
Table 3.8  Optimum Number of Clusters of IFC models of Ionosphere dataset indicated by different validity indices
Table 4.1  Number of parameters of a Type-1 Improved Fuzzy Functions (T1IFF) experiment
Table 4.2  The number of parameters of an Evolutionary Type-1 Improved Fuzzy Function experiment
Table 5.1  Improved Fuzzy Functions Parameter Representation
Table 5.2  Differences between the proposed and earlier Type-2 Fuzzy System Modeling approaches
Table 5.3  The steps of the Genetic Learning Process of EDIT2IFF
Table 5.4  Number of parameters of Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF)
Table 5.5  The number of parameters of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF)
Table 6.1  Overview of Datasets used in the experiments
Table 6.2  Calculation of overall performance of a method based on three-way cross validation results. The overall performance is represented with the tuple 〈PM, stPM〉
Table 6.3  Contingency Table to calculate accuracy
Table 6.4  Learning parameters of Support Vector Machines for classification and regression methods
Table 6.5  Learning parameters of 1-Layer Neural Networks
Table 6.6  Learning parameters of Adaptive Network Fuzzy Inference Systems – ANFIS (Takagi-Sugeno), Subtractive Clustering Method
Table 6.7  Learning parameters of Dynamically Evolving Neuro-Fuzzy Inference System – DENFIS, Online Learning with Higher order Takagi-Sugeno (TS) inference
Table 6.8  Learning parameters of Type-2 Fuzzy Rule Base Approach – DIT2FRB
Table 6.9  Initial Parameters of Genetic Fuzzy System
Table 6.10  The Parameters of Type-1 and Type-2 Fuzzy Functions Methods for Regression Problems
Table 6.11  The Parameters of Type-1 and Type-2 Fuzzy Functions Methods for Classification Problems
Table 6.12  The Parameters of Type-1 and Type-2 Improved Fuzzy Functions Methods for Regression
Table 6.13  The Parameters of Type-1 and Type-2 Improved Fuzzy Functions Methods for Classification Problems
Table 6.14  R² values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.15  Optimum Parameters of Variations of Type-1 Fuzzy Functions Approach
Table 6.16  R² values obtained from the application of Benchmark Approaches on Training-Validation-Testing Datasets of Friedman's Artificial Dataset and their optimum model parameters
Table 6.17  R² values obtained from the application of Type-2 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.18  Optimum Parameters of Variations of Type-2 Fuzzy Functions Approach
Table 6.19  R² values obtained from the application of the Earlier Type-2 Fuzzy Rule Base (DIT2FRB) Approach on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.20  Two-sample left-tailed t-test results (p<0.05) for Friedman's Artificial Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.21  Auto-Mileage Dataset Variables
Table 6.22  R² values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of the Auto-Mileage Dataset
Table 6.23  Optimum Parameters of the variations of Type-1 Fuzzy Functions Approach
Table 6.24  R² values obtained from the application of Benchmark Approaches on Training-Validation-Testing Datasets of the Auto-Mileage Dataset and their optimum model parameters
Table 6.25  R² values obtained from the application of variations of Type-2 Fuzzy Functions Approaches on Training-Validation-Testing Datasets of the Auto-Mileage Dataset
Table 6.26  Optimum Parameters of variations of Type-2 Fuzzy Functions Approach
Table 6.27  R² values obtained from the application of the Earlier Type-2 Fuzzy Rule Base (DIT2FRB) Approach on Training-Validation-Testing Datasets of the Auto-Mileage Dataset
Table 6.28  Two-sample left-tailed t-test results (p<0.05) for the Auto-Mileage Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.29  Desulphurization Dataset variables
Table 6.30  R² values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of the Reagent1 and Reagent2 Datasets of the Desulphurization Process
Table 6.31  Optimum Parameters of variations of Type-1 Fuzzy Functions Approaches
Table 6.32  R² values obtained from the application of Benchmark Methods on Training-Validation-Testing Datasets for Reagent1 and Reagent2 of the Desulphurization Process and their optimum model parameters
Table 6.33  R² values obtained from the application of variations of the Type-2 Fuzzy Functions Approach on Training-Validation-Testing Datasets for the Reagent1 and Reagent2 Datasets of the Desulphurization Process
Table 6.34  Optimum Parameters of variations of Type-2 Fuzzy Functions Approach
Table 6.35  R² values obtained from the application of the Earlier Type-2 Fuzzy Rule Base (DIT2FRB) Approach on Training-Validation-Testing Datasets of the Desulphurization Process
Table 6.36  Two-sample left-tailed t-test results (p<0.05) for the Reagent 1 Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.37  Two-sample left-tailed t-test results (p<0.05) for the Reagent 2 Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.38  List of variables used in Stock Price Estimation Models
Table 6.39  Descriptions of Stock Datasets that are used in the experiments
Table 6.40  STB values obtained from two different Hypothetical Models of an Artificial Stock Dataset. Both hypothetical models start trading with $100
Table 6.41  RSTB prediction errors of two Hypothetical Models of the Artificial Stock Dataset. Both hypothetical models start trading with $100
Table 6.42  Predictions obtained from two Hypothetical Models of the Artificial Stock Dataset and the RSTB values. Both hypothetical models start trading with $100
Table 6.43  Average performance measures of models of five Real Stock Prices based on MAPE of Testing datasets. The standard deviations over five different testing datasets are shown in parentheses
Table 6.44  Average performance measures of models of five Real Stock Prices based on RSTB of Testing datasets. The standard deviations over five different testing datasets are shown in parentheses
Table 6.45  Summary of performances of models on five different real stocks based on profitability values of the best models of the proposed, benchmark and 'buy&hold' methods
Table 6.46  The effects of broker commission on the average profit obtained from benchmark models of TD Bank stock prices. Different investment amounts are shown
Table 6.47  Description of Classification Datasets from the UCI Repository that are used in the experiments
Table 6.48  Performance measure in AUC values of proposed and benchmark classification methods on six different classification datasets
Table 6.49  Optimum Parameters of the best classification models of five different classification datasets
Table 6.50  Parameters of four different models that are built using four different methods to compare their elapsed times on the Desulphurization dataset
Table 6.51  Parameters of different models that are built to measure their elapsed times. Four different methods are paired based on their optimization methods
Table 6.52  Average Profit based on the RSTB Performance Measure and Ranking of system modeling techniques on five Real Stock Price Datasets
Table 6.53  Overall two-sample left-tailed t-test results (p<0.05) for Stock Price Datasets. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.54  TD Canada Trust Buy-Sell-Hold trend and Current Market values between the 73rd and 82nd days obtained from SVM and the proposed Evolutionary Type-2 Improved Fuzzy Functions model (EDIT2IFF)
Table 6.55  Average R² and Rankings of system modeling techniques on Regression Problems. The values in parentheses are standard deviations obtained from cross validation iterations. The values in bold indicate the optimum method of the corresponding dataset of each column
Table 6.56  Overall two-sample left-tailed t-test results (p<0.05) for Regression Datasets. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R)
Table 6.57  Average AUC values and Rankings of system modeling techniques on Classification Problems. The values in parentheses indicate the rank of each methodology for the corresponding dataset in each column
Table 6.58  Overall two-sample left-tailed t-test results (p<0.05) for the Classification Datasets. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis
Table 6.59  Overall Ranking Results of Classification Datasets
Table 6.60  Summary of Results

App. Table D.1  Friedman's Artificial R² results on Training Dataset from Cross Validation (CV) Trials
App. Table D.2  Friedman's Artificial R² results on Validation Dataset from Cross Validation (CV) Trials
App. Table D.3  Friedman's Artificial R² results on Testing Dataset from Cross Validation (CV) Trials
App. Table D.4  Friedman's Artificial MAPE results on Training Dataset from Cross Validation (CV) Trials
App. Table D.5  Friedman's Artificial MAPE results on Validation Dataset from Cross Validation (CV) Trials
App. Table D.6  Friedman's Artificial MAPE results on Testing Dataset from Cross Validation (CV) Trials
App. Table D.7  Friedman's Artificial RMSE results on Training Dataset from Cross Validation (CV) Trials
App. Table D.8  Friedman's Artificial RMSE results on Validation Dataset from Cross Validation (CV) Trials
App. Table D.9  Friedman's Artificial RMSE results on Testing Dataset from Cross Validation (CV) Trials
App. Table D.10  Auto Mileage Dataset R² values on Training Datasets of Cross Validation (CV) Trials
App. Table D.11  Auto Mileage Dataset R² values on Validation Datasets of Cross Validation (CV)
App. Table D.12  Auto Mileage Dataset R² values on Testing Datasets of Cross Validation (CV)
App. Table D.13  Auto Mileage Dataset MAPE values on Training Datasets of Cross Validation (CV)
App. Table D.14  Auto Mileage Dataset MAPE values on Validation Datasets of Cross Validation (CV)
App. Table D.15  Auto Mileage Dataset MAPE values on Testing Datasets of Cross Validation (CV)
App. Table D.16  Auto Mileage Dataset RMSE values on Training Datasets of Cross Validation (CV)
App. Table D.17  Auto Mileage Dataset RMSE values on Validation Datasets of Cross Validation (CV)
App. Table D.18  Auto Mileage Dataset RMSE values on Testing Datasets of Cross Validation (CV)
App. Table D.19  Desulphurization Reagent1 R² values on Training Dataset of Cross Validation (CV) Trials
App. Table D.20  Desulphurization Reagent1 R² values on Validation Dataset of Cross Validation (CV) Trials
App. Table D.21  Desulphurization Reagent1 R² values on Testing Dataset of Cross Validation (CV) Trials
App. Table D.22  Desulphurization Reagent2 R² values on Training Dataset of Cross Validation (CV) Trials
App. Table D.23  Desulphurization Reagent2 R² values on Validation Dataset of Cross Validation (CV) Trials
App. Table D.24  Desulphurization Reagent2 R² values on Testing Dataset of Cross Validation (CV) Trials
App. Table D.25  TD Canada Trust Cross Validation (CV) RSTB values for Testing Dataset. Buy_and_Hold profit = $102.28
App. Table D.26  TD Canada Trust Cross Validation (CV) MAPE% for Training Dataset
App. Table D.27  TD Canada Trust Cross Validation (CV) MAPE% for Validation Dataset
App. Table D.28  TD Canada Trust Cross Validation (CV) MAPE% for Testing Dataset
App. Table D.29  TD Canada Trust – Market Value estimation of two prediction algorithms. The shaded region indicates that from that day forward the two methodologies start to predict different values for the next-day stock prices
App. Table D.30  Bank of Montréal Cross Validation (CV) RSTB values for Testing Dataset
App. Table D.31  Bank of Montréal Cross Validation (CV) MAPE values for Training Dataset
App. Table D.32  Bank of Montréal Cross Validation (CV) MAPE values for Validation Dataset
App. Table D.33  Bank of Montréal Cross Validation (CV) MAPE values for Testing Dataset
App. Table D.34  Sun Life Cross Validation (CV) RSTB values for Testing Dataset
App. Table D.35  Sun Life Cross Validation MAPE for Training Dataset
App. Table D.36  Sun Life Cross Validation MAPE % for Validation Dataset
App. Table D.37  Sun Life Cross Validation MAPE % for Testing Dataset
App. Table D.38  Enbridge Cross Validation (CV) RSTB values for Testing Dataset
App. Table D.39  Enbridge Cross Validation (CV) MAPE % for Training Dataset
App. Table D.40  Enbridge Cross Validation (CV) MAPE % for Validation Dataset
App. Table D.41  Enbridge Cross Validation (CV) MAPE % for Testing Dataset
App. Table D.42  Loblaw Cross Validation (CV) RSTB values for Testing Dataset
App. Table D.43  Loblaw Cross Validation (CV) MAPE % for Training Dataset
App. Table D.44  Loblaw Cross Validation (CV) MAPE % for Validation Dataset
App. Table D.45  Loblaw Cross Validation MAPE % for Testing Dataset
App. Table D.46  Cancer Dataset AUC values from testing dataset
App. Table D.47  Cancer Dataset Accuracy values from testing dataset
App. Table D.48  Diabetes Dataset AUC values from testing dataset
App. Table D.49  Diabetes Dataset Accuracy values from testing dataset
App. Table D.50  Liver Dataset AUC values from testing dataset
App. Table D.51  Liver Dataset Accuracy values from testing dataset
App. Table D.52  Ion Dataset AUC values from testing dataset
App. Table D.53  Ion Dataset Accuracy values from testing dataset
App. Table D.54  Credit Dataset AUC values from testing dataset
App. Table D.55  Credit Dataset Accuracy values from testing dataset
App. Table D.56  California Dataset AUC values from testing dataset
App. Table D.57  California Dataset Accuracy values from testing dataset
List of Figures

Fig. 1.1  Contents of the book at a glance
Fig. 2.1  Membership functions of fuzzy sets "young" (green), "middle age" (red) and "old" (blue)
Fig. 2.2  Fuzzy Set Properties on a "Trapezoidal" membership function example
Fig. 2.3  Union and Intersection
Fig. 2.4  Complement of Fuzzy Set A, Aᶜ (dashed line)
Fig. 2.5  Crisp Binary Relation (Equality relation, E, between x and y)
Fig. 2.6  Cartesian product of two fuzzy sets (A and B)
Fig. 2.7  Fuzzy set of young employees (A(x))
Fig. 2.8  Fuzzy set of salary of young employees (B(y))
Fig. 2.9  Comparison of Type-1 and Type-2 Fuzzy sets. The shaded region on the right graph is the secondary membership function of x
Fig. 2.10  Example of a full type-2 membership function. The shaded area is the 'Footprint of Uncertainty' (FOU). The amplitudes of the sticks are the secondary membership values
Fig. 2.11  Interval Valued Type-2 Fuzzy Set
Fig. 2.12  Generalized Type-1 Fuzzy Inference Systems (Fuzzy Logic System)
Fig. 2.13  TSK Fuzzy Reasoning
Fig. 2.14  A Sample ANFIS Structure
Fig. 2.15  Chromosome structure of a Genetic Fuzzy System. Each a_ij and b_ij indicates the triangular membership function parameters
Fig. 2.16  The effect of changing parameters on the membership function
Fig. 3.1  Scatter plot of artificial data with two different random functions: f1(x) = y + random_noise, f2(x) = 0.3 + random_noise
Fig. 3.2  Types of membership value calculation equations
Fig. 3.3  Scatter plot of the artificial dataset
Fig. 3.4  (White Surface) Actual decision surface using membership values (from the FCM clustering), (Black Linear Surface) the estimated linear decision surface. 'u' indicates the membership values
Fig. 3.5  (White Surface) Actual decision surface using membership values (from the IFC), (Black Linear Surface) the estimated linear decision surface. 'u' indicates the improved membership values
Fig. 3.6  Compactness and Separation concepts of ratio-type CVI index
Fig. 3.7  Four Different Functions with Varying Separability Relations
Fig. 3.8  Dataset 1 – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-), using the 4-patterned dataset
Fig. 3.9  Dataset 2 – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-), using the 5-patterned dataset
Fig. 3.10  Dataset 3 – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-), using the 7-patterned dataset
Fig. 3.11  Dataset 4 – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-), using the 9-patterned dataset
Fig. 3.12  Density graph of stock price dataset using two financial indicators. Two components that are well separated are indicated
Fig. 3.13  Stock Price Dataset – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-)
Fig. 3.14  Ionosphere Dataset – Cluster validity measures, XB, XB*, Kung-Lin and cviIFC-C, versus c for two m values; m=1.3 (.-) and m=2.0 (*-)
Fig. 4.1  Framework of fuzzy system models with the "Fuzzy Functions" approach
Fig. 4.2  Evolution of Type-1 Fuzzy Functions Strategies
Fig. 4.3  Modeling with Fuzzy Functions Systems
Fig. 4.4  Mapping the Input-Output space onto individual clusters using a one-dimensional Toy Example. Support Vector Regression is used to find the decision surfaces for two clusters in (1+1) space, i.e., only membership values are used as additional dimensions
Fig. 4.5  Type-1 Improved Fuzzy Functions Framework based on the sub-sampling cross validation method
Fig. 4.6  Structure of the T1IFF approach. (Top) Structure Identification with Improved Fuzzy Functions, (Bottom) Improved fuzzy function i
Fig. 4.7  Implementation of Improved Fuzzy Clustering in T1IFF
Fig. 4.8  Topology of Type-1 Improved Fuzzy Functions Systems for Classification Models
Fig. 4.9  Proposed Inference Mechanism of T1IFF
Fig. 4.10  Evolutionary Improved Fuzzy Functions (ET1IFF) System Modeling Framework – Extension of the Improved Fuzzy Functions (T1IFF) Approach to Genetic Fuzzy Systems. i = 1…c
Fig. 4.11  Genome Encoding for ET1IFF systems. (Top) Hierarchical structure genome encoding when linear regression is used, (Middle) Encoding when SVM is used, (Bottom) Example chromosome structure
Fig. 4.12  Decision surfaces of one chromosome of an ET1IFF model using two different fuzzy function structures. (m, c, Creg, ε) = {1.75, 3, 54.5, 0.115}. uci represents membership values of the corresponding cluster
Fig. 4.13  Two different chromosomes from the GLP algorithm of the ET1IFF modeling approach applied on the Artificial Dataset. The dark colored token is the only difference between the two chromosomes. '1': Linear Kernel Model, '2': Non-Linear Kernel Model
Fig. 4.14  Genome Encoding for ET1IFF-C systems for Classification Domains. (Top) Hierarchical structure genome encoding for linear regression, (Middle) Encoding when SVM is used, (Bottom) Example chromosome structure
Fig. 5.1  Type-1 Improved Fuzzy Functions Systems. '*' marks Mendel's terminology, which corresponds to the terminology used for fuzzy function strategies in this book
Fig. 5.2  Type-2 Improved Fuzzy Functions System. '*' indicates differences between Mendel's terminology and Fuzzy Function terminology
Fig. 5.3  Different Types of Uncertainty
Fig. 5.4  Example of a full type-2 membership function. The shaded area is the 'Footprint of Uncertainty' (FOU). The amplitudes of the sticks are the secondary membership values
Fig. 5.5  The structure of Generalized Type-2 Fuzzy Rule Base Systems (GT2FRB)
Fig. 5.6  DIT2FRB inference strategy by [Uncu, 2003; Uncu and Turksen, 2007]
Fig. 5.7  Interval Type-2 membership function with fixed mean and uncertain standard deviation. The shaded area includes an infinite number of membership values
Fig. 5.8  Artificial dataset with one input variable (x1) and a single output (y)
Fig. 5.9  Membership values for cluster 1 using the artificial dataset, determined by changing m values
Fig. 5.10  Changing cluster center locations of cluster 1 based on changing m values. The square dots in bold indicate the cluster centers for changing m values
Fig. 5.11  Uncertainty interval of a cluster represented with membership functions of 3-tuples, <c, mr, τs>. c is the number of clusters, mr defines the level of fuzziness, τs represents the interim fuzzy function structures of the Improved Fuzzy Clustering (IFC)
Fig. 5.12  A general topology of the DIT2IFF structure identification. The shaded fuzzy functions represent the fuzzy functions of the corresponding cluster. The output value of the kth data point, yk, is calculated using each output obtained from these fuzzy functions
Fig. 5.13  DIT2IFF Inference Framework. Grey boxes depict enhanced/newly introduced operations compared to the earlier DIT2FRB approach
Fig. 5.14  Case-based type reduction for the inference method of the DIT2IFF system. The dark dots (●) represent the optimum membership values and fuzzy functions determined for each testing vector l, l = 1…nte
Fig. 5.15  RMSE values of improved fuzzy functions for a specific cluster i. Each line represents one type of fuzzy functions denoted in the legend of the figure. u represents membership values
Fig. 5.16  The architecture of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions
Fig. 5.17  EDIT2IFF design – Phase 1. The dark dots (●) represent the tokens (each gene) corresponding to the parameters specific to the function type, e.g., if SVM is chosen, these parameters would include the regularization (Creg), epsilon (ε), and Kernel Type (K(.)). Darker colored tokens represent control genes
Fig. 5.18  Genetic Learning Process – Phase 1 of Evolutionary Interval Type-1 Fuzzy Functions models, identifying (Top) interval valued membership functions, (Bottom) the interval for fuzzy functions
Fig. 5.19  Decision surfaces obtained from GLP for changing fuzzy function structures of chromosomes. K(.) = {Linear or Non-linear} and (mlow, mup, Creg, ε) = {1.75, 2.00, 54.5, 0.115}, c* = 3. uclus_i represents improved membership values of the corresponding cluster i
Fig. 5.20  Two different chromosomes from the GLP algorithm of the EDIT2IFF modeling approach applied on the Artificial Dataset. The dark colored token is the only difference between the two chromosomes. '1': Linear Kernel Model, '2': Non-Linear Kernel Model
Fig. 5.21  Interval valued membership values of the evolutionary type-2 fuzzy functions strategy. Uncertainty interval represented with the membership value dispersion induced by each tuple <mr, τs>. (a) Optimized uncertainty interval from GLP – Phase 1 of EDIT2IFF, (b) Discrete Improved Membership Values – Phase 2 of EDIT2IFF, (c) magnified view of different discrete membership values for any x′
Fig. 5.22  Uncertainty interval of Fuzzy Function Structures. (a) Uncertainty interval represented with different output values obtained from the list of fuzzy function structures induced by each tuple <mr, τs, Φp>, (b) Optimized uncertainty interval from GLP – Phase 1 of EDIT2IFF, (c) magnified view of different output values, y_ik^{r,s,ψ}, for a specific x′ vector obtained from the optimized list of fuzzy functions
Fig. 6.1  General Framework of the Three-Way Cross Validation Method
Fig. 6.2  Three-way cross validation used in Fuzzy Functions approaches
Fig. 6.3  Schematic view of the Three-Way Cross Validation Process that is used for stock price estimation models
Fig. 6.4  A sample Receiver Operating Characteristic (ROC) curve
Fig. 6.5  The ROC "curve" created by varying threshold values. The table at right shows 20 data points and the score (probabilities) assigned to each of them. The graph on the left shows the corresponding ROC curve with each point labeled by the threshold that produces it
Fig. 6.6  Scatter diagram of selected variables from Friedman's Artificial dataset
Fig. 6.7  Average Cross Validation Test R² values of Optimum Models – Friedman's Artificial Dataset. Standard errors of five repetitions are shown on the curve
Fig. 6.8  Cross Validation Test R² values of Optimum Models – Friedman's Artificial Dataset. Standard errors of five repetitions are shown on the curve
Fig. 6.9  Correlation between Performance Measures of each of Friedman's Training, Validation and Testing Datasets for five cross validation models. cviXXX represent the cross validation models, TR: training, VR: validation, TE: testing datasets
Fig. 6.10  Average Cross Validation Test R² values of Optimum Models – Auto Mileage Dataset. Standard errors of five repetitions are shown on the curve
Fig. 6.11  Auto-Mileage Dataset Cross Validation Test Performance measures of Optimum Models using R², MAPE, and RMSE measures. Standard errors of five repetitions are shown on the curve
Fig. 6.12  Correlation between Performance Measures of Auto-Mileage Dataset Training, Validation and Testing Datasets for five cross validation models. cviXXX represent the cross validation models, TR: training, VR: validation, TE: testing datasets
Fig. 6.13  Process flow of Hot Metal Pretreatment
Fig. 6.14  A Blast Furnace at a Hot Metal Pre-treatment plant
Fig. 6.15  Torpedo Car Desulphurization Process
Fig. 6.16  Distribution of Reagent1 versus Reagent2 variables of the Desulphurization Dataset
Fig. 6.17  Kernel density estimation of Reagent 1 for four grades of Aim-sulphur (discrete variable)
Fig. 6.18  Average cross validation R² values of optimum models of the Desulphurization Reagent 1 Dataset
Fig. 6.19  Average cross validation R² values of optimum models of the Desulphurization Reagent 2 Dataset
Fig. 6.20  Two financial indicators of a selected Stock's Price versus Closing Stock Price
Fig. 6.21  TD Canada Trust – 400 days of historical prices (23 Nov 2005 – 8 June 2007). The first 300 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model
Fig. 6.22  Loblaws – 444 days of historical prices (Nov. 14, 2005 – Aug. 21, 2007). The first 344 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model
244
244
248
250
251 251 252 252 254 255 259 259 263
265
265
List of Figures
Fig. 6.23 Bank of Montreal – 400 days historical price (Nov 11 2005-Aug 20 2007). First 200 days are used for training and validation o optimize the model parameters, and the last 100 days are used for testing the Performance of the optimum model………………………………………………. Fig. 6.24 Enbridge – 400 days historical price (Nov 14 2005-Aug 21 2007). First 200 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model……………………………………………..... Fig. 6.25 Sun Life – 400 days historical price Nov. 11, 2005-Aug 20, 2007). First 345 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum mode………………………………………………………….. Fig. 6.26 Actual and Predicted Stock Prices of an artificial Dataset of Stock Prices for a 10 days period. Predicted stocks from Hypothetical Model 1, Hypothetical Model2 and Actual Prices are shown. Model 2 is not a good model of the actual stock prices……………………………………...… Fig. 6.27 The effect of broker commission on profit for different values of investments………………………………………… Fig. 6.28 Proposed cviIFC cluster validity values of proposed IFC results for Friedman’s Artificial Dataset for two different fuzziness values; m={1.3,2.0}…………………………..…… Fig. 6.29 Automobile Mileage Dataset cviIFC graphs for two different fuzziness values; m={1.3,2.0}…………………. Fig. 6.30 The Elapsed times (min) of structure identification (learning) and reasoning algorithms of optimum methodologies using different samples of training and testing datasets, respectively. (left) the training (learning) periods, (right) the inference periods……………………….... Fig. 6.31 Elapsed times in minutes of structure identification of different fuzzy function systems……………………………... Fig. 6.32 TD Canada Trust Stock Prices of Testing dataset (100 day trading period) – (Top) Actual and Predicted using two different algorithms, SVM and Proposed EDIT2IFF, (bottom) current-market-values at each t – trading day obtained from estimated models……………... Fig. 6.33 Magnified View of TD Canada Trust Stock Prices in Figure 6.32- Actual versus Predicted and Current Market Values………………………….…………………………… Fig. 6.34 Comparison of System modeling techniques based on classification Ranking Results………………….…………….
XXXV
265
265
265
270 276
277 278
286 288
293
294 300
XXXVI
List of Figures
App. Fig. A.1 Composition of binary fuzzy relations R and S……….. App. Fig. C.1 A Neural Network with an input layer, hidden layer and one-output neurons………………………………... App. Fig. C.2 Separating Hyperplane with relaxed constraints…......... App. Fig. C.3 Soft Margin Loss Function……………………….…… App. Fig. C.4 One-point Crossover………………………………...… App. Fig. C.5 Uniform Crossover…………………………………….. App. Fig. D.1 Idealized distributions for treated and comparison group values……………………………………….. App. Fig. D.2 Diabetes Dataset CVI Graphs of IFC-C method for two different fuzziness values; m={1.3,2.0}……..…… App. Fig. D.3 Ion Dataset CVI Graphs of IFC-C method for two different fuzziness values; m={1.3,2.0}……………..... App. Fig. D.4 Liver Dataset CVI Graphs of IFC-C method for two different fuzziness values; m={1.3,2.0}………...... App. Fig. D.5 Credit Dataset CVI Graphs of IFC-C method for two different fuzziness values; m={1.3,2.0}………….. App. Fig. D.6 Cancer Dataset CVI Graphs of IFC-C method for two different fuzziness values; m={1.3,2.0}…………. App. Fig. D.7 Ionosphere Dataset– Receiver Operating Characteristic (ROC) Graphs of Optimum model DIT2FF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines…………………………………………..…… App. Fig. D.8 Credit Dataset – Receiver Operating Characteristic (ROC) Graphs of Optimum model - DIT2IFF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines………. App. Fig. D.9 Cancer Dataset – Receiver Operating Characteristic (ROC) Graphs of Optimum model - DIT2FF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines…….... App. Fig. D.10 Liver Dataset – Receiver Operating Characteristic (ROC) Graphs of Proposed DIT2FF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines…………….... App. Fig. D.11 Diabetes Dataset- Receiver Operating Characteristic (ROC) Graphs of Proposed DIT2IF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Netorks, (d) Support Vector Machines……..….
323 328 330 336 339 340 344 397 397 397 398 398
398
399
399
400
400
Notation
A
norm matrix A
A,B,C
type-1 fuzzy sets
Ā, B̄, C̄
complement of type-1 fuzzy sets.
Ã, B̃, C̃
type-2 fuzzy sets
Ai
aggregated input fuzzy set characterizing the antecedent of the ith rule
Aij
fuzzy set associated with the jth input variable in ith rule.
Ãi
aggregated type-2 fuzzy set characterizing the antecedent of the ith rule.
Ãij
type-2 fuzzy set associated with the jth input variable in ith rule.
Air
aggregated fuzzy set characterizing the antecedent of the ith rule identified by level of fuzziness mr
αA, α+A
The α-cut and strong α-cut of fuzzy set A
ANFIS
Adaptive Network Based Fuzzy Inference System
a,b,c
membership values
B*
model output fuzzy set
Bi
output fuzzy set in ith rule
B̃i
type-2 output fuzzy set in ith rule
c
number of clusters
c*, c-optimal
optimum number of clusters
C-reg, C-regularization
regularization parameter of support vector machines methodology
Core(A)
The core of fuzzy set A
cviIFC
proposed cluster validity index, i.e., formula, for improved fuzzy clustering
cviIFC-C
proposed cluster validity index, i.e., formula, for improved fuzzy clustering for classification problem domains
CNF
Conjunctive Normal Forms
Dr
column matrix of distance values characterizing the nearest κ data-points to the rth testing data vector
DENFIS
Dynamically Evolving Neural Fuzzy Inference System
DNF
Disjunctive Normal Forms
DIT2FRB
Discrete Interval Valued Fuzzy Rule Base
DIT2FF
Discrete Interval Type-2 Fuzzy Functions
DIT2IFF
Discrete Interval Type-2 Improved Fuzzy Functions
drq
distance between the rth testing vector and qth training vector
d=||xk-vi(x)||, d(xk,vi(x)), dik(x)
the distance value between the kth data vector and ith cluster center vector, vi(x).
di(xy)
the distance value between the kth input-output data vector and ith cluster center vector, vi(x|y).
det
determinant
ET1FF
evolutionary type-1 fuzzy functions
ET1IFF
evolutionary type-1 improved fuzzy functions
ET1FF-C
evolutionary type-1 fuzzy functions for classification
ET1IFF-C
evolutionary type-1 improved fuzzy functions for classification
EDIT2FF
evolutionary discrete interval type-2 fuzzy functions
EDIT2IFF
evolutionary discrete interval type-2 improved fuzzy functions
EDIT2FF-C
evolutionary discrete interval type-2 fuzzy functions for classification
EDIT2IFF-C
evolutionary discrete interval type-2 improved fuzzy functions for classification
FCM
fuzzy c-means clustering algorithm
FCNF
Fuzzy conjunctive normal forms
FCRM
fuzzy c-regression clustering method, i.e., fuzzy switching regression
FDNF
Fuzzy disjunctive normal forms
FIS
Fuzzy Inference System
FRB
fuzzy rule bases
fi(Φi)
type-1 fuzzy function identified by the input matrix of ith cluster in feature space
fi(Φi,Ŵi)
type-1 fuzzy function identified by the input matrix of ith cluster in feature space and the regression coefficients represented with Ŵi.
fir,s,ψ
type-1 fuzzy function of cluster i characterizing the regression function of ith rule identified by using tuples of 〈mr,τis,Φiψ〉
fip(Φip Ŵip)
Fuzzy functions of cluster i identified by the parameters of the pth chromosome of evolutionary algorithm model.
GFS
Genetic Fuzzy System
GT2FRB
General Type-2 Fuzzy Rule Bases
gi(τi), gi(τi,ŵi),
interim fuzzy classifier functions identified by improved fuzzy clustering algorithm for classification problem domains, whose coefficients are represented with ŵi
hi(τi), hi(τi ⎜ŵi), hi(⋅)
interim fuzzy functions identified by improved fuzzy clustering algorithm, characterized by coefficients of ŵi
H(A)
Height of fuzzy set A
IFC
improved fuzzy clustering algorithm for regression problem domains
IFC-C
improved fuzzy clustering algorithm for classification problem domains
IFF
improved fuzzy functions
IT2FRB
interval type-2 fuzzy rule base model
J
objective function value of FCM clustering algorithm
J(t)
objective function value of FCM clustering algorithm at tth iteration
JmIFC
objective function value of the improved fuzzy clustering (ifc) given a level of fuzziness (m)
K(.)
kernel function
K(xk, xj)
kernel function between two vectors xk and xj
m*
optimal level of fuzziness
mr
the rth level-of-fuzziness value in interval type-2 fuzzy membership values
mk*
the value for which the corresponding type-2 fuzzy system model yields the least prediction error for kth input data vector in training dataset.
m-Col
collection table that holds the optimum level of fuzziness values of each input data vector
NN
neural networks
NNC
neural networks for classification problems
NUP
Number of uncertainty parameters to identify the uncertainty interval of membership values and fuzzy functions
n
number of data points (objects or observations or patterns) in training dataset.
niα
reduced number of data points of an input matrix of ith cluster characterized by a threshold, α, based on membership values
ndv
number of data points in validation dataset
nf
total number of discrete values of fuzzy function structures (Φψ), ψ=1…nf
nif
total number of discrete values of interim fuzzy function structures (τs), s=1…nif, of improved fuzzy clustering.
nr
total number of discrete values of level of fuzziness (mr), r=1…nr
nte
number of objects in testing dataset
nv
number of input variables in the dataset
OTV
optimum threshold value
Pi(yk=1|Φik)
predicted posterior probability of the kth data vector in feature space of ith cluster, Φik, that its output will be 1
Pμ
indicates the domain of parameters of membership function
PL
domain of learning parameters of fuzzy system modeling approaches
PmL
domain of learning parameters identified by level of fuzziness
PτL
domain of interim fuzzy function structures identified by interim matrix structure of IFC
PΦL
domain of fuzzy function structures identified by input matrix structure in the feature space
P̂i
posterior probability values of data vectors obtained from local fuzzy classifier functions of ith cluster
P̂ik(yk=1)
predicted posterior probabilities of the output value yk=1 for a vector k from local fuzzy classifier functions.
p̂i
posterior probability values of data vectors obtained from interim fuzzy classifier functions of ith cluster
p̂ik(yk=1) or p̂ik(yk=1|τik)
predicted posterior probabilities of the output value yk=1 for a vector k from interim fuzzy classifier functions.
q
the number of different embedded type-1 improved fuzzy function (T1IFF) models to be constructed using the parameter set {c*, mr, τs, Φp}, c*=2,…,n, r=1…nr, s=1…nif, p=1,…,(nf)c*; q=1,…,nif×nr×(nf)c*
r
the identifier for different level of fuzziness parameter, r=1…nr
RP
recognition performance
Ri
ith rule in type-1 fuzzy rule base
RSTB
Robust simulated trading benchmark
R̃i
ith rule in type-2 fuzzy rule base
S
S-norm connective
SEiq
square error between the actual output and the predicted output of data vector q, using the local model or interim fuzzy function model of ith cluster.
Supp(A)
Support of fuzzy set A
SVC
support vector machines for classification models
SVR
support vector machines for regression models
s
indicator for support vector
s
the identifier for different interim fuzzy function structures, τs, s=1..nif
T1FF
type-1 fuzzy functions
T1IFF
type-1 improved fuzzy functions
T1FF-C
type-1 fuzzy functions for classification
T1IFF-C
type-1 improved fuzzy functions for classification
T
T-norm connective
T
transpose operation
TSK
Takagi-Sugeno-Kang fuzzy inference system
t
indicator for iteration number
t
the number of different embedded IFC models to be constructed using parameter set of {mr,τs}, r=1…nr, s=1…nif, t=1,…,nif×nr
te
testing dataset indicator
U
partition (membership value) matrix, u⊂[0,1]n×c
U*
optimal partition (improved membership value) matrix, u*⊂[0,1]n×c
u
membership value
V
matrix of cluster centers of vectors of the same form as xk , V⊂ℜc×nv
v
indicator for validation matrix
vi(t)
cluster center vector, i.e., cluster prototype, of a cluster I identified at the tth iteration of fuzzy clustering algorithm
vi(x), vi
cluster center vector, i.e., cluster prototype, of a cluster i
W
objective function of FCM after Lagrange transformation
Ŵi
coefficients of a (system) fuzzy function of a particular cluster i obtained from any regression approximation function, e.g., Ŵi = [Ŵ0i Ŵ1i … Ŵnv+nm,i], where there are nv+nm different variables, including the nv-dimensional input variables and the nm-dimensional membership values and their transformations of a cluster i
Ŵip*
optimal fuzzy function parameters of cluster i identified with the pth chromosome
Ŵir,s,ψ
Parameters of local improved fuzzy functions of cluster i identified with mr, τs, Φψ
Ŵ*i,k
optimum parameters of local fuzzy functions of cluster i identified for kth training data vector
Ŵ*j,i,k
The optimum coefficient of fuzzy functions of cluster i characterized for the jth input variable of kth input data vector
ŵi
coefficients of an interim fuzzy functions obtained from improved fuzzy clustering algorithm, e.g., ŵi = [ŵ0i ŵ1i … ŵnm,i] where there are nm different membership values and their transformations used to estimate the interim fuzzy function of a cluster i, not including original input variables.
ŵi*
Optimum coefficients of an interim fuzzy functions obtained from improved fuzzy clustering algorithm, e.g., ŵi = [ŵ0i ŵ1i … ŵnm,i] where there are nm different membership values and their transformations used to estimate the interim fuzzy function of a cluster i, not including original input variables.
ŵj,i
the interim fuzzy function coefficient of the jth input variable in cluster i
ŵ*j,k
The optimum interim fuzzy function coefficient for the jth input variable of kth data vector
Xv
The domain of validation dataset
Xtest
The domain of the testing dataset
XB
Xie-Beni cluster validity index
XB*
improved Xie-Beni cluster validity index
xk
kth input data vector of the training dataset, xk=(x1,k, …, xnv,k).
xkv
kth input data vector of the validation dataset, xkv=(x1kv, …, xnvkv).
xktest
kth input data vector of the testing dataset, xktest=(x1ktest, …, xnvktest).
xj,k
jth variable of kth data vector.
x'
singleton observation
xj'
The value of jth input variable in the given singleton observation
Y
column matrix of output variable of the given system.
yk
output variable of the kth input data vector.
ykv
output variable of the kth input data vector in validation dataset.
yktest
output variable of the kth input data vector in testing dataset.
ŷikr,s,ψ
predicted output value of the kth input data vector using the fuzzy function of cluster i for tuples of 〈c*, mr, τis, Φiψ〉
Z+
positive integers
Z
empirical dataset of all system variables. z={x1, …, xnv, y}⊆ℜ(nv+1) in (nv+1) dimensions.
zk, z(xk, yk)
kth data vector of inputs and the output, e.g., zk=[xk,yk]
Φi , Φi(x,μi)
phi: the training matrix of ith cluster in feature space, composed of original input vectors x, of nv different input variables, and nm different transformations of membership values (μi) to approximate local fuzzy functions.
Φik
phi: the kth vector from the training matrix of ith cluster in feature space.
Φψ
(ψ)th input matrix structure in feature space comprised of original input variables (x) membership values and their transformations to be used in type-2 fuzzy functions strategies.
Φiψ(x,mr,τs) or Φir,s,ψ
(ψ)th input matrix structure of cluster i in feature space characterized with parameters of mr and τs
Φp = {Φψi=1, Φψi=2, …, Φψi=c*}
(p)th embedded T1IFF model identifies different local fuzzy function structures, ψ=1,…,nf, in each of its clusters, i=1…c*.
Φi(x,μi), Φi(x,τi)
the training input matrix of ith cluster in feature space to identify “fuzzy functions”
Φi′
Sample input matrix of cluster i in feature space
Φ-Col
The collection table that holds the optimum parameters and structures of the local fuzzy functions identified for each training data vector
α, α-cut
alpha:the threshold for membership values
(αik, αik*)
Lagrange multipliers of the kth data vector in the ith cluster used in the support vector machines algorithm
(αikr,s,ψ, αik*r,s,ψ)
Lagrange multipliers of the kth data vector in the ith cluster used in the support vector machines algorithm, identified by the T1IFF model with parameters mr, τs, Φψ
βk
beta: Lagrange multipliers of the kth data vector identified by SVM. It also indicates the parameters of the switching regression function of cluster k
βik
beta: Lagrange multipliers of the kth data vector in cluster i identified by SVM
γ
gamma: Gaussian radial basis kernel function parameter indicating the spread of the Gaussian distribution of similarities
ε, epsilon
the epsilon parameter of an SVM model, indicating the error between the actual and predicted output. It also indicates the error value of the regression equation
εi, epsilon
epsilon parameter of SVM model for ith cluster, viz., rule.
η
eta: indicates a user-defined constant of the membership value calculation function of a hybrid clustering methodology
κ
kappa: Number of nearest training vectors identified for each testing data vector during inference
μi
mu: the list of membership values of data points in cluster i
μiimp
the improved membership values scatter diagram of ith cluster
μik (t)
membership value of kth input vector in cluster i identified during tth iteration of FCM clustering method
μik , μik(x)
membership value of input vector xk in cluster i.
μikimp , μikimp(x)
improved membership value of input vector xk in cluster i.
μik(xy)
membership value of input-output vector xk in cluster i.
μikimp
improved membership value of kth input vector in cluster i obtained from ifc algorithm.
μikimp*
Optimum improved membership value of kth input vector in cluster i obtained from IFC algorithm.
μic*,r,s(x) or μr,s(x)
membership value scatter diagram of ith cluster identified by parameters of c*, mr and τs
μA(x)
Membership function of type-1 fuzzy set A
μÃ(x)
Membership function of type-2 fuzzy set Ã
μ̃i(x)
aggregated type-2 membership values of cluster i characterizing the antecedent of the ith rule
μ̃i(xj)
type-2 membership function of the jth antecedent variable in cluster i
μR̃i(x, y)
membership function of the type-2 fuzzy relation R̃i of rule i
μir(x)
Membership values of input matrix in ith rule identified by using mr assuming the number of clusters is fixed
μÃ(x, u)
Primary membership function of type-2 fuzzy set Ã
μiL(xj)
lower membership function of the interval type-2 fuzzy set of the jth antecedent variable
μiU(xj)
upper membership function of the interval type-2 fuzzy set of the jth antecedent variable
μi*(y)
interval-valued type-2 model output fuzzy set
μi*L(y)
lower membership function of the interval-valued type-2 model output fuzzy set
μi*U(y)
upper membership function of the interval-valued type-2 model output fuzzy set
θ
primary membership values of type-2 fuzzy sets
τs
interim matrix identified with (s)th interim fuzzy function structure to construct interval interim fuzzy functions
τi*, τi*(μiimp*)
optimal interim matrix used by IFC method
τi′
a user defined interim matrix
τ-Col
the collection table that holds the interim fuzzy functions structure and parameters identified for each training vector
ψ
the identifier for different (system, primary) fuzzy functions, Φψ, ψ=1..nf
ℜ
real numbers
℘
Constant to calculate the maximum number of clusters of a fuzzy clustering algorithm
→
mapping or implication operator
Λ
Level set of a fuzzy set
⊓
meet operator
⊔
join operator
°
composition operator
★
general t-norm
∧
min operator
∨
max operator
∪
union (disjunction) operator
∩
intersection (conjunction) operator
Chapter 1
Introduction
This chapter presents the motivation of this book. The new methodologies are briefly presented and the contents of each chapter are given.
"I coined the word 'fuzzy' because I felt it most accurately described what was going on in the theory. I could have chosen another term that would have been more 'respectable' with less pejorative connotations. I had thought about 'soft', but that really did not describe accurately what I had in mind. Nor did 'unsharp', 'blurred', or 'elastic'. In the end, I couldn't think of anything more accurate so I settled on 'fuzzy'." Prof. Lotfi A. Zadeh
1.1 Motivation

Uncertainty exists in almost every real-world problem. In general, uncertainty is inseparable from measurement. It emerges from a combination of the limits of measurement instruments and unavoidable errors in measurement. In cognitive problems, uncertainty emerges from the vagueness and ambiguity inherent in natural languages. In social intercourse, it may emerge from the shared meanings that people construct in social interactions. Thus, uncertainty is essential to human beings at all levels of their interaction with the real world. An important turning point in the evolution of the modern concept of uncertainty is the publication of Prof. Lotfi A. Zadeh's seminal paper in 1965. In this paper, Zadeh introduced the theory of fuzzy sets, which are sets with imprecise boundaries. The unique property of fuzzy sets is that membership in a fuzzy set is not a matter of acceptance or denial, but rather a matter of degree. Fuzzy system modeling has been studied to deal with complex, not clearly explained, and uncertain systems, for which conventional mathematical models may fail to produce satisfactory results. Well-known fuzzy systems are based on fuzzy sets and logic, which are especially useful for systems in which the concepts are vague. In most fuzzy models of fuzzy systems, fuzzy rule bases are used together
with membership functions of fuzzy linguistic terms for the input and output variables. In these models, the input-output relationships are represented by IF…THEN rules. In the current literature, there are variations of fuzzy inference systems with different structure identification techniques, some of which will be reviewed in the following chapters. Fuzzy rule bases comprise a series of fuzzy operators to fuzzify the inputs, aggregate and then map the input membership values onto the output domain, and then aggregate and defuzzify the membership values of the output variables. The identification of fuzzy sets, the shape and number of membership functions, and their parameters has been the focus of most fuzzy researchers. Numerous fuzzy system modeling approaches have been introduced that combine other soft computing approaches to optimize the parameters, or that use higher order fuzzy sets to capture uncertainties. Structurally, these approaches are based on fuzzy rule bases, which should not be the only structure with which to build fuzzy systems. We believe that these approaches have some difficulties, which will be addressed throughout the following chapters. Thus, the main aim of this book is to present an alternative approach to fuzzy rule bases, entitled "Fuzzy Functions". The new Fuzzy Functions approach employs membership values differently than any other fuzzy system model implemented to date. The membership values can be thought of as 'atoms' that hold potential information about the system behavior, waiting to be activated to release their power. Such potential information obtained from membership values is captured in local Fuzzy Functions as predictors of the system behavior. Instead of using Fuzzy Rule Base (FRB) structures, Fuzzy Functions are implemented to build models of a given system. This book presents essential alternatives of the Fuzzy Functions approaches and their advancements. The aim is to build more powerful fuzzy models via autonomous approaches that can identify hidden structures through the optimization of their system parameters. Although it has been investigated for many years, the problem of uncertainty modeling is yet to be completely solved in the system modeling communities. In engineering problems, building reliable models depends on the identification of the important values of the variables of the model equations. However, in real-life cases, these important values may not be obtainable due to the imprecise, noisy, vague, or incomplete nature of the available information. The goal of this book is to build an uncertainty modeling architecture that captures the uncertainty in the parameters and structure of the Fuzzy Functions so as to exploit the most available information. The uncertainty in systems can be captured with higher order fuzzy sets, viz. interval-valued type-2 fuzzy sets, which were first introduced by Zadeh [1975a,b]. Type-2 fuzzy systems implement type-2 fuzzy sets to capture the higher order imprecision inherent in systems. In particular, this book introduces the formation of "Type-2 Fuzzy Functions" to capture uncertainties associated with system behavior. Initially, "Type-1 Fuzzy Functions", which are presented instead of type-1 fuzzy rule bases, will be thoroughly investigated. Then the type-2 extensions of the "Fuzzy Functions" will be discussed in terms of structure identification and inference strategies. Furthermore, to capture uncertainty intervals of membership
values and fuzzy function structures and their type-2 fuzzy functions optimally, evolutionary algorithms will be discussed. One of the aims of this book is to introduce alternative methods to current fuzzy system modeling methodologies in an effective and versatile manner. They can be used both for regression and classification domains easily (with slight changes in the modeling structure). The underlying structure of the introduced methodologies will be thoroughly examined in the following chapters. To give a general view of the system modeling structures, the following items are considered in this book:

• Multi-input single-output systems,
• Regression and binary classification system representation,
• Structure identification of a fuzzy system model that is based on a fuzzy clustering method and the "Fuzzy Functions" approach,
• Small to medium size datasets.
1.2 Contents of the Book

This book concentrates on new fuzzy system modeling approaches, which implement type-1 and type-2 fuzzy sets and use other soft computing methods, such as genetic algorithms and support vector machines, for parameter optimization, membership function identification, uncertainty interval capturing, and local function approximation. Two different versions of the new fuzzy clustering method, along with their validity measures, are presented. All these methods are depicted in Figure 1.1. The new methodologies presented in this book are listed in more detail as follows:

Fig. 1.1 Contents of the book at a glance: Structure Identification with Improved Fuzzy Functions (IFC*); Inference with Type-1 Improved Fuzzy Functions; New Cluster Validity functions for IFC*; Modeling Uncertainty with Type-2 Improved Fuzzy Functions; Parameter Optimization with Evolutionary Fuzzy Functions; Capturing Uncertainty Bounds with Evolutionary Type-2 Improved Fuzzy Functions; Case-Based Type Reduction Method; Extensions of each new method to classification problem domains
• Type-1 Fuzzy Systems Based on Improved Fuzzy Functions

Fuzzy systems based on Fuzzy Rule Bases (FRB) have been successfully used to model human problem-solving activities. A classical way to represent human knowledge is to use 'IF…THEN' fuzzy rules. It is mostly the interpretability property that makes fuzzy systems preferable to other soft computing approaches. There
are various FRB systems, depending on the way the antecedent and consequent parameters are formed. One should note that in these FRB systems, the membership values of fuzzy sets may represent different attributes, such as degree of belongingness, degree of firing, degree of compatibility, weight or strength of local functions, or individual objects. For instance, in Fuzzy c-Regression methods [Hathaway and Bezdek, 1993], membership values are used as weights (strengths) assigned to each local function identified by the system model. Nonetheless, to formalize the structure of the fuzzy function approaches, a new representation of membership values is introduced in this book. Fuzzy Functions approaches, initially presented by Turksen in 2005 [Turksen, 2008] and analyzed in [Celikyilmaz, 2005], have emerged from the idea of representing each unique rule of an FRB system in terms of 'fuzzy functions'. One of their prominent features is that the degree of belongingness of each sample vector in a fuzzy set has a direct effect on how the local fuzzy functions of the particular set are defined. The standard Type-1 Fuzzy Functions approaches are multi-variable crisp-valued functions, which implement type-1 fuzzy sets. In a sense, membership values are used as additional dimensions in approximating hyper-surfaces for each cluster identified by any type of fuzzy clustering algorithm. In their initial forms, we used the well-known fuzzy c-means clustering algorithm of Bezdek [1981a]. Then a new fuzzy clustering method, to be introduced in chapter 3, is implemented into the fuzzy functions approaches to enhance the membership values to predict the relationship between independent and dependent variables in local structures. This new fuzzy clustering algorithm introduces two new membership functions, for regression and classification (pattern analysis) models. We hypothesize that membership values obtained from improved clustering algorithms can increase the predictive power of the fuzzy models of each cluster when used together with the original input variables to explain the behavior of the input and output variables in local models. In this sense, the resulting fuzzy functions are referred to as "Improved Fuzzy Functions" and this approach is denoted as "Type-1 Improved Fuzzy Functions". The novel type-1 improved fuzzy functions approach introduces new structure identification (training) and inference (reasoning) methods, different from the training and inference methods of standard type-1 fuzzy functions, which use the Fuzzy c-Means (FCM) clustering method. It should be pointed out that fuzzy functions and improved fuzzy functions have been introduced instead of FRB methods to eliminate fuzzy operations altogether and, most importantly, to improve prediction power.
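As a hedged illustration of this idea, the sketch below clusters data with plain FCM and then fits one least-squares 'fuzzy function' per cluster, using the membership values and a simple transformation of them as extra predictors alongside the inputs. The function names, the feature set [1, x, u, u²], and all constants are illustrative choices, not the book's exact formulation, which is developed in Chapters 3 and 4:

```python
# A minimal sketch of the type-1 fuzzy functions idea, assuming plain FCM and
# an illustrative feature set; not the book's exact algorithm.
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns centers V (c x d) and memberships U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))          # random initial fuzzy partition
    for _ in range(iters):
        V = (U.T ** m @ X) / (U.T ** m).sum(axis=1, keepdims=True)  # weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        p = 2.0 / (m - 1.0)
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)        # membership update
    return V, U

def fit_fuzzy_functions(X, y, U):
    """One least-squares 'fuzzy function' per cluster on the features [1, x, u, u^2]."""
    coefs = []
    for i in range(U.shape[1]):
        u = U[:, i:i + 1]
        Phi = np.hstack([np.ones_like(u), X, u, u ** 2])  # memberships as predictors
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        coefs.append(w)
    return coefs
```

A crisp model output for a sample can then be formed as the membership-weighted combination of the per-cluster predictions, which is the role the inference step plays in the approaches of Chapter 4.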
• Novel Fuzzy Clustering Methods

Fuzzy clustering algorithms, especially the Fuzzy C-Means (FCM) clustering algorithm of Bezdek [1981a], have been the most commonly used and extended methods in the fuzzy systems research area. In the FCM clustering algorithm, the local structures are represented with fuzzy sets, which are defined by membership functions. In the initial fuzzy function practices, we used the FCM clustering algorithm. Later, to enhance the power of the membership values obtained from the membership functions, a new fuzzy clustering algorithm is introduced. The new membership functions can produce enhanced membership values.
Since supervised multi-input single-output (MISO) models are the main interest of this book, we designed a clustering algorithm that can approximate local models to explain the relationship between inputs and outputs. The FCM clustering algorithm is designed to find local groupings of data vectors while leaving out any linear or non-linear relationships between them. However, data objects could be grouped by considering not only the local groupings of input-output data (clusters) but also their relationships. Here a new clustering algorithm, namely Improved Fuzzy Clustering (IFC), is introduced to find local fuzzy partitions together with fuzzy functions based on combined clustering structures. The IFC structure is influenced by Höppner and Klawonn [2000, 2003], Chen et al. [1998] and Menard [2001]. The structures of these earlier fuzzy clustering algorithms and the introduced IFC algorithm are similar in the sense that they share a similar objective function, which combines standard fuzzy clustering and fuzzy c-regression methods. However, there are many structural differences between the IFC introduced in this book and the others. These differences are discussed in the following chapters. It should be noted that the new IFC is implemented in a system of Fuzzy Functions, where membership values are used as additional predictors in approximating the local input-output relationships. Earlier similar research [Höppner and Klawonn, 2003; Chen et al., 1998] uses FRB structures to build fuzzy models. The new IFC carries out two objectives: (i) to find a good representation of the partition matrix, which captures multiple model structures of the given system by identifying hidden patterns; (ii) to shape the membership functions so that the membership values obtained from these functions can also help to explain the local input-output relationships. An extension of IFC, namely Improved Fuzzy Clustering for Classification (IFC-C), is presented as an alternative clustering methodology for classification systems.
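To make the flavor of such a combined objective concrete, the LaTeX fragment below sketches the generic form shared by this family of algorithms, assuming an FCM compactness term plus a term penalizing the error of the interim functions hi; this is a paraphrase under that assumption, and the exact IFC objective and its update equations are derived in Chapter 3:

```latex
J_{m}^{IFC} \;=\; \sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^{m}\,\lVert x_k - v_i\rVert^{2}
\;+\; \sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^{m}\,\bigl(y_k - h_i(\tau_{ik})\bigr)^{2}
```

Minimizing the first term alone recovers standard FCM; the second term pulls the partition toward clusters whose interim fuzzy functions also explain the output, which is precisely objective (ii) above.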
• Cluster Validity Functions for the New Fuzzy Clustering Methods
In general, fuzzy clustering methods, including the introduced Improved Fuzzy Clustering (IFC) algorithm, assume that some initialization parameters are known prior to the model run. This is usually a problematic issue, since different parameters may produce different results, which could eventually affect system performance. The literature indicates that many cluster validity functions have been introduced to validate the number of clusters, especially for FCM [Bezdek, 1981a] clustering models. Among the well-known validity functions, those of [Fukuyama and Sugeno, 1989; Xie and Beni, 1991; Pal and Bezdek, 1995; Bezdek, 1976] are the most commonly used for FCM clustering validation. In later years, many variations of these functions, e.g., [Boguessa et al., 2006; Dave, 1996; Kim et al., 2003; Kim and Ramakrishna, 2005], were presented by modifying or extending the earlier validity functions. These validity functions measure characteristics of a point-wise fuzzy clustering method, i.e., the FCM clustering algorithm. Most validity indices are designed to validate the FCM clustering algorithm; they use characteristics of FCM to indicate the optimum number of clusters. In this sense, earlier validity indices designed for FCM clustering may not be suitable for other variations of fuzzy clustering algorithms that are designed for different purposes, e.g., the Fuzzy C-Regression (switching regression) algorithm (FCRM)
[Hathaway and Bezdek, 1993] or the novel Improved Fuzzy Clustering (IFC) approach. For these clustering algorithm variations, different validity measures have been introduced. For instance, in [Kung and Lin, 2004] a new validity index is introduced to determine the optimum number of clusters in FCRM applications. Their validity function is a modification of the Xie-Beni [1991] ratio-type validity function. Validity functions should be designed based on the objectives and structure of the corresponding fuzzy clustering method. In this book, two new cluster validity functions are introduced to determine the optimum number of clusters for models of the new clustering algorithms, IFC and IFC-C. These validity functions are ratio-type indices, which measure the compactness-to-separability ratio of the clusters as well as the performance of the local fuzzy functions.
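For orientation, the sketch below computes the classical Xie-Beni index, the ratio-type ancestor of the new indices; it is not the cviIFC formula of this book, which additionally accounts for the performance of the local fuzzy functions and is given in Chapter 3:

```python
# The classical Xie-Beni ratio-type validity index: compactness over separation.
# Smaller is better; it is evaluated over a range of c to pick the optimum.
import numpy as np

def xie_beni(X, V, U, m=2.0):
    """XB = sum_i sum_k u_ik^m ||x_k - v_i||^2 / (n * min_{i != j} ||v_i - v_j||^2).
    The original index uses m = 2, i.e., squared memberships."""
    d2 = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) ** 2   # (n, c) distances
    compactness = np.sum((U ** m) * d2)
    vd2 = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2  # (c, c) center gaps
    np.fill_diagonal(vd2, np.inf)                                     # ignore i == j
    return compactness / (len(X) * vd2.min())
```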
• An Evolutionary Algorithm to Optimize the Novel Type-1 Fuzzy Functions
An important characteristic of fuzzy function systems is that they require a few key parameters that may affect the performance of the fuzzy system's behavior. Optimization of each of these parameters usually requires analysis of different components. For instance, the number of clusters is analyzed with cluster validity functions, whereas the optimum degree-of-fuzziness parameter should be analyzed by observing the behavior of the membership functions or the performance of the overall model. In a system model, these parameters may affect each other; therefore, they should be optimized interactively, rather than separately. In addition, a procedure that learns the fuzzy function system automatically from data must respect the above properties. There are several optimization methods, and yet there is no universal method. Standard gradient-based optimization methods might not be effective in the context of fuzzy systems, given their non-linear character (in particular the form of the membership functions) and the modularity of the systems. This forces us to explore other optimization methods with more global optimization capabilities, e.g., genetic algorithms. In this book, we implement evolutionary algorithms using genetic algorithms to encode the system parameters as part of a hybrid genetic code. Fuzzy function system models are then evolved to capture the optimum system parameters based on a stochastic search method. Hence, new evolutionary type-1 fuzzy structure identification and inference methods are introduced by combining evolutionary algorithms with improved fuzzy functions. The stochastic optimization within the fuzzy functions approach reduces the number of computation steps of parameter optimization compared to type-1 improved fuzzy functions based on an exhaustive (grid) search method.
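A hedged sketch of how such a genome might look is given below; the gene names, value ranges, and the uniform mutation and crossover operators are illustrative assumptions, not the book's exact encoding, which is detailed in Chapters 4 and 5:

```python
# Illustrative genome for joint optimization of fuzzy-function hyperparameters.
import random

PARAM_SPEC = {                      # gene -> (low, high, is_integer); ranges are assumed
    "c": (2, 10, True),             # number of clusters
    "m": (1.1, 3.0, False),         # degree of fuzziness
    "C_reg": (0.1, 100.0, False),   # SVM regularization (if SVM fuzzy functions are used)
    "eps": (0.01, 0.5, False),      # SVM epsilon tube
}

def random_genome():
    return {g: (random.randint(lo, hi) if is_int else random.uniform(lo, hi))
            for g, (lo, hi, is_int) in PARAM_SPEC.items()}

def mutate(genome, rate=0.2):
    """Uniform mutation: re-sample each gene with probability `rate`."""
    child = dict(genome)
    for g, (lo, hi, is_int) in PARAM_SPEC.items():
        if random.random() < rate:
            child[g] = random.randint(lo, hi) if is_int else random.uniform(lo, hi)
    return child

def crossover(a, b):
    """Uniform crossover: each gene is inherited from either parent."""
    return {g: (a[g] if random.random() < 0.5 else b[g]) for g in a}
```

The fitness of each genome would be the validation error of the fuzzy function model built with the decoded parameters, so that interacting parameters are optimized jointly rather than one at a time.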
• A New Type-2 Improved Fuzzy Functions Approach to Capture Parameter Uncertainties
Type-1 improved fuzzy function systems and the evolutionary type-1 fuzzy functions are introduced to improve system model performance and to eliminate the fuzzy operators of fuzzy inference systems. However, they have limited capacity to identify uncertainties, because type-1 fuzzy sets are used in these systems. Quite often, the knowledge that we use to construct the fuzzy functions is uncertain.
Type-1 systems are not able to deal with fuzzy function uncertainties directly. Such uncertainties cause the generation of fuzzy functions whose membership values are uncertain. Type-2 fuzzy sets, on the other hand, can be very useful for identifying the uncertainty interval of the membership values of a fuzzy set, which encapsulates the unknown exact membership function. In this respect, type-2 fuzzy sets can be used to handle fuzzy function uncertainties while determining membership values. In addition, in fuzzy function systems, some of the essential system parameters (such as the number of clusters, the degree of fuzziness, and the parameters and structure of the fuzzy functions of each cluster) are provided by the designer before the model construction, which potentially introduces various sorts of uncertainty. Hence, to capture some of the uncertainties of system models, a new type-2 improved fuzzy function system is introduced.
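One simple way to see how an uncertain parameter induces an interval of membership values is sketched below, where the FCM membership formula is evaluated at both ends of an assumed fuzziness interval [m_low, m_up]; this mirrors the spirit of the discrete interval type-2 construction, though the book's approach (Chapter 5) also varies the interim fuzzy function structures:

```python
# Membership interval induced by an uncertain fuzziness parameter m; the
# interval endpoints below are illustrative.
import numpy as np

def fcm_membership(x, centers, m):
    """FCM membership of a single point x in each cluster for a given m."""
    d = np.linalg.norm(centers - x, axis=1) + 1e-12
    p = 2.0 / (m - 1.0)
    return (d ** -p) / (d ** -p).sum()

def membership_interval(x, centers, m_low=1.4, m_up=2.6):
    """Lower/upper membership bounds from the two ends of the m interval."""
    u1 = fcm_membership(x, centers, m_low)
    u2 = fcm_membership(x, centers, m_up)
    return np.minimum(u1, u2), np.maximum(u1, u2)
```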
• A New Case-Based Inference and Case-Based Type Reduction Method for Type-2 Fuzzy Functions
The membership values of the new improved fuzzy clustering algorithm use a hybrid distance measure, which combines the distance between each data vector and each cluster with the error measure of the interim fuzzy functions. Thus, to compute the membership values of new observations, one needs to use the training cases to obtain the information on the error of these functions. A case-based inference method is presented, in which one uses the nearest training vectors to fuzzify the new observations; a sketch follows after this discussion. This method is an example of a semi-parametric inference algorithm. When higher order fuzzy sets are used in fuzzy systems, the type of the fuzzy sets should be reduced before defuzzification. Defuzzification then reduces the type from one to zero to get a crisp value. In the type-2 improved fuzzy functions approach, after the fuzzification step, the type of the interval type-2 membership functions is reduced down to type-1 based on the new Case-Based Type Reduction method. This method is based on a search method, which does not require complex fuzzy operators, unlike the other type reduction methods of standard type-2 fuzzy logic systems that are based on fuzzy rule bases [Karnik et al., 1999; Mendel, 2001]. After the type of the fuzzy sets is reduced down to type-1, the rest of the steps of the introduced type-2 improved fuzzy functions system are similar to those of the type-1 improved fuzzy functions system. In the paragraph above, we used expressions such as "type-1 or type-2 membership functions". In fact, one ought to say "type-1 or type-2 membership values" generated from a clustering algorithm. It should be noted that "membership functions" are idealized representations of membership value scatter clouds. Therefore, in this writing, whenever "membership functions" are stated, the intention is the idealized, curve-fitted representation of scatter clouds of membership values. In particular, in this book, the aim is to point out that in earlier type-2 fuzzy logic system approaches usually an expert defines the membership functions of the fuzzy sets that form the type-2 fuzzy rule bases. In the introduced "Fuzzy Functions" approach, in contrast, one does not curve-fit over the scatter clouds extracted from clustering methods and one does not define idealized membership functions,
but one uses the membership values and their transformations as new variables in generating the form of the fuzzy functions.
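The sketch below illustrates the case-based step for a single new observation, assuming inverse-distance weighting over the κ nearest training cases; the weighting scheme is an illustrative assumption, and the actual inference procedure, which also retrieves the optimum fuzzy functions for those cases, is specified in Chapter 5:

```python
# Case-based fuzzification of a new observation from its nearest training cases.
import numpy as np

def case_based_memberships(x_new, X_train, U_train, kappa=5):
    """Estimate the membership vector of x_new from its kappa nearest neighbors;
    needed because the improved memberships depend on interim-function errors
    that are only available for training cases."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(d)[:kappa]                 # indices of the nearest training vectors
    w = 1.0 / (d[nn] + 1e-12)                  # inverse-distance weights (assumed scheme)
    return (w[:, None] * U_train[nn]).sum(axis=0) / w.sum()   # one value per cluster
```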
• An Evolutionary Algorithm to Optimize the Type-2 Improved Fuzzy Functions
The system parameters of the type-2 improved fuzzy functions, e.g., the uncertainty interval of the membership values, the number of hidden structures, the list of fuzzy function structures, the type of the function estimation and its corresponding parameters, etc., are identified based on a grid search. It is possible to find an optimal model when a wider search space is used, yet this is computationally inefficient when there are many system parameters to optimize. An evolutionary algorithm using genetic algorithms is implemented in the type-2 improved fuzzy functions approach to identify such system parameters automatically. With the new method, the number of iterations of the type-2 fuzzy functions optimization, which is based on a grid search, is reduced, and the optimum parameters are automatically determined based on a stochastic search algorithm. With the implementation of evolutionary search algorithms in the uncertainty modeling of improved fuzzy functions approaches, we are trying to locate the uncertainty interval in which the optimum model lies. The evolutionary model is capable of optimizing this interval based on the given data and the interval-valued parameters.
• A New Performance Measure to Be Used in the Evaluation of Stock Price Prediction Problems
In any trading system modeling, the main goal is to improve profitability. A profitable prediction is a better prediction even if it has lower accuracy based on other criteria, e.g., accuracy in predicting the next-day direction of a stock. In [Deboeck, 1992] it was shown that a neural network that correctly predicted the next-day direction 85 percent of the time consistently lost money. Although the system often predicted the market direction correctly, the accuracy of its price predictions was low. Hence, one can conclude from such instances that the evaluation of trading models should not be based only on the predicted directions of stock price movements. In addition, as will be shown in the results of the stock price predictions, the accuracies of the benchmark methods are not always significantly different from one another. Thus, it can be difficult to identify the best model based on error reduction alone. Since the aim of stock trading models is to return a profit, profitability should be the performance measure. For these reasons, on top of the well-known performance measures used in this book, we introduce a new criterion, the Robust Simulated Trading Benchmark (RSTB), based on the profitability of models that predict stock prices. The RSTB combines three different properties in one performance measure: the market directions, the prediction accuracy, and the robustness of the models. RSTB is driven by a conservative trading approach.
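As a toy illustration of why profit, rather than error, separates such models, the sketch below simulates a naive long/flat strategy with a fixed commission; this is not the RSTB formula itself, which is defined in Chapter 6, and the trading rule and commission value are illustrative assumptions:

```python
# Toy simulated-trading evaluation: buy when a rise is predicted, sell when a
# fall is predicted, and report net profit after commission. Not the RSTB.
def simulated_profit(actual, predicted, commission=0.001):
    """Net profit of a naive long/flat strategy on one unit of stock."""
    cash, position = 0.0, 0                    # position: 0 = flat, 1 = long one unit
    for t in range(1, len(actual)):
        if predicted[t] > actual[t - 1] and position == 0:     # predicted rise: buy
            cash -= actual[t - 1] * (1 + commission)
            position = 1
        elif predicted[t] < actual[t - 1] and position == 1:   # predicted fall: sell
            cash += actual[t - 1] * (1 - commission)
            position = 0
    if position == 1:                                          # liquidate at the end
        cash += actual[-1] * (1 - commission)
    return cash
```

Two models with near-identical RMSE can earn very different profits under such a rule, which is the gap the RSTB is designed to expose.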
In summary, the new methodologies presented in this book can be listed as follows:

• Two novel improved fuzzy clustering methods, for regression and classification problem domains.
• A new type-1 improved fuzzy functions approach instead of fuzzy rule bases.
• A new type-2 improved fuzzy functions approach for uncertainty modeling.
• Extensions of type-1 and type-2 fuzzy functions to classification problem domains.
• Two new hybrid type-1 and type-2 fuzzy functions approaches using evolutionary computing methods.
• Extensions of the hybrid methods to classification problem domains.
• Two novel cluster validity functions to evaluate the new improved fuzzy clustering methods, for regression and classification problem domains.
• A new semi-parametric case-based inference method for type-1 improved fuzzy functions.
• A new case-based type reduction method for type-2 improved fuzzy functions.
• Extensions of the type-1 and type-2 improved fuzzy functions, and the hybrid type-1 and type-2 improved fuzzy functions approaches, using the standard FCM clustering method instead of the improved fuzzy clustering method.
• A new performance measure specifically implemented for stock price prediction systems.
1.3 Outline of the Book

In chapter 2, the basic concepts of fuzzy set and logic theory, which are required to explain the introduced algorithms, are presented. Since a new fuzzy functions approach is introduced in this book, details of the introduced fuzzy functions are given. Reviews of some of the well-known fuzzy rule base structures and their extensions, which are used for benchmarking purposes in the chapter on experiments (Chapter 6), are presented. In chapter 3, a detailed literature review of fuzzy clustering algorithms is given, since fuzzy clustering is the core of the structure identification method of the new fuzzy functions method of this book. Next, the introduced improved fuzzy clustering algorithm for prediction models and its extension to classification problem domains are presented. Starting from its motivation, the mathematical structure, the differences from similar fuzzy clustering algorithms, and the justification of its performance improvement over the well-known fuzzy c-means clustering are thoroughly analyzed with simulation experiments. After reviewing the literature on well-known cluster validity indices and discussing their strengths and weaknesses on different fuzzy clustering methods, two new cluster validity indices are presented for the two novel fuzzy clustering algorithms. Their strengths are justified
through simulation experiments. Finally, a summary of the contributions is given at the end of the chapter. In chapter 4, the presented fuzzy system with fuzzy functions is explained comprehensively for prediction domains, using the standard fuzzy c-means clustering algorithm, along with the improved fuzzy functions approach with the novel improved fuzzy clustering algorithm. The presented algorithms are explained in detail at each step of the fuzzy system modeling. In addition, extensions of the new algorithms are presented for classification problem domains. Then, the structure identification and inference methods of the new fuzzy modeling with the improved fuzzy functions methodology are presented based on evolutionary algorithms at the end of the chapter. The reduction of the number of structure identification steps with the presented evolutionary fuzzy functions, as compared to the earlier fuzzy functions, is discussed. As in the other chapters, a summary is provided at the end of the chapter. In chapter 5, uncertainty modeling with improved fuzzy functions is presented. A three-phase type-2 fuzzy system architecture is introduced. Then a new type reduction method is introduced, using evolutionary algorithms to identify the uncertainty interval of the fuzzy sets. A summary is given at the end of the chapter. In chapter 6, the new algorithms are applied to ten different datasets of different structures, i.e., regression and classification problem domains. The results are compared with those obtained from other system modeling techniques, such as the adaptive network fuzzy inference system, neural networks, the dynamically evolving neural fuzzy inference system, support vector machines, and other fuzzy system models whose structures are closest to the presented system modeling approaches. The performance of each modeling technique is discussed in terms of performance improvement, robustness, and statistical significance. In addition, the effectiveness of each presented algorithm is discussed separately based on the results. In the final chapter, chapter 7, conclusions and remarks are given about the presented algorithms and possible extensions.
Chapter 2
Fuzzy Sets and Systems
This chapter reviews the basic principles of fuzzy sets and fuzzy logic, as well as the inference methodology of approximate reasoning and the extension principle, which are fundamental parts of structure identification with traditional fuzzy rule base systems. Also presented are the "Fuzzy Functions" as defined in the literature and as presented in this work. Extensions of some well-known fuzzy inference systems, including the structures of well-known hybrid fuzzy systems, are also presented at the end of this chapter.
"No assertion is ever known with certainty... but that does not stop us making assertions." Carneades, 214-129 BCE
2.1 Introduction

Professor Lotfi A. Zadeh [1965] introduced the concept of "Fuzzy Sets", which are sets with imprecise boundaries. He stated that "membership" in a fuzzy set is not a matter of affirmation or denial, but rather a matter of degree. Over the last forty years, his proposal has gained recognition as an important point in the evolution of the modern concept of imprecision and uncertainty, and his innovation represents a paradigm shift from classical, or crisp, sets to "Fuzzy Sets". Fuzzy set and fuzzy logic theory are renowned theories with which one can capture the natural phenomena of imprecision and uncertainty. The characteristic function of a fuzzy set, namely the membership function, is a function whose range is an ordered membership set within the closed unit interval. Therefore, a fuzzy set is usually characterized as a function. This chapter gives a brief introduction to fuzzy set and fuzzy logic theories. First, traditional fuzzy sets, i.e., type-1 fuzzy sets, and their fundamental properties
will be introduced. Operations on fuzzy sets and the connectives used to implement these operations are explained. Definitions of some basic concepts of conventional fuzzy system modeling approaches, including the extension principle, fuzzy numbers, operations on fuzzy sets, and fuzzy relations, are presented. Then, higher order fuzzy sets, i.e., type-2 fuzzy sets, which were introduced by Zadeh [1975a], will be reviewed. Next, the definition of fuzzy functions, as recognized in the literature and as described in this work as a new form of fuzzy system modeling, will be explained in detail. Finally, we give a brief summary of the structures of extensions of well-known fuzzy rule bases and of the hybrid fuzzy systems that are used as benchmark methodologies in the chapter on the experiments we have carried out (Chapter 6).
2.2 Type-1 Fuzzy Sets and Fuzzy Logic

A (crisp) set is a collection of distinct objects. A crisp set is defined in a way that partitions the objects in a given universe of discourse into two groups: members (those that do belong to the set) and non-members (those that do not belong to the set). In traditional set theory, there exists a sharp distinction between members and non-members of a set. Contrary to crisp sets, a fuzzy set is formed by assigning each object a membership value in the interval [0,1]. Membership values represent the degree to which an object belongs to a fuzzy set. Let X denote the universe of discourse, where x represents an element of the universe X, and let A denote a fuzzy set. A fuzzy set is hence characterized by its membership function, μA(x), as:
μA(x): X → [0, 1] (2.1)
The membership function, as described in (2.1), states that the values assigned to the elements of the universal set X fall within a specified range, and it indicates the membership grade of these elements in the fuzzy set A. Further details of membership functions can be found in [Turksen, 1991; 1999; 2001; 2002]. A fuzzy set A on the universe of discourse X can also be defined as a set of ordered pairs:

A = {(x, μA(x)) | x ∈ X} (2.2)
Membership functions have commonly been formulated with straight lines, which are idealized representations. In real-life applications, membership values form scatter-clouds, which are then normalized into idealized shapes such as triangular, trapezoidal, or Gaussian membership functions. Figure 2.1 shows the membership functions of the fuzzy sets "young", "middle age", and "old". The horizontal axis of the graph in Figure 2.1 represents "age" in years (the universal set X) and the vertical axis represents the degree to which a person can be labeled "young", "middle age", or "old". Hence, each curve in Figure 2.1 represents the membership function of a fuzzy set: "the group of people that can be considered young", "the group of people that can be considered middle aged", and "the group of people that can be considered old".
Fig. 2.1 Membership functions of fuzzy sets "young" (green), "middle age" (red) and "old" (blue); horizontal axis: AGE in years (ticks at 25, 50, 75, 100), vertical axis: membership grade μA in [0,1]
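To make (2.1) and the idealized shapes above concrete, the following minimal Python sketch (an illustration added here, not part of the original text) evaluates triangular and trapezoidal membership functions; the breakpoints chosen for "young" and "middle age" are assumptions loosely read off Figure 2.1.

```python
# Illustrative sketch: type-1 membership functions as ordinary functions
# X -> [0, 1]. Breakpoints are assumed values, not fixed by the text.

def triangular(x, a, b, c):
    """Rises linearly on [a, b], falls on [b, c], zero elsewhere."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Equals 1 on [b, c], with linear shoulders on [a, b] and [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return 1.0 if x <= c else (d - x) / (d - c)

young      = lambda age: trapezoidal(age, -1, 0, 20, 35)   # assumed shape
middle_age = lambda age: triangular(age, 25, 50, 75)       # assumed shape

for age in (18, 30, 45):
    print(age, round(young(age), 2), round(middle_age(age), 2))
```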
2.2.1 Characteristics of Fuzzy Sets

Conventionally, there are two alternative notations used to represent fuzzy sets: discrete or continuous. A fuzzy set A on a discrete universe of discourse X is represented by:

A = ∑_{x∈X} μA(x)/x (2.3)
In (2.3), '∑' does not represent the mathematical addition operation; it is an aggregation operator that represents the collection of membership values belonging to the fuzzy set A. A fuzzy set A on a continuous universe of discourse X is denoted by the following equation:

A = ∫_{x∈X} μA(x)/x (2.4)
Again, '∫' is not an integration operation but an aggregation operation. One of the most important concepts of fuzzy sets is the concept of the "α-level set (α-cut)" and the "strong α-cut". Given a fuzzy set A defined on X and any number α ∈ [0,1], the α-cut, αA, and the strong α-cut, α+A, are defined as:

αA = {x | μA(x) ≥ α} (2.5)

α+A = {x | μA(x) > α} (2.6)
The α-cut and the strong α-cut of a fuzzy set A are crisp sets that contain all the elements of the universal set X whose membership grades in A are 'greater than or equal to' or 'greater than' the specified value of α, respectively. The set of all levels α ∈ [0,1] that represent distinct α-cuts of a given fuzzy set A is called the level set of A. Any fuzzy set can be represented using its level sets. This representation is called
the α-cut decomposition (representation) of the fuzzy set and is formulated as a theorem. The theorem states that for any fuzzy set A on U,

μA(u) = sup_{α∈[0,1]} min{α, μ_{αA}(u)}, ∀u ∈ U (2.7)
Notice that μ_{αA}(u) is the membership function of a crisp set and assumes either 0 or 1. The "support" of a fuzzy set A within a universal set X is a crisp set that contains all the elements of X that have non-zero membership grades in A. It should be noted that the support of A is the same as the strong α-cut of A for α = 0, and is defined as

Supp(A) = {x ∈ X | μA(x) > 0} (2.8)
The '1-cut' of A is called the "core" of A and is denoted by

Core(A) = {x ∈ X | μA(x) = 1} (2.9)
The "height" of a fuzzy set A, h(A), is the largest membership grade attained by any element in the set, and is defined as follows:

h(A) = sup_{x∈X} μA(x) (2.10)
A fuzzy set A is called "normal" when h(A) = 1 and "subnormal" when h(A) < 1. The fuzzy set characteristics defined above are illustrated with the aid of a trapezoidal membership function example, as shown in Figure 2.2 below.
Fig. 2.2 Fuzzy set properties (core, height, α-cut, support) on a "trapezoidal" membership function example with breakpoints a1, a2, a3, a4
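The characteristics above can be made concrete with a short sketch; the discrete fuzzy set used here is an illustrative assumption, not data from the text.

```python
# Illustrative sketch of eqs. (2.5)-(2.10) on a discrete fuzzy set stored
# as {element: membership grade} pairs (assumed values).

A = {10: 0.0, 20: 0.3, 30: 0.7, 40: 1.0, 50: 0.6, 60: 0.0}

def alpha_cut(fs, alpha, strong=False):
    """Crisp set of elements with grade >= alpha (> alpha if strong)."""
    return {x for x, mu in fs.items() if (mu > alpha if strong else mu >= alpha)}

support = alpha_cut(A, 0.0, strong=True)   # strong 0-cut, eq. (2.8)
core    = alpha_cut(A, 1.0)                # 1-cut, eq. (2.9)
height  = max(A.values())                  # eq. (2.10)

print(sorted(alpha_cut(A, 0.5)))              # [30, 40, 50]
print(sorted(support), sorted(core), height)  # A is normal: height == 1.0
```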
2.2.2 Operations on Fuzzy Sets

In classical set theory, four fundamental operations are defined on sets: complement, containment, intersection, and union. These four operations
are also defined for fuzzy sets as standard fuzzy set operations, along with many other fuzzy set operators. Two of the most widely used operators are the triangular norm (t-norm) and the triangular conorm (t-conorm), which are widely accepted in the fuzzy sets literature. Hence, in this section a brief definition of t-norms and t-conorms is also presented. In addition, the following definitions of standard fuzzy set operations are based on the De Morgan triplet (Min, Max, Complement). Let the entire set of finite elements be denoted as S = {x1,…,xn}. A fuzzy set A can be defined as A ⊆ S. Based on the membership function definition in (2.3), the membership values of fuzzy set A can be expressed as:
A = μA(x1)/x1 + μA(x2)/x2 + … + μA(xn)/xn = ∑_k μA(xk)/xk (2.11)
In (2.11), note that the '+' symbol does not refer to ordinary addition; it is a special symbolic representation of discrete fuzzy sets.

Definition 2.1. Equality of fuzzy sets: Equality is defined by the equality of membership functions. Let A, B ⊆ S represent two fuzzy sets; the equality of two fuzzy sets is defined as follows:

A = B ⇔ μA(x) = μB(x), ∀x ∈ S (2.12)
Definition 2.2. Inclusion of fuzzy sets: Defined by the inequality of membership functions as follows:

A ⊆ B ⇔ μA(x) ≤ μB(x), ∀x ∈ S (2.13)
It should be remarked that 'A ⊂ B' is defined for fuzzy sets as follows:

A ⊂ B ⇔ μA(x) ≤ μB(x), ∀x ∈ S, and ∃y ∈ S such that μA(y) < μB(y) (2.14)
Definition 2.3. Union of fuzzy sets: Defined by the maximum of membership functions as:

A ∪ B : μA∪B(x) = max[μA(x), μB(x)] (2.15)
Definition 2.4. Intersection of fuzzy sets: Defined by the minimum of membership functions as:

A ∩ B : μA∩B(x) = min[μA(x), μB(x)] (2.16)

Definition 2.5. Complement of a fuzzy set: The complement of fuzzy set A, denoted Ac, is defined as follows:

μAc(x) = 1 − μA(x) (2.17)
Figure 2.3 displays a graphical representation of the union and intersection of fuzzy sets, and Figure 2.4 displays the graphical representation of the complement of a fuzzy set. Logical operations have been the focus of considerable discussion over the evolutionary history of fuzzy logic. Originally, the "min" and "max" functions were presented to model logical conjunction and disjunction. These functions obviously generalize traditional Boolean operators, but it was immediately recognized that they were not the only possible functions. In the last three decades, many different operators, including triangular norms (t-norms for short) and triangular conorms (t-conorms for short), have been presented. The following definitions of t-norms and t-conorms can be found in Klir and Yuan [1995].
Fig. 2.3 Union and intersection of fuzzy sets. Fig. 2.4 Complement of fuzzy set A, Ac (dashed line)
Let a, b, c be membership values in the unit interval [0, 1].

Definition 2.6. Intersection / triangular norm (t-norm) of fuzzy sets: A triangular norm (t-norm) is a function T on the unit interval [0,1] (i.e., z = T(a, b), with a, b, z ∈ [0,1]) that satisfies the following axioms:

Axiom T1. T(a, 1) = a (Boundary condition)
Axiom T2. T(a, b) = T(b, a) (Commutativity)
Axiom T3. If b1 ≤ b2 then T(a, b1) ≤ T(a, b2) (Monotonicity)
Axiom T4. T(a, T(b, c)) = T(T(a, b), c) (Associativity)

T-norms are generally accepted as equivalent to the class of fuzzy intersections. Therefore, these axioms are called the axiomatic skeleton for fuzzy intersections/t-norms.
The boundary condition in Axiom T1 ensures that when one argument of a t-norm is 1 (in other words, its membership value is 1), the membership grade of the intersection is equal to the other argument. Commutativity ensures that the fuzzy intersection is symmetric. The monotonicity property of a t-norm ensures that a decrease in the membership value in set A or B cannot increase the degree of membership in the intersection. Axiom T4 states that the intersection of any number of sets can be taken in any order. Beyond these four axioms, the following additional properties are often imposed on t-norms:

Axiom T5. T must be a continuous function (Continuity)
Axiom T6. T(a, a) < a (Sub-idempotency)
Axiom T7. a1 ≤ a2 and b1 ≤ b2 implies T(a1, b1) ≤ T(a2, b2) (Strict monotonicity)

The axiom of continuity prevents large changes in the membership value of A ∩ B when there is a very small change in the membership value of either set A or B. The axiom of sub-idempotency forces the membership grade in A ∩ B not to exceed the value a when both membership grades in A and B have the same value a.

Definition 2.7. Union / triangular conorm (t-conorm) of fuzzy sets: A triangular conorm (t-conorm) is a function C on the unit interval [0, 1] (i.e., z = C(a, b), with a, b, z ∈ [0,1]) that satisfies the following axioms:
Axiom C1. C(a, b) = C(b, a) (Commutativity)
Axiom C2. C(a, C(b, c)) = C(C(a, b), c) (Associativity)
Axiom C3. If b1 ≤ b2 then C(b1, c) ≤ C(b2, c) (Monotonicity)
Axiom C4. C(a, 0) = a (Boundary condition)

The fuzzy union function returns the membership grade of A ∪ B. Axioms T1-T4 differ from axioms C1-C4 only in the boundary condition. As with t-norms, the following additional properties are often imposed:

Axiom C5. C must be a continuous function (Continuity)
Axiom C6. C(a, a) > a (Super-idempotency)
Axiom C7. a1 ≤ a2 and b1 ≤ b2 implies C(a1, b1) ≤ C(a2, b2) (Strict monotonicity)

Definition 2.8. Combination of intersection and union (t-norms and t-conorms): According to classical set theory, the intersection and union operations on sets are dual with respect to each other if they satisfy the De Morgan laws:
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ and (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ (2.18)
However, this duality is satisfied only for some combinations of t-norms, t-conorms, and fuzzy complements. A t-norm T and a t-conorm C are dual with respect to the fuzzy complement if and only if:

T(a, b) = 1 − C(1 − a, 1 − b), or (2.19)

C(a, b) = 1 − T(1 − a, 1 − b) (2.20)
is true. The triplet formed by a t-norm, its dual t-conorm, and the standard complement is called a De Morgan triplet. De Morgan triplets were first introduced by Zadeh [1965], and Zadeh's triplet is one of the most widely used. The fuzzy De Morgan triplet can be explained as follows. Let A and B be two fuzzy sets on a universe of discourse X, with membership values a = μA(x) and b = μB(x). According to Zadeh's De Morgan triplet, the intersection, union, and complement operations on these sets are defined as follows:

Intersection: A ∩ B : μA∩B(x) = a ∧ b = min(a, b), ∀x ∈ X
Union: A ∪ B : μA∪B(x) = a ∨ b = max(a, b), ∀x ∈ X
Complement: Ā : μĀ(x) = c(a) = 1 − a, ∀x ∈ X

There are also other De Morgan triplets, which use different t-norm and t-conorm operators; some of these are listed in Table 2.1.

Table 2.1 Some well-known t-norms and t-conorms

Name                 | t-norm                                  | t-conorm
Zadeh                | min(x, y)                               | max(x, y)
Probabilistic        | xy                                      | x + y − xy
Lukasiewicz          | max(x + y − 1, 0)                       | min(x + y, 1)
Schweizer and Sklar  | max(0, xᵖ + yᵖ − 1)^(1/p), p ≠ 0        | 1 − max(0, (1−x)ᵖ + (1−y)ᵖ − 1)^(1/p)
Weber (λ > −1)       | max((x + y − 1 + λxy) / (1 + λ), 0)     | min(x + y + (λ/(1−λ))xy, 1)
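The first three rows of Table 2.1 and the duality conditions (2.19)-(2.20) can be checked directly; the sketch below is illustrative and uses the standard complement.

```python
# Illustrative check of Table 2.1 pairings and the duality (2.19)-(2.20)
# with respect to the standard complement c(a) = 1 - a.

t_norms = {
    "Zadeh":         lambda a, b: min(a, b),
    "Probabilistic": lambda a, b: a * b,
    "Lukasiewicz":   lambda a, b: max(a + b - 1.0, 0.0),
}
t_conorms = {
    "Zadeh":         lambda a, b: max(a, b),
    "Probabilistic": lambda a, b: a + b - a * b,
    "Lukasiewicz":   lambda a, b: min(a + b, 1.0),
}

a, b = 0.4, 0.7
for name, T in t_norms.items():
    C = t_conorms[name]
    assert abs(T(a, b) - (1.0 - C(1.0 - a, 1.0 - b))) < 1e-12  # eq. (2.19)
    print(f"{name}: T(a,b)={T(a, b):.2f}, C(a,b)={C(a, b):.2f}")
```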
2.3 Fuzzy Logic

2.3.1 Structure of Classical Logic Theory

Classical logic is the study of the methods and principles of reasoning. It deals with propositions that are required to be either true or false. One area of logic is propositional logic, which deals with combinations of variables that stand for arbitrary propositions. The main concern of propositional logic is the study of rules by which new logic variables can be produced as functions of some given logic variables; it expresses all the logic functions of n variables in terms of simple logic operations. Basic set operations are given in Appendix A.1. There are five main logic operations: negation (¬), conjunction (∧), disjunction (∨), implication (⇒), and equivalence (=). Using these
operations, one can define other operations. These five operations are the most commonly employed operations in formal logic. By combining different operators, for example negations, conjunctions, and disjunctions, in logic formulas, we can form other logic functions. When a variable represented by a logic formula is always true, the formula is called a tautology. Various forms of tautologies, called inference rules, can be constructed to deduce inferences. Let p and q be two propositions. The most frequently used basic inference form in classical logic is the "Modus Ponens" inference form, which is denoted as follows:
Modus Ponens:
1. p ⇒ q
2. p
∴ q (2.21)
The two propositions above the line in (2.21) are called the "premises" and the proposition below the line is the "conclusion". If the two premises are true, then the conclusion is also true. Modus Ponens can also be represented by

[(p → q) ∧ p] → q (2.22)

where p and q are two propositions, "→" denotes the implication operation, and "∧" denotes the conjunction operation in logic.
2.3.2 Relation of Set and Logic Theory

The study of logic goes back more than two thousand years, and many symbols and diagrams have been devised since. Around 300 BC, Aristotle introduced letters as term-variables. The modern era of mathematical notation in logic began with George Boole (1815-1864), although none of his notation survived. Set theory came into being in the late 19th and early 20th centuries, largely as a creation of Georg Cantor (1845-1918). It should be emphasized that, although set theory and logic theory were developed independently of each other, there is a relationship between them. Anything that can be expressed with set notation can also be expressed with linguistic terms in logic theory. In set theory, an element either belongs or does not belong to a particular set, expressed linguistically as "x is an element of A"; in logic theory, a linguistic expression is assigned to be "true" or "false", expressed as "x is an element of A" is true or false. In order to express complex reasoning structures, simple atomic propositions (of one or two premises at most) can be composed. The main idea in drawing conclusions from a set of premises is to discover the different ways in which the conclusion can be found or formed from the given premises. (The most frequently used basic inference form is Modus Ponens.)
2.3.3 Structure of Fuzzy Logic

The fundamental difference between classical propositions and fuzzy propositions is in the range of their truth-values. Each classical proposition is required to be
true or false. However, the truth or falsity of a fuzzy proposition is represented by a degree of truth, expressed by a number in the unit interval [0, 1]. In this sense, the term "Fuzzy Logic" emerged with the development of fuzzy set theory by Zadeh [1965], building on basic logic theory. In fuzzy sets, as defined earlier, a fuzzy subset A of a crisp set X is characterized by assigning to each element x of X a degree of membership of x in A. If X is a set of propositions representing fuzzy logic propositions, then we can assign a degree of truth such as "absolutely true", "absolutely false", or an "intermediate truth", e.g., one proposition being truer than another. The following definitions are adopted from [Klir and Yuan, 1995; Turksen, 2006]. Fuzzy logic is a system of concepts, principles, and methods for dealing with approximate rather than exact modes of reasoning. Hence, in analogy with fuzzy sets, there is an approximate truth concept in fuzzy logic; fuzzy logic and fuzzy sets are related in the sense that fuzzy logic is an application of fuzzy set theory. Fuzzy logic utilizes the concepts, principles, and methods of fuzzy set theory to formulate various forms of approximate reasoning, and the concept of degree of membership in fuzzy sets is utilized as a "degree of truth" in fuzzy logic. For a given fuzzy set A, the truth of a fuzzy proposition "x is a member of A" can be expressed, depending on circumstances, in two ways [Turksen, 2006]:

(i) In classical logic, it is the case that 'x is A' has an absolute degree of truth, i.e., t(A(x)) = 1; or alternatively,
(ii) in fuzzy logic, it is the case that 'x is A' has a degree of truth, i.e., t(A(x)) = t ∈ [0,1].
In the second case, we assign a fuzzy truth-hood to the expression "x is A". In classical logic theory (namely, two-valued logic), all propositions are either true or false [Klir and Yuan, 1995]. Fuzzy logic theory, on the other hand, allows inferences to include propositions whose truth-values are real numbers in the unit interval [0,1]. This expresses the uncertainty of truth-values; these values are interpreted as "degrees of truth". The focus of fuzzy logic is on reasoning with fuzzy propositions involving imprecise concepts, which are expressed in natural language. This reasoning is called approximate reasoning, to be explained in the next part. Fuzzy logic propositions, unlike classical logic propositions, involve the assessment of the truth-value of fuzzy propositions, in which the fuzziness of the propositions may arise from a combination of different linguistic components. The most commonly used propositions are fuzzy implications, which contain simple fuzzy propositions as antecedent and consequent, and are expressed as:

P: IF X is A, THEN Y is B, (2.23)
where X and Y are variables that take values x and y from sets X and Y, respectively, and A and B are relevant linguistic values, e.g., hot, warm, cold, characterized by fuzzy sets. This conditional fuzzy proposition declares the truth-hood of a proposition p in two different ways.
(i) The truth-hood is declared as absolute truth, "true":
p_{x,y}: IF "X is A" is true, THEN "Y is B" is true (2.24)
in which specific instances of "X is A is true" and "Y is B is true" are represented as A(x) and B(y), and the implication between them is symbolized as:

A(x) ⇒ B(y) is true (2.25)
A particular function must be chosen to find the degree to which a fuzzy proposition holds true. The most common implication function is the Lukasiewicz implication, denoted by I; the degree to which proposition Px,y holds true can be determined using the Lukasiewicz implication as follows:

I(Px,y) = I[A(x), B(y)] = min[1, 1 − A(x) + B(y)] (2.26)
In words, (2.26) states that the degree of truth of the conditional proposition Px,y is calculated by first finding the individual degrees of truth, A(x) and B(y), and then applying the chosen fuzzy implication method (in this case the Lukasiewicz implication) to the determined values of A(x) and B(y). The role of I is to provide a bridge between fuzzy sets and fuzzy propositions.

(ii) Alternatively, the truth-hood is declared as fuzzy truth-hood, with a value t ∈ [0,1], as follows:
Px,y: IF "'X is A' is true to the degree t1", THEN "'Y is B' is true to the degree t2" (2.27)

Specific instances of 'X is A' is t1 and 'Y is B' is t2 are represented as (A(x), t1) and (B(y), t2), and the implication between them is symbolized as:

(A(x), t1) ⇒ (B(y), t2) is true (2.28)
With the Lukasiewicz implication we get:

I(px,y) = I[(A(x), t1), (B(y), t2)] = [min[1, 1 − A(x) + B(y)], min[1, 1 − t1 + t2]] (2.29)
Note that case (i) can also be represented in this form as follows:

I(px,y) = I[(A(x), 1), (B(y), 1)] = min[1, 1 − A(x) + B(y)] (2.30)
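A short numeric sketch of (2.26) and (2.29) follows; the truth degrees used are illustrative assumptions.

```python
# Illustrative evaluation of the Lukasiewicz implication, eq. (2.26):
# I[A(x), B(y)] = min(1, 1 - A(x) + B(y)).

def lukasiewicz(ax, by):
    return min(1.0, 1.0 - ax + by)

A_x, B_y = 0.8, 0.6           # assumed degrees for "X is A" and "Y is B"
print(lukasiewicz(A_x, B_y))  # 0.8: the conditional holds to degree 0.8

# Case (2.29): truth qualifications t1, t2 handled by a second application.
t1, t2 = 1.0, 0.9
print((lukasiewicz(A_x, B_y), lukasiewicz(t1, t2)))  # (0.8, 0.9)
```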
In conclusion, fuzzy set theory and fuzzy logic are connected just as classical set theory and classical logic are. In fuzzy theory, with the introduction of degrees of truth, the sharp boundaries of classical set and logic theory are relaxed and the overlapping of fuzzy sets is allowed.
2.3.4 Approximate Reasoning

Approximate reasoning is an important concept in fuzzy logic in the sense that it represents the application area of fuzzy set theory. In classical logic theory
(two-valued logic), among the most important tools of reasoning are the inference rules, which help to derive new information from given propositions. As mentioned earlier, Modus Ponens is the most commonly used inference method. Hence, Modus Ponens, like all other concepts of classical set and logic theory, has been generalized to fuzzy set and logic theory. With the introduction of a calculus of fuzzy restrictions, Zadeh [1975b] paved the way towards a reasoning scheme called Generalized Modus Ponens (GMP). He defined a methodology known as the Compositional Rule of Inference (CRI), which is used to infer fuzzy consequents utilizing GMP. Generally, GMP is defined as follows:
Premise 1: A → B
Premise 2: A′
Deduction: B* (2.31)
A and A′ are linguistic values defined on the universe of discourse of the antecedent variable x, with membership functions μA(x): X → [0,1], and B and B* are linguistic values on the universe of discourse of the consequent variable y, with membership functions μB(y): Y → [0,1]. "→" denotes the implication relation operator, i.e., fuzzy implication, and each premise is a relation denoted as Ri: A → B, i = 1,…,number of relations. Note that, in this shorthand version of GMP, each premise is assumed to be true and the sets are defined as fuzzy sets. The short form of (2.31) is as follows:

B* = A′ ∘ (A → B) (2.32)
In (2.32), "∘" is the composition operator. Equation (2.32) can further be defined using membership values, as presented by Zadeh [1965], as follows:

μB*(y) = sup_{x∈X} AND(μA′(x), μA→B(x,y)) (2.33)
In (2.32), ∘ is generally taken as a t-norm connective. Therefore, in (2.33) the 'AND' operator is used to represent the "MIN" operation in Zadeh's rule base, and sup is the 'MAX' operator. Hence, this inference is also called max-min inference. Alternative representations of the reasoning in (2.31) to (2.33) are more complicated than the one shown here. Since they are beyond the scope of this work, we will not discuss them any further; readers are referred to [Zadeh, 1965; 1975a; Klir and Yuan, 1995; Turksen, 2006] for further discussion of this topic.
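On finite universes the whole GMP chain (2.31)-(2.33) reduces to a few lines; the fuzzy sets below are illustrative, and the implication relation is built with MIN (a Mamdani-style choice, assumed here for simplicity).

```python
# Illustrative max-min compositional rule of inference, eqs. (2.32)-(2.33).

X, Y = [1, 2, 3], [10, 20]
A       = {1: 1.0, 2: 0.6, 3: 0.2}   # antecedent "X is A"
B       = {10: 1.0, 20: 0.4}         # consequent "Y is B"
A_prime = {1: 0.5, 2: 1.0, 3: 0.3}   # observed fact "X is A'"

# Implication relation R(x, y); MIN is assumed (one of several choices).
R = {(x, y): min(A[x], B[y]) for x in X for y in Y}

# B*(y) = sup_x AND(A'(x), R(x, y)) with AND = MIN, sup = MAX.
B_star = {y: max(min(A_prime[x], R[(x, y)]) for x in X) for y in Y}
print(B_star)   # {10: 0.6, 20: 0.4}
```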
2.4 Fuzzy Relations

Fuzzy relations were first introduced by Zadeh [1965] in his seminal paper on fuzzy sets. They are defined as extensions of ordinary relations and are applied in a wide range of applications, including clustering, pattern recognition, inference, systems, and control. In order to explain the concept of fuzzy relations, crisp relations should be introduced first. A crisp relation defines the association between two or more crisp sets. The definitions of crisp relations are hence confined here to crisp binary
relations, which involve just two sets, without losing generality. Crisp binary relations represent the presence or absence of an association, interaction, or any connection between the elements of two sets. These two sets are considered as a set of pairs, the first of which is called the "domain" and the second the "range" set. The relation between these two sets is expressed through the Cartesian product¹, A×B. Let sets A and B be crisp sets on universes of discourse X and Y, respectively. A crisp relation R can be expressed via a membership function μR(x,y), where x ∈ X and y ∈ Y. The membership function of the crisp relation R can be defined as follows:

μR(x,y): X×Y → {0,1}, ∀x ∈ X, ∀y ∈ Y (2.34)
Namely, an element (x,y) in the Cartesian product space X×Y either is or is not assigned to the relation R; in other words, there are precisely two choices, either 0 or 1. For instance, a crisp binary relation, the equality relation E, can be defined between x and y such that x = y, with x ∈ [x1,x2] and y ∈ [y1,y2]. The equality relation can be denoted as E ⊂ [x1,x2]×[y1,y2]. Figure 2.5 shows a crisp relation R(x,y) on the Cartesian product of X and Y. On the other hand, fuzzy relations are crisp relations with the additional capability of expressing the strength of the relation as well. Consequently, fuzzy relations are fuzzy sets defined on the universal sets.

Fig. 2.5 Crisp binary relation (equality relation, E, between x ∈ [x1,x2] and y ∈ [y1,y2])
Similar to a crisp binary relation, a fuzzy binary relation explains the relationship between two fuzzy sets, such that each element of the domain of a fuzzy set is associated with one or more elements of the range set. It should be noted that a fuzzy relation also defines a degree of association between the elements of the domain and range sets. Hence, let A and B be redefined as two fuzzy sets on the universes of discourse X and Y, respectively. The membership function of a fuzzy relation R, μR(x,y), is a mapping that can be defined as:

μR(x,y): X×Y → [0,1], ∀x ∈ X, ∀y ∈ Y (2.35)

¹ The Cartesian product of two sets A and B is defined to be the set of all points (a, b) where a ∈ A and b ∈ B. It is denoted A×B, and is called the Cartesian product since it originated in Descartes' formulation of analytic geometry. In the Cartesian view, points in the plane are specified by their vertical and horizontal coordinates, with points on a line being specified by just one coordinate. (Definition obtained from www.mathworld.com.)
and the binary fuzzy relation, R, can be defined as:

R = ∫_{X×Y} μR(x,y)/(x,y) (2.36)
Similarly, an example of a fuzzy binary relation can be defined as follows. Let E be the equality relation between variables x and y; a fuzzy relation can capture concepts such as "x is equal to y" or "x is close to y". This concept can be captured by any membership function (in our case a continuous membership function), μR(x,y), defined over x and y. The fuzzy relation R is defined on the Cartesian product, R ⊂ [x1,x2]×[y1,y2], and can be shown graphically, as in Figure 2.6, by means of a "gray" level, where A(x) ∈ [0,1], B(y) ∈ [0,1], C(x,y) ∈ [0,1], and C(x,y) = min(A(x), B(y)). Another way of expressing a fuzzy relation is shown in Figure A.1 of Appendix A.2, Fuzzy Relations. The binary fuzzy relation definition can now be extended to a general n-dimensional fuzzy relation as follows. Let the Cartesian product of universes of discourse in n-dimensional space be given as X1×X2×…×Xn. Then the membership function of an n-dimensional fuzzy relation, R, defined over this space is:

Fig. 2.6 Cartesian product of two fuzzy sets (A and B), shown as gray levels C(x,y)
μR(x1,x2,…,xn): X1×X2×…×Xn → [0,1], ∀xi ∈ Xi, i = 1,…,n (2.37)
and similarly, an n-dimensional relation, R, can alternatively be designated symbolically as follows:

R = ∫_{X1×X2×…×Xn} μR(x1,x2,…,xn)/(x1,x2,…,xn) (2.38)
2.4.1 Operations on Fuzzy Relations

Since fuzzy relations are built upon fuzzy sets, all operations on fuzzy sets (i.e., complements, intersections, unions) are naturally applicable to fuzzy relations as well. However, some operations are particular to fuzzy relations and are not applicable to ordinary fuzzy sets; in this section, those operations, especially the inverse operation and the composition operation, warrant special attention. The inverse operation of a fuzzy binary relation R on X×Y, denoted symbolically as R⁻¹, is the relation on Y×X given by the following equation:

R⁻¹(y,x) = R(x,y) (2.39)
for all pairs 〈y,x〉 ∈ Y×X.

The composition operation of two binary relations R and S, defined over X×Y and Y×Z, respectively, is denoted as R∘S. Since R and S are binary fuzzy relations, there is an indirect link between each x and z, and a matter of degree can be defined accordingly. Therefore, the membership function that characterizes the composition operation R∘S is denoted μR∘S(x,z), and it can be evaluated as in the following equation:

μR∘S(x,z) = sup_{y∈Y} T(μR(x,y), μS(y,z)), ∀x ∈ X, ∀z ∈ Z (2.40)
Similarly, R∘S may also be defined symbolically as follows:

R∘S = ∫_{X×Z} [sup_{y∈Y} T(μR(x,y), μS(y,z))]/(x,z) (2.41)
In (2.41), T represents a particular t-norm. For instance, if the "MIN" t-norm is selected for T, then in (2.41) the membership degree of the chain 〈x,y,z〉 is determined by the degree of the weaker of the two links.
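For finite universes, (2.40) becomes a max over an inner t-norm; a minimal sketch with assumed relation grades:

```python
# Illustrative sup-T composition of two finite fuzzy relations, eq. (2.40),
# with T = MIN, so each chain <x, y, z> is as strong as its weaker link.

R = {"x1": {"y1": 0.7, "y2": 0.2},   # R on X x Y (assumed grades)
     "x2": {"y1": 0.4, "y2": 0.9}}
S = {"y1": {"z1": 0.5, "z2": 1.0},   # S on Y x Z
     "y2": {"z1": 0.8, "z2": 0.3}}

def compose(R, S, T=min):
    zs = {z for row in S.values() for z in row}
    return {x: {z: max(T(R[x][y], S[y][z]) for y in R[x]) for z in zs}
            for x in R}

print(compose(R, S))
# e.g. (R o S)(x1, z1) = max(min(0.7, 0.5), min(0.2, 0.8)) = 0.5
```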
2.4.2 Extension Principle

The extension principle, as introduced by Zadeh [1975a], is one of the most basic ideas of fuzzy set theory. The extension principle helps extend non-fuzzy mathematical concepts to deal with fuzzy quantities or variables. In order to develop computation with fuzzy sets, we need a way to convert crisp functions into fuzzy functions (i.e., fuzzification); the "fuzzifying" of crisp functions is done with the aid of the extension principle. There are other ways to fuzzify certain crisp functions; however, the extension principle is the best-known and most widely used among them. A crisp function f: X → Y is fuzzified when it is extended to act on fuzzy sets defined on X and Y. In other words, the fuzzified function, for which the same symbol f is usually used, has the form f: Ƒ(X) → Ƒ(Y). Before reviewing the extension principle itself, we first review the special case in which the extended functions are restricted to the crisp power sets Ƥ(X) and Ƥ(Y).
Given a crisp function from X to Y, its extended version is a function from P(X) to P(Y) that, for any A ∈ P(X), is defined by

f(A) = {y | y = f(x), x ∈ A} (2.42)

Additionally, the extended version of the inverse of f, denoted f⁻¹, is a function from P(Y) to P(X) that, for any B ∈ P(Y), is defined by

f⁻¹(B) = {x | f(x) ∈ B} (2.43)

When the sets f(A) and f⁻¹(B) are expressed with their respective membership functions, we obtain

[f(A)](y) = sup_{x | y=f(x)} A(x), [f⁻¹(B)](x) = B(f(x)) (2.44)
Allowing the sets A and B in (2.44) to be fuzzy sets, we arrive at the extension principle expressed through membership functions: any given function f: X → Y induces two functions,

f: Ƒ(X) → Ƒ(Y), f⁻¹: Ƒ(Y) → Ƒ(X), (2.45)

which are defined by

∀A ∈ Ƒ(X), [f(A)](y) = sup_{x | y=f(x)} A(x), (2.46)

or, when the domain is multi-dimensional,

∀A ∈ Ƒ(X), [f(A)](z) = sup_{(x1,…,xnv) | z=f(x1,…,xnv)} min(A1(x1),…, Anv(xnv))
If we consider a Cartesian product of universes X = X1×…×Xnv, then the fuzzy mapping is defined between multiple inputs whose fuzzy subsets are Aj(xj), j = 1,…,nv, and the t-norm (here MIN) of their membership functions is taken before the mapping is applied.

Let X be the set of ages of the employees in a company and Y be the salaries of these employees. We would like to find out the salary of a "young employee". It should be noted that there must be a function between X and Y in order to answer this question. Suppose the ages and the corresponding salaries are given as in Table 2.2.

Table 2.2 The AGE and SALARY attributes of employees

Age (in years): 20   25   30   35   40   45   50   55   60   65
Salary (in $K): 2.5  2.5  3.0  3.5  3.5  4.0  4.0  4.5  4.5  5.0
Let the fuzzy set A be the set of young people. Then the fuzzy set A, with its membership values assigned, can be defined as:

A("young employee") = ∑_X A(x)/x = 1/20 + 1/25 + 0.8/30 + 0.6/35 + 0.4/40 + 0.2/45 + 0/50 + 0/55 + 0/60 + 0/65 (2.47)
Using the young employees' fuzzy set A, a fuzzy set B has to be defined to describe young people's salaries. Let the function f define the relationship between the fuzzy sets A and B, which for each x in X assigns a particular y = f(x) in Y, as shown in Table 2.2. There is thus a dependency between fuzzy sets A and B, which can be shown as:

B = ∑_X A(x)/f(x) (2.48)
Therefore, from (2.48), the fuzzy set B can be written as:

B = ∑_X B(y)/y = 1/2.5 + 1/2.5 + 0.8/3 + 0.6/3.5 + 0.4/3.5 + 0.2/4 + 0/4 + 0/4.5 + 0/4.5 + 0/5 (2.49)
It should be noted from (2.49) that some salaries are associated with more than one age (e.g., 3.5 K is assigned to ages 35 and 40, which have membership grades of 0.6 and 0.4, respectively). In accordance with the disjunctive nature of the association of salaries with ages, we take the maximum of the two membership grades, as shown in (2.46). Thus, the final value of fuzzy set B, which defines the salaries of young employees, is:

B = (1/2.5) + (0.8/3) + (0.6/3.5) + (0.2/4) + (0/4.5) + (0/5) (2.50)
The graphical representations of fuzzy set A, which represents the young employees, and fuzzy set B, which represents the salaries of young employees, are given in Figure 2.7 and Figure 2.8.

Fig. 2.7 Fuzzy set of young employees (A(x))

Fig. 2.8 Fuzzy set of salary of young employees (B(y)); horizontal axis: salary (2.5-5.5 $K), vertical axis: membership value
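The salary example can be replayed mechanically: the sketch below (illustrative code mirroring Table 2.2 and (2.47)-(2.50)) extends the crisp map f over the fuzzy set A by taking the supremum over ages that share a salary, as prescribed by (2.46).

```python
# Illustrative extension principle, eq. (2.46), on the Table 2.2 data.

ages     = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
salaries = [2.5, 2.5, 3.0, 3.5, 3.5, 4.0, 4.0, 4.5, 4.5, 5.0]
f = dict(zip(ages, salaries))                                 # crisp map f
A = dict(zip(ages, [1, 1, 0.8, 0.6, 0.4, 0.2, 0, 0, 0, 0]))   # "young", (2.47)

B = {}
for x, mu in A.items():
    y = f[x]
    B[y] = max(B.get(y, 0.0), mu)   # sup over all x with y = f(x)

print(B)  # {2.5: 1, 3.0: 0.8, 3.5: 0.6, 4.0: 0.2, 4.5: 0, 5.0: 0} -> (2.50)
```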
2.5 Type-2 Fuzzy Sets

The concept of type-2 fuzzy sets was introduced by Zadeh [1975a] as an extension of ordinary, type-1 fuzzy sets, almost 10 years after he introduced type-1 fuzzy sets. Zadeh refers to fuzzy sets of type-2 as fuzzy sets whose membership grades are themselves type-1 fuzzy sets on [0,1]. He proposes to use the extension principle in order to perform fuzzy operations, e.g., union, intersection, complementation, etc., on fuzzy sets of type-2. This can be done in two stages: first by defining type-1 fuzzy sets using interval-valued membership functions, and then generalizing from intervals to fuzzy sets using the level-set form of the extension principle. The type-2 fuzzy set properties described by Zadeh [1975a] are summarized in the following parts.

Type-1 fuzzy sets, whose membership grades are crisp values (an example is shown on the left of Figure 2.9), do not provide sufficient support for many kinds of uncertainty that appear in linguistic descriptions of numerical quantities or in subjectively expressed amounts and dimensions. In 1998, Mendel and Karnik [1998] proposed to increase the number of degrees of freedom of fuzzy logic systems. Their argument was that adding at least one higher degree to type-1 fuzzy sets may provide a measure of dispersion for the totally certain membership functions of type-1. Hence, type-2 fuzzy sets capture extensions of type-1 fuzzy sets to a higher degree. Type-2 fuzzy sets have grades of membership that are themselves defined by type-1 membership functions, called secondary membership functions, as shown in the right graph of Figure 2.9. At each value of a given variable, the membership is a function whose domain, the primary membership Jx, lies in the interval [0,1], and whose range, the secondary grades, also falls within [0,1] [Mendel, 2003] (right graph of Figure 2.9 and Figure 2.10). Interval-valued fuzzy sets are a particular form of type-2 fuzzy sets, whose secondary membership functions are uniform functions that only take on the value 1.

Turksen's [1986; 1995; 1989; 1999; 2006] approach is a pioneering representation of interval-valued type-2 fuzzy sets. Turksen has shown that normal forms, e.g., Disjunctive Normal Forms (DNF) and Conjunctive Normal Forms (CNF), which are equal in classical set theory, i.e., DNF ≡ CNF, are not equal in fuzzy set theory. In turn, he has shown that Fuzzy Disjunctive Normal Forms (FDNF) are not equal to Fuzzy Conjunctive Normal Forms (FCNF) in fuzzy set theory, i.e., FDNF ≠ FCNF, and these define interval-valued type-2 fuzzy sets. Mizumoto and Tanaka [1976] worked on the set-theoretic operations on type-2 sets and on the properties of the membership grades of those sets; they also examined the algebraic properties of fuzzy sets of type-2 under the operations of algebraic product and algebraic sum using the extension principle. Dubois and Prade [1979] have also worked on type-2 fuzzy sets and further proposed a formula for the composition of such fuzzy sets as an extension of the type-1 sup-star composition, which holds only for the minimum t-norm. Mendel and Karnik [1998], Karnik et al. [1999], and Mendel and Liang [2000] have further
extended this formula and named it the extended sup-star composition of type-2 fuzzy sets, to be briefly presented in Chapter 5. Karnik et al. [1999] created a complete type-2 fuzzy logic theory, which handles uncertainties in fuzzy logic system parameters. In this respect, in this work we analyze the research of Karnik et al. [1999] on type-2 fuzziness as a separate category among the variations of type-2 fuzzy logic systems in Chapter 5.
Fig. 2.9 Comparison of type-1 (left) and type-2 (right) fuzzy sets for an Age_Person variable. The shaded region on the right graph is the secondary membership function of x.
In the type-1 fuzzy logic approach, membership values are crisp values. The distinction between type-1 and type-2 fuzzy sets is shown in Figure 2.9. Type-2 fuzzy sets define an interval of primary membership values for each object x, and these in turn have membership functions of type-1. In the following, basic definitions of type-2 fuzzy sets will be introduced, based on the definitions of Mizumoto and Tanaka [1976] and Mendel [2001]. Depending on the types of their membership functions, two different kinds of type-2 fuzzy sets have been defined:

(i) Type-2 fuzzy sets, or full type-2 fuzzy sets [Karnik et al., 1999];
(ii) Interval-valued type-2 fuzzy sets [Liang and Mendel, 2000; Turksen, 1986].
2.5.1 Type-2 Fuzzy Sets

A type-2 fuzzy set, denoted by Ã, is a fuzzy set which is characterized by a type-2 membership function µÃ:

µÃ: X → [0,1] (2.51)

Mendel [2001] defines a type-2 fuzzy set Ã, characterized by a type-2 membership function µÃ(x,u), as follows:
Ã = {((x,u), µÃ(x,u)) | ∀x ∈ X, ∀u ∈ Jx ⊆ [0,1]} (2.52)
where 0 ≤ µÃ(x,u) ≤ 1, Jx is the primary membership in the interval [0,1], and u are the primary membership values, as shown in Figure 2.10. Ã can also be expressed as
Ã = ∫_{x∈X} ∫_{u∈Jx} µÃ(x,u)/(x,u) = ∫_{x∈X} ∫_{u∈Jx} fx(u)/(x,u) (2.53)
where ∫∫ indicates the union over all admissible x and u, and fx(u) is the secondary membership function. For a discrete universe of discourse, ∫ is replaced by ∑ as follows:

Ã = ∑_{x∈X} ∑_{u∈Jx} µÃ(x,u)/(x,u) = ∑_{x∈X} ∑_{u∈Jx} fx(u)/(x,u) (2.54)
Fig. 2.10 Example of a full type-2 membership function. The shaded area is the 'Footprint of Uncertainty' (FOU); the amplitudes of the sticks are the secondary membership values. This figure is adapted and modified from Mendel [2003].
A secondary membership function is a vertical slice of µÃ(x,u), that is, a function that maps membership values of the universe of discourse X onto the unit interval [0,1]. Type-2 fuzzy sets can be interpreted as the union of secondary membership functions fx(u). Thus, the type-2 membership function can be represented through the secondary membership functions, for continuous domains by

μÃ(x,u) = ∫_{u∈Jx} fx(u)/u (2.55)

and for a discrete universe of discourse by

μÃ(x,u) = ∑_{u∈Jx} fx(u)/u (2.56)
In basic terms, the mapping of the type-2 membership function onto its secondary membership functions can be reinterpreted as:

μÃ(x): X → fx(u)/u, u ∈ Jx, Jx ⊆ [0,1] (2.57)
Equation (2.57) states that at each x there is a membership value assigned to u for each fuzzy set, with a secondary membership function fx(u); u is the domain of the secondary membership function and is called the primary membership of x, where Jx ⊆ [0,1] and ∀x ∈ X.
2.5.2 Interval Valued Type-2 Fuzzy Sets

Although different representations exist, the interval-valued type-2 fuzzy sets derived by Karnik, Mendel, and Liang [1999] are the commonly preferred form, because full type-2 fuzzy sets are computationally complex when the number of variables is large. Interval-valued type-2 fuzzy sets are a type of type-2 fuzzy sets whose membership function is defined by

μÃ(x): X → 1/u, u ∈ Jx, Jx ⊆ [0,1] (2.58)
which satisfies the following condition:

fx(u) = 1, ∀x ∈ X, u ∈ Jx, Jx ⊆ [0,1] (2.59)
From (2.59) one can observe that, at each value of x, all the secondary membership values are equal to 1, as shown in Figure 2.11.
Fig. 2.11 Interval Valued Type-2 Fuzzy Set
However, there may be more than one primary membership value at any value of x. The membership functions that form the boundary values are defined as the lower and upper membership functions. Using (2.59), the lower and upper membership values at each value of x are defined by:
μA^L(x) = min_{u∈Jx}(u), ∀x ∈ X (2.60)

μA^U(x) = max_{u∈Jx}(u), ∀x ∈ X (2.61)
in which μA^L(x) represents the lower membership function and μA^U(x) the upper membership function. Consequently, an interval-valued type-2 fuzzy set is denoted by

μÃ(x): X → 1/u, u ∈ [μÃ^L(x), μÃ^U(x)] (2.62)
2.5.3 Type-2 Fuzzy Set Operations

Type-2 fuzzy set operations were first introduced in Zadeh's paper [1975a]. In [Mendel, 2001], mathematical definitions of type-2 fuzzy sets and operations are discussed in detail. This section presents the basic operations necessary to conduct type-2 fuzzy logic systems, which are reviewed in Chapter 5. Let two type-2 fuzzy sets, Ã and B̃, have membership functions μÃ(x): X → fx(u)/u, u ∈ Jx^A, Jx^A ⊆ [0,1], and μB̃(x): X → gx(w)/w, w ∈ Jx^B, Jx^B ⊆ [0,1].
Intersection of type-2 fuzzy sets (MEET operator): The membership function μÃ∩B̃(x), obtained from the application of the intersection operator to two type-2 fuzzy sets Ã and B̃, is denoted as follows:

μÃ∩B̃(x) = ∫u ∫w min(fx(u), gx(w))/T(u,w), ∀x ∈ X (2.63)

In (2.63), T is the t-norm operator.

Union of type-2 fuzzy sets (JOIN operator): The membership function μÃ∪B̃(x), obtained from the application of the union operator to two type-2 fuzzy sets Ã and B̃, is denoted as follows:

μÃ∪B̃(x) = ∫u ∫w min(fx(u), gx(w))/S(u,w), ∀x ∈ X (2.64)

In (2.64), S is the t-conorm operator.

Complement of type-2 fuzzy sets: The complement of a type-2 fuzzy set Ã is denoted by Ā, whose membership function is μĀ(x). The complement is computed as follows:

μĀ(x) = ∫u fx(u)/(1 − u), ∀x ∈ X (2.65)
It should also be noted that, due to the computational complexity of operations with full type-2 fuzzy sets, interval type-2 fuzzy sets have been widely preferred. Let Ã and B̃ denote two interval-valued type-2 fuzzy sets with membership functions μÃ(x): X → 1/u, u ∈ [μA^L(x), μA^U(x)], where μA^L(x), μA^U(x), μB^L(x), and μB^U(x) represent the lower and upper membership functions, respectively. The operators on interval-valued type-2 fuzzy sets are briefly summarized next.

Intersection of interval-valued type-2 fuzzy sets: The membership function μÃ∩B̃(x), obtained from the application of the intersection operator to two interval-valued type-2 fuzzy sets Ã and B̃, is denoted by:

μÃ∩B̃(x): X → 1/u, u ∈ [μA∩B^L(x), μA∩B^U(x)] (2.66)

In (2.66), the lower membership function is defined as:

μA∩B^L(x) = T(μA^L(x), μB^L(x)), (2.67)

and the upper membership function is defined as:

μA∩B^U(x) = T(μA^U(x), μB^U(x)) (2.68)

Union of interval-valued type-2 fuzzy sets: The membership function μÃ∪B̃(x), obtained from the application of the union operator to two interval-valued type-2 fuzzy sets Ã and B̃, is denoted by:

μÃ∪B̃(x): X → 1/u, u ∈ [μA∪B^L(x), μA∪B^U(x)] (2.69)

In (2.69), the lower membership function is defined by:

μA∪B^L(x) = C(μA^L(x), μB^L(x)) (2.70)

where C represents the t-conorm operation, and the upper membership function is defined by:

μA∪B^U(x) = C(μA^U(x), μB^U(x)) (2.71)

Complement of interval type-2 fuzzy sets: The complement of an interval type-2 fuzzy set Ã is denoted by Ā, whose membership function is μĀ(x). The complement is computed as follows:

μĀ(x): X → 1/u, u ∈ [μĀ^L(x), μĀ^U(x)] (2.72)

In (2.72), the lower membership function is expressed as:

μĀ^L(x) = 1 − μA^U(x), (2.73)

and the upper membership function is expressed as:

μĀ^U(x) = 1 − μA^L(x) (2.74)
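Because the secondary grades of an interval type-2 set are identically 1, at a fixed x all of (2.66)-(2.74) reduce to interval arithmetic on the pair of bounds; a minimal sketch follows, with min/max assumed for T and C.

```python
# Illustrative interval type-2 operations at a fixed x, where a membership
# grade is the interval [lower, upper]; min/max assumed for T and C.

def meet(a, b, T=min):                   # eqs. (2.67)-(2.68)
    return (T(a[0], b[0]), T(a[1], b[1]))

def join(a, b, C=max):                   # eqs. (2.70)-(2.71)
    return (C(a[0], b[0]), C(a[1], b[1]))

def complement(a):                       # eqs. (2.73)-(2.74)
    return (1.0 - a[1], 1.0 - a[0])      # bounds swap so lower <= upper

A_at_x = (0.2, 0.7)   # assumed [lower, upper] grade of x in A~
B_at_x = (0.5, 0.9)

print(meet(A_at_x, B_at_x))     # (0.2, 0.7)
print(join(A_at_x, B_at_x))     # (0.5, 0.9)
print(complement(A_at_x))       # (0.3, 0.8)
```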
2.6 Fuzzy Functions

"Fuzzy Functions" has been used by researchers to denote different things. One of the many uses of the term refers to membership functions. It is impossible to classify
the researchers who use "Fuzzy Functions" to denote membership functions, since most fuzzy theory researchers use the terms interchangeably. However, there are other definitions of "Fuzzy Functions" in the literature. The building blocks of fuzzy set theory were proposed by Zadeh [1965], especially the fuzzy extensions of classical basic notions such as logical connectives, quantifiers, inference rules, relations, arithmetic operations, etc.; these constitute the initial definitions of fuzzy functions. Marinos [1969] introduced well-known conventional switching theory techniques into the design of fuzzy logic systems, based on the fuzzy set theory and fuzzy operations of Zadeh [1965]. Marinos developed an algebra for fuzzy sets in which membership functions are interpreted as fuzzy numbers. Generally speaking, processes with fuzzy attributes are represented with fuzzy logic functions; hence Marinos's paper is one of the first examples of the application of fuzzy inference mechanisms to real-world engineering problems based on multi-valued fuzzy functions. Later, Siy and Chen [1972] and Demirci [1999] explored and presented many different arithmetic operators on complex fuzzy functions. These studies on fuzzy functions are the conceptual origin of the fuzzy functions proposed in this work; moreover, they defined the mathematical foundations of "Fuzzy Functions".

The following is the basic definition of fuzzy functions with respect to fuzzy sets [Marinos, 1969]. Let X and Y be two fuzzy sets and x and y be the membership grades of an "object" with respect to sets X and Y, respectively.

Definition 1. Two fuzzy sets X and Y are equal (X = Y) if, and only if, for every object i, its membership grade xi in X, i.e., μ(xi), is equal to its membership grade yi in Y, i.e., μ(yi).

Definition 2. A fuzzy set is the complement of another fuzzy set X, denoted X′, if, and only if, for every object i, its membership grade xi′ in X′ is equal to (1 − xi), where xi is the membership grade of object i in X.

Definition 3. A fuzzy set X is contained in another fuzzy set Y if, and only if, for every object i, xi ≤ yi.

Definition 4. Two fuzzy sets X and Y form a union, denoted by Z = X + Y, if, and only if, for every object i, zi = max(xi, yi).

Definition 5. Two fuzzy sets X and Y form an intersection, denoted by Z = X⋅Y, if, and only if, for every object i, zi = min(xi, yi).
Based on the above definitions, an algebra for fuzzy sets is developed analogous to the Boolean algebra of two-valued logic. In the sequel, the term membership grade is replaced by the more appropriate term "fuzzy variable". A sample fuzzy function can be defined as follows:

F(x,y) = x⋅y′ + x′⋅y (2.75)

which implies that:

f(x,y) = max[min(x, 1−y), min(1−x, y)] (2.76)
In other words, (2.76) is a function defined between two fuzzy sets identified by fuzzy numbers, i.e., membership values, and the membership values of this function are determined by the (⋅) and (+) operators applied to the membership values of its objects. Arithmetic operations on the latter type of fuzzy function become complicated as the number of variables and the number of operations increase. Additionally, there is an infinite number of possible assignments of membership grades, which may further increase the complexity. Kandel [1974; 1977] worked on the minimization of fuzzy functions to build simplified reasoning algorithms with fuzzy logic functions. Later, in 1981, Ziwei examined the properties of fuzzy switching functions of n variables. It should be pointed out that these fuzzy functions are represented only via membership values. Fuzzy functions of this type are used in the proposed improved fuzzy clustering algorithm and denoted "Interim Fuzzy Functions", to be presented in Chapter 3.

"Fuzzy Functions" have also been used to refer to fuzzy rules in fuzzy rule base systems, particularly the Takagi-Sugeno fuzzy inference systems, where the consequents are linear or non-linear combinations, or functions, of the input variables and the output variables. These systems will be briefly described in the next section. In these systems, each local model, i.e., fuzzy rule, is identified with a separate function. Different approximators, i.e., fuzzy function approximators, are used to identify fuzzy functions, such as linear regression functions [Takagi and Sugeno, 1985; Sugeno and Kang, 1988], multi-layered neural networks² [Jang, 1993; Abe, 1999; Kasabov and Song, 2002], or genetic algorithms [Cordon et al., 2001, 2004; Camaro et al., 2004]. This application of "Fuzzy Functions" is the closest to the "Fuzzy Functions" strategies presented in this work.

In 2005, Turksen introduced a novel representation and reasoning with "Fuzzy Functions". These "Fuzzy Functions" are multi-variable crisp-valued functions. The prominent feature of these functions, f(X,µ), is that they use the degree of membership, µ, of each object in a specified fuzzy set as an additional attribute, just like the rest of the input variables X. In a sense, the gradations (membership values) become predictors. This type of "Fuzzy Function" emerged from the idea of representing each unique fuzzy rule in terms of a function. One of the aims of formulating this type of "Fuzzy Function" is that they require fewer fuzzy operators, needing only the knowledge of how the fuzzy sets of the given system and the formulation of the "Fuzzy Functions" are determined. Any function approximation method, such as least squares or a neural network tool, can be used to find the parameters of these "Fuzzy Functions". Empirical applications [Turksen and Celikyilmaz, 2006] of system modeling with "Fuzzy Functions" using simple linear regression methods have shown promising results as compared to conventional fuzzy rule base systems. Later, these functions were extended using machine learning algorithms, e.g., support vector machines [Celikyilmaz, 2005; Turksen and Celikyilmaz, 2006], in performance improvement analyses. This work introduces a novel design of "Fuzzy Functions" and presents new improvements and algorithms to explain the uncertainties in fuzzy system models in greater detail.

² A brief background of Artificial Neural Network methods is presented in Appendix C.1.
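The core idea of these "Fuzzy Functions" can be sketched in a few lines: the membership values join the inputs as an extra regressor and one least-squares function is fitted per cluster. The sketch below is only a conceptual illustration under assumed toy memberships (standing in for the fuzzy clustering of Chapter 3), not the book's full algorithm.

```python
# Conceptual sketch of "Fuzzy Functions" f(X, mu): memberships become an
# extra predictor column and one local least-squares model is fitted per
# cluster. Memberships are assumed toy values, not an actual FCM run.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 2.0 * x + 5.0 + rng.normal(0, 0.5, 40)           # synthetic data

# Toy memberships for two overlapping clusters (would come from FCM).
mu = np.clip(x / 10.0, 0.01, 0.99)
memberships = np.column_stack([1.0 - mu, mu])        # one column per cluster

models = []
for i in range(memberships.shape[1]):
    m = memberships[:, i]
    # Augmented design matrix [1, x, mu_i]: the gradation is a predictor.
    Phi = np.column_stack([np.ones_like(x), x, m])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    models.append(w)

# Inference: each local function predicts; outputs fused by membership.
x0, m0 = 4.0, np.array([0.6, 0.4])                   # assumed test point
y0 = sum(m0[i] * np.array([1.0, x0, m0[i]]) @ models[i] for i in range(2))
print(y0 / m0.sum())
```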
2.7 Fuzzy Systems

Based on the type of fuzzy sets used, i.e., type-1 or type-2, conventional fuzzy systems are denoted as type-1 or type-2 fuzzy systems accordingly. Here, we define conventional type-1 fuzzy system models based on fuzzy rule bases, together with their well-known extensions that are used as benchmarking tools in the chapter on experiments (Chapter 6). Type-2 extensions will be briefly presented later, in Chapter 5, entitled Modeling Uncertainty with Improved Fuzzy Functions, where a novel type-2 fuzzy system modeling approach based on fuzzy functions is also presented. Fuzzy system models define relationships between the input and output variables of a system by using linguistic labels (i.e., fuzzy sets) in a collection of IF-THEN rules. Among the several fuzzy system model structures, those proposed by Zadeh [1965], Takagi-Sugeno [1985], and Mizumoto [1989] are the renowned and most commonly used approaches. Over the years, many variations of these fundamental fuzzy systems have been proposed. In the next subsections, the renowned fuzzy rule base structures proposed by Zadeh, Takagi-Sugeno, and Mizumoto are briefly reviewed.

Zadeh Fuzzy Rule Base Structure
Zadeh's [1965] fuzzy rule base structure is formulated as:

R: ALSO_{i=1}^{c} [ IF AND_{j=1}^{nv} (xj ∈ Xj isr Aji) THEN y ∈ Y isr Bi ] (2.77)
In (2.77):

• c is the number of rules in the system model,
• xj is the jth input variable, j = 1,…,nv,
• nv is the total number of input variables,
• Xj is the domain of xj,
• Aji is the linguistic label associated with input variable xj in rule i, with membership function μAji(xj): Xj → [0,1],
• y is the output variable,
• Y is the domain of y,
• Bi is the linguistic label associated with the output variable in the ith rule, with membership function μBi(y): Y → [0,1],
• AND is the logical connective used to aggregate the membership values of the input variables for a given observation in order to find the degree of fire of each rule,
• THEN (→, ⇒) is the logical IMPLICATION connective,
• ALSO is the logical connective used to aggregate the model outputs of the fuzzy rules,
• 'isr' was introduced by Zadeh and indicates that the definition or assignment is not crisp, but fuzzy.

The inference parameters one should identify before using this fuzzy logic structure identification method are: c, the types of the membership functions of each input variable, the membership function of the output variable, and the types of operators for AND, THEN, and ALSO.

Mizumoto Fuzzy Rule Base Structure
The Mizumoto fuzzy rule base structure [Mizumoto, 1989] is a simplified version of the Zadeh fuzzy rule base structure, where the consequent of each rule is represented by a scalar, bi, instead of a fuzzy set Bi. These types of inference structures are defined as follows:

R: ALSO_{i=1}^{c} [ IF AND_{j=1}^{nv} (xj ∈ Xj isr Aji) THEN yi = bi ] (2.78)
In (2.78), bi is the scalar associated with the ith rule. The inference parameters to be determined when using the Mizumoto fuzzy inference system are c, the membership function of each input variable in each rule, the scalars bi, i = 1,…,c, associated with each rule, and the AND, THEN, and ALSO connectives. The generalized fuzzy inference structure is explained as follows. Let an input vector x′ = (x′1, x′2,…, x′nv) represent a testing vector whose output value is unknown. Using a fuzzy rule base structure and empirical data (xk, yk) = (xk,1, xk,2,…, xk,nv, yk), k = 1,…,n, we would like to apply fuzzy inference methods and obtain an approximate crisp output value for x′. Let Xj be the domain of xj, j = 1,…,nv, let Aij represent the fuzzy set associated with input variable xj in rule i, denoted by the membership function μi(xj): Xj → [0,1], let c be the number of rules in the fuzzy rule base, and let nv be the number of attributes/input dimensions of the defined system. Based on the general structure of fuzzy rule bases shown in Figure 2.12, five common steps are followed:

FUZZIFICATION: Fuzzification assigns a membership value to each new data vector, x′, for each fuzzy set Aij in each rule i, represented by a membership value μi(x′j), ∀i = 1,…,c, ∀j = 1,…,nv.

AGGREGATION OF ANTECEDENTS: The membership values measured in the fuzzification step are aggregated to obtain a single degree of fire, viz. membership value. The aggregation operation is processed using fuzzy operators, the most common of which are t-norms using the MIN connective, denoted by ∧. Hence, the degree of fire of rule i for the new input vector, x′, can be calculated as follows:
τi(x′) = AND_{j=1}^{nv} μi(x′j), ∀i = 1,…,c (2.79)
IMPLICATION (identification of model output fuzzy sets): As much as the degree of fire represents a measure of compatibility between the new data object and the antecedents, it also defines the level of contribution of the output fuzzy set of each rule in identifying the final model output fuzzy set. In the Mamdani-type inference method [Mamdani and Assilian, 1974], the AND connective is used as the IMPLICATION connective, where the MIN operation is used to obtain the membership function of the model output fuzzy set of rule i for a given observation, x′, as follows:

μ′i(y) = τi(x′) AND μi(y) = τi(x′) ∧ μi(y), ∀i = 1,…,c (2.80)
AGGREGATION OF CONSEQUENTS: One fuzzy set per rule is identified for the model output in the IMPLICATION step. Here the ALSO operator is used to aggregate those output fuzzy sets into one single fuzzy output set. In the Mamdani fuzzy inference schema, ALSO is taken as the OR operator, mainly realized by the MAX operator, and is calculated as follows:

μ′(y) = ∨_{i=1}^{c} μ′i(y) (2.81)
DEFUZZIFICATION: The last step of the fuzzy inference method is to obtain a crisp output from the type-1 fuzzy inference. Here we use the most common defuzzification method, the Center of Gravity (COG), given by

y′ = ∑_y y·μ′(y) / ∑_y μ′(y) (2.82)
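The five steps can be put end to end for a toy two-rule Mamdani system; the sketch below is illustrative (one input, assumed triangular shapes, discretized output domain), not a general implementation.

```python
# Illustrative Mamdani pipeline: fuzzify, aggregate antecedents, implicate
# (2.80), aggregate consequents (2.81), and COG-defuzzify (2.82).

def tri(x, a, b, c):
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

Y = [i / 10.0 for i in range(0, 101)]          # discretized output domain
rules = [                                      # (antecedent MF, consequent MF)
    (lambda x: tri(x, 0, 2, 5), lambda y: tri(y, 0, 2, 4)),
    (lambda x: tri(x, 3, 6, 9), lambda y: tri(y, 5, 7, 9)),
]

x_new = 4.0
fired = []
for ant, con in rules:
    tau = ant(x_new)                            # fuzzification (single input,
                                                # so no AND-aggregation needed)
    fired.append([min(tau, con(y)) for y in Y]) # implication, eq. (2.80)

mu_out = [max(vals) for vals in zip(*fired)]    # ALSO = MAX, eq. (2.81)
y_crisp = sum(y * m for y, m in zip(Y, mu_out)) / sum(mu_out)  # COG (2.82)
print(round(y_crisp, 3))
```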
Takagi-Sugeno-Kang Fuzzy Rule Base Structure
One of the most commonly implemented and investigated fuzzy inference systems is the Takagi-Sugeno-Kang (TSK) fuzzy rule base system [Takagi and Sugeno, 1985; Sugeno and Kang, 1986], which differs only slightly from Zadeh's fuzzy rule base. The only difference is in the consequent part of the rule base structure: Zadeh used fuzzy sets to represent both the antecedent and consequent parts (arguments of the fuzzy variable) of fuzzy rules, whereas in the TSK fuzzy rule base the antecedent part of a rule is characterized by an aggregation operator and the consequent is characterized by a regression line. The TSK fuzzy rule base structure is given as follows:

R: ALSO_{i=1}^{c} IF AND_{j=1}^{nv} (xj ∈ Xj isr Aji) THEN yi = ai·xᵀ + bi (2.83)
In (2.83),
• ai and bi are the regression line coefficients associated with the ith rule,
• yi is the model output of the ith rule, Ri,
• THEN is the connective that weights yi for each rule, using the corresponding degree of firing of a given observation, in order to find the model output of each rule,
• ALSO is the connective that takes the weighted average of the model outputs of the rules in order to aggregate the model outputs of the fuzzy rules.
The inference parameters of the TSK fuzzy rule base method are c, the membership function of each input variable in each rule, the regression line coefficients of each rule, and the AND, THEN, and ALSO connectives used in the inference method. Although the consequent part of the TSK fuzzy rule base in (2.83) is a linear function, any non-linear polynomial function of higher order can be implemented. As mentioned in the earlier chapters, "Fuzzy Functions" have also appeared in the literature as approximators for an argument of fuzzy rules, usually the consequent. The function can be linear or non-linear. A well-known example of such systems is the Takagi-Sugeno-Kang (TSK) [Takagi and Sugeno, 1985; Sugeno and Kang, 1986] type fuzzy inference system, where the consequent parts of fuzzy rules are represented with functions between the input and output variables. The well-known fuzzy system modeling tools that are used as benchmarking methodologies in this work implement Takagi-Sugeno type fuzzy rule bases. Before we review these systems, the structure of conventional Takagi-Sugeno type fuzzy inference systems will be explained next. In a fuzzy rule base structure each antecedent is assumed to be non-interactive; therefore, separate fuzzy sets are identified for each antecedent input variable. The generalized structure of fuzzy inference systems, or fuzzy rule base systems, is shown in Figure 2.12 (fuzzification and aggregation of antecedents to identify the degree of fire; implication; aggregation of the deduced output fuzzy sets; defuzzification of the fuzzy output sets into a crisp output).

Fig. 2.12 Generalized Type 1 Fuzzy Inference Systems (Fuzzy Logic System)
Fuzzy inference is the mathematical procedure to deduce the model output from a given set of fuzzy rules. One of the first real-life applications of the fuzzy rule base structure to control systems was carried out by Mamdani and Assilian [1974], who used a fuzzy rule base to control a laboratory steam engine. Today, there are many different applications of fuzzy inference systems, including financial, health-care, robotics, web data mining, and many more. In the following, the Takagi-Sugeno-Kang type fuzzy inference system structure will be explained in detail as an example.
Using a fuzzy rule base structure and empirical data (xk, yk) = (xk,1, xk,2,…, xk,nv, yk), k=1,…,n, we would like to apply fuzzy inference methods and obtain an approximate crisp output value for a new input vector x′. Let Xj be the domain of xj, j=1,…,nv, let Aij represent the fuzzy set associated with input variable xj in rule i, denoted with the membership function μi(xj): Xj → [0,1], where c is the number of rules in the fuzzy rule base system and nv is the number of attributes/input dimensions in the defined system.

FUZZIFICATION: A membership value is assigned to the new data vector, x′, in each fuzzy set Aij of each rule i, represented with a membership value μi(x′j), ∀i=1,…,c, ∀j=1,…,nv.

AGGREGATION OF ANTECEDENTS: The membership values measured in the fuzzification step are aggregated to obtain a single degree of fire, viz. a membership value. The aggregation operation is processed using fuzzy operators, the most common of which are t-norms using the MIN connective, denoted with ∧. Hence, the degree of fire of rule i for the new input vector, x′, is calculated as follows:
μi(x′) = AND_{j=1}^{nv} μi(x′j), ∀i = 1,…,c    (2.84)
IMPLICATION: This step depends on how the consequent part of the fuzzy rule base structure is defined. For TSK type fuzzy rule bases, this step is merely used to weight the scalar model output of each rule. Thus, the model output of the ith rule, yi in (2.83), is weighted as follows:
yi* = μi(x′) × yi, ∀i = 1,…,c    (2.85)
AGGREGATE THE MODEL OUTPUT: This step also depends on the consequent structure of the rule base. In TSK fuzzy rule base structures, the model output of each rule is aggregated by taking the weighted average of the scalar output of each rule in the fuzzy rule base as follows:
y* = ( Σ_{i=1}^{c} yi* ) / ( Σ_{i=1}^{c} μi(x′) )    (2.86)
DEFUZZIFICATION: Defuzzification is used to reduce a fuzzy set to a crisp value in the universe of discourse of the output variable. This step is required for fuzzy inference systems that define fuzzy sets in the consequent parts of the rules. Since the consequent parts of TSK rule bases are identified with linear functions from which scalar values are obtained, as in (2.83), the defuzzification step is not required for TSK fuzzy systems.
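The TSK inference steps (2.84)-(2.86) reduce to a few lines of linear algebra. The minimal sketch below assumes Gaussian antecedent membership functions and hypothetical rule parameters (centers, spreads, and the coefficients ai, bi of (2.83)); it illustrates the computation itself, not any particular tool.

import numpy as np

def gauss(x, center, sigma):
    # Gaussian membership function.
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Hypothetical 2-input, 2-rule TSK model
centers = np.array([[2.0, 3.0], [6.0, 7.0]])   # one row per rule
sigmas  = np.array([[1.5, 1.5], [2.0, 2.0]])
A       = np.array([[0.5, -0.2], [1.0, 0.3]])  # a_i in (2.83)
b       = np.array([1.0, -2.0])                # b_i in (2.83)

x_new = np.array([4.0, 5.0])

mu = gauss(x_new, centers, sigmas).min(axis=1)  # (2.84): MIN over inputs
y_rule = A @ x_new + b                          # (2.83): rule outputs y_i
y_star = np.sum(mu * y_rule) / np.sum(mu)       # (2.85)/(2.86): weighted average
print(round(y_star, 3))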
2.8 Extensions of Takagi-Sugeno Fuzzy Inference Systems

Classical system modeling based on two-valued logic, i.e., x ∈ {0,1}, has long since shifted to more sophisticated models, i.e., soft computing methods, where
boundaries of decisions are more relaxed to explain concepts such as uncertainty and imprecision. Soft computing is an association of computing methods making good use of fuzzy logic, neural networks, rough sets, genetic algorithms, etc. In the last few decades, various synergetic combinations of soft computing methods into hybrid models, such as neuro-fuzzy inference systems or genetic fuzzy systems [Cordon et al., 2001; Deboeck, 1992; Tettamanzi and Tomassini, 2001; Roland, 2003; Wang et al., 2005; Oh and Pedrycz, 2006; Celikyilmaz and Turksen, 2007g], have been proposed to improve system modeling performance in different respects. Many benefits can be cited, such as increased accuracy, decreased complexity, or shortened computation time. In this section we briefly review the structure identification methods of extensions of well-known hybrid fuzzy inference systems, in particular the adaptive network fuzzy inference system (ANFIS), the dynamically evolving neural fuzzy inference system (DENFIS), and genetic fuzzy systems (GFS). These hybrid fuzzy system modeling tools, based on conventional fuzzy rule base structures, are used as benchmarking tools in experiments to evaluate the performance of the presented hybrid fuzzy system modeling tool based on the novel "Fuzzy Functions" strategies, to be discussed in Chapter 4.
2.8.1 Adaptive-Network-Based Fuzzy Inference System (ANFIS)

In 1993, Jang [Jang, 1993] proposed the Adaptive-Network-Based Fuzzy Inference System (ANFIS), which holds a special place in the fuzzy learning community because it has been one of the most commonly used and investigated fuzzy learning algorithms of the last decades. ANFIS is a neuro-fuzzy technique that brings the learning capabilities of neural networks to fuzzy inference systems. In ANFIS, the most commonly implemented fuzzy inference method is the Takagi-Sugeno-Kang (TSK) Fuzzy Inference System (FIS) [Takagi and Sugeno, 1985; Sugeno and Kang, 1986]. The parameters of ANFIS are optimized in an adaptive neural network framework. An adaptive network is a network structure consisting of nodes and directional links through which the nodes are connected. The adaptive network forces the output of each node to depend on the parameter(s) pertaining to the corresponding node, and the learning algorithm indicates how these parameters should be changed to minimize a defined error measure. Let us assume (for simplicity) that the fuzzy system under consideration has two inputs, x1 and x2, one output variable, y, and only two fuzzy rules. The common structure of ANFIS is the TSK type fuzzy inference system [Takagi and Sugeno, 1985], which is defined as follows:

Rule 1: IF x1 is A1 and x2 is B1 THEN y = p1x1 + q1x2 + r1
Rule 2: IF x1 is A2 and x2 is B2 THEN y = p2x1 + q2x2 + r2
The architecture of the FIS defined above and its corresponding ANFIS structure are displayed in Figure 2.13 and Figure 2.14, respectively.

Fig. 2.13 TSK Fuzzy Reasoning: f1 = p1x1 + q1x2 + r1, f2 = p2x1 + q2x2 + r2, f = (w1f1 + w2f2)/(w1 + w2) = w̄1f1 + w̄2f2
Fig. 2.14 A Sample ANFIS Structure (layers 0 to 5: inputs; fuzzification; aggregation of antecedents; normalization of degrees of fire; implication; aggregation of consequents)

Next, we briefly explain each layer of ANFIS as shown in Figure 2.14.

Layer 0: This layer corresponds to the individual input variables in the dataset, e.g., x1, x2.

Layer 1 (Fuzzification): Each node in this layer (denoted with a square) represents a node function, where Ai and Bi are linguistic labels associated with a node function, viz., a membership function μAi(x). In ANFIS, usually bell-shaped or Gaussian membership functions are utilized, as follows:
μAi(x) = 1 / ( 1 + [ ((x − ci)/ai)² ]^{bi} )    (2.87)
The parameter set {ai, bi, ci} in (2.87) represents the premise parameters.

Layer 2 (Aggregation of Antecedents): The circle nodes represented with ∏ in Figure 2.14 denote a fuzzy operator, e.g., the product t-norm, which multiplies the incoming signals, such as
wi = μAi(x1) ∧ μBi(x2), i = 1, 2    (2.88)
and sends the output to the next layer.

Layer 3 (Normalization of Degrees of Fire): The nodes in this layer, denoted with ⊗, calculate the ratio of the ith rule's firing strength to the sum of the firing strengths of all rules by
w̄i = wi / (w1 + w2)    (2.89)
Layer 4 (Implication): The nodes in this layer are denoted with a rectangle and they calculate the weighted output of each linear function as follows:
w̄i fi = w̄i (pi x1 + qi x2 + ri)    (2.90)
The parameter set {pi, qi, ri} in (2.90) will be denoted as the consequent parameters.

Layer 5 (Aggregation of Consequents): The single node, denoted with ∑, computes the overall output as follows:
ŷ = Σi w̄i fi = ( Σi wi fi ) / ( Σi wi )    (2.91)
In (2.91), ŷ is the estimated overall output value. Since ANFIS implements an adaptive learning algorithm, it uses either back-propagation or a combination of least squares estimation and back-propagation for parameter estimation. Here, we describe the learning rules and their parameters as implemented in MATLAB's ANFIS. The adaptive network structure maps inputs through input membership functions and associated parameters, and then onto outputs through output membership functions and associated parameters. The parameters associated with the membership functions change through the learning process. The computation of these parameters (or their adjustment) is facilitated by a gradient vector, which provides a measure of how well the fuzzy inference system models the input/output data for a given set of parameters. Once the gradient vector is obtained, any of several optimization routines can be applied to adjust the parameters so as to reduce some error measure, usually defined as the sum of the squared differences between actual and desired outputs. ANFIS in MATLAB implements two methods to identify the inference parameters, namely grid partitioning and subtractive clustering. Grid partitioning is a subjective approach, in which the number of clusters into which each variable is segmented is initially provided by the user. The number of rules is determined by multiplying the numbers of clusters of the input variables. This introduces a dimensionality problem: as the number of variables increases, the number of rules increases as well. Subtractive clustering, proposed by Chiu [1994], is much more intuitive and does not suffer from dimensionality issues. In it, the rule extraction method first determines the number of rules and the antecedent membership functions, and then uses linear least squares estimation to determine each rule's consequent equations. The subtractive clustering method of ANFIS identifies the inference parameters of the TSK fuzzy inference system structure. It should be noted that, in the subtractive clustering ANFIS method, one fuzzy set is identified for each input variable in each rule.
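As a reading aid, the following sketch traces one forward pass through the five ANFIS layers for the two-rule example above, using the bell membership function of (2.87) and the product t-norm of (2.88). All premise and consequent parameter values below are hypothetical, and no learning (back-propagation or least squares estimation) is performed.

import numpy as np

def bell(x, a, b, c):
    # Generalized bell membership function of (2.87).
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

# Hypothetical premise parameters {a, b, c} for A1, A2 (on x1) and
# B1, B2 (on x2), and consequent parameters {p, q, r}.
prem_x1 = [(2.0, 2.0, 1.0), (2.0, 2.0, 5.0)]   # A1, A2
prem_x2 = [(2.0, 2.0, 2.0), (2.0, 2.0, 6.0)]   # B1, B2
conseq  = [(0.5, 0.3, 1.0), (1.2, -0.4, 0.0)]  # (p_i, q_i, r_i)

def anfis_forward(x1, x2):
    # Layer 1: fuzzification with bell MFs
    muA = [bell(x1, *p) for p in prem_x1]
    muB = [bell(x2, *p) for p in prem_x2]
    # Layer 2: product t-norm, (2.88)
    w = np.array([muA[0] * muB[0], muA[1] * muB[1]])
    # Layer 3: normalization, (2.89)
    w_bar = w / w.sum()
    # Layer 4: weighted linear consequents, (2.90)
    f = np.array([p * x1 + q * x2 + r for (p, q, r) in conseq])
    # Layer 5: aggregation, (2.91)
    return np.sum(w_bar * f)

print(round(anfis_forward(3.0, 4.0), 3))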
2.8.2 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS)

In 2002, Kasabov and Song proposed an alternative to ANFIS entitled the Dynamically Evolving Neuro-Fuzzy Inference Method. Neuro-fuzzy inference systems such as ANFIS consist of a set of rules and an inference method that are embodied in or combined with a connectionist structure for better adaptation (the learning mechanism is explained below). Evolving neuro-fuzzy inference systems, on the other hand, are systems in which both the knowledge and the inference mechanism evolve and change in time as more examples are presented to the system. Hence, DENFIS has been successfully applied to time series applications such as the stock price estimation problem [Kasabov and Song, 2002]. In the experiments chapter we present stock price estimation models; thus, we used DENFIS as a benchmarking hybrid fuzzy modeling tool. DENFIS has two variants, online and offline learning, both using the TSK type fuzzy inference method [Takagi and Sugeno, 1985; Sugeno and Kang, 1986]. In online learning, the linear functions in the consequent parts are created and updated through learning from data using a linear least squares estimator. In this work, we applied the offline learning strategy, because we use empirical data to build system models. In the offline model, the data is initially clustered to find c cluster centers using the online evolving clustering method (ECM) and the offline evolving clustering method with constraint optimization (ECMc). Then the antecedent part of the TSK style rules is created. Next, c datasets are formed, each of them including one cluster center and the k learning pairs that are closest to the center in the input space. In the following we briefly explain each of these steps of the DENFIS learning algorithm.
Offline Evolving Clustering Method with Constraint Optimization –ECMc
The offline ECMc [Kasabov and Song, 2002] applies an optimization procedure to the cluster centers obtained from the evolving clustering method (ECM). ECM implements a scatter partitioning of the input space (a crisp partitioning) for the purpose of creating fuzzy inference rules. ECM dynamically estimates the number of clusters in a set of data and finds their current cluster centers in the input data space. Within a cluster, the maximum distance, MaxDist, between an example point and the cluster center is less than a threshold value, Dthr, which is set as a clustering parameter before the model run. In ECM, the distance between vectors is measured with the Euclidean distance. At the start of the ECM process, the data examples come from a data stream and the process starts with an empty set of cluster centers. When the first cluster is created, its center, v1, is defined and its radius, R1, is initially set to zero. Then, as examples are presented to the algorithm, clusters are updated by changing their centers' positions and increasing their radii. The position of the currently presented data sample determines which cluster is updated and by how much. A cluster is no longer updated when its radius, Ri, i=1,…,c, reaches the threshold value Dthr. A more detailed background of ECM can be found in [Kasabov and Song, 2002]. Next, the cluster centers, vi, obtained from ECM are used as initial cluster centers for the offline clustering method, ECMc. ECMc uses the Euclidean distance measure between data vectors. In each cluster, cluster centers are calculated and an objective function is minimized based on the following distance measure:

J_ECMc = Σ_{i=1}^{c} [ Σ_{xk∈vi} ‖xk − vi‖ ], k = 1,…,n    (2.92)

subject to

‖xk − vi‖ ≤ Dthr, i = 1, 2,…,c    (2.93)
Each cluster is defined by a binary membership matrix U = [uik] ∈ {0,1}^{c×n}, in which the element uik = 1 implies that the kth data object belongs to cluster i. Once the cluster centers vi are fixed, uik is obtained by

uik = 1 if ‖xk − vi‖ ≤ ‖xk − vj‖ for all j ≠ i; otherwise uik = 0.    (2.94)
Using the iterative algorithm in ALGORITHM 2.1, the cluster centers and membership matrix are determined.
46
2 Fuzzy Sets and Systems
ALGORITHM 2.1 Evolving Clustering Method for constraint optimization of DENFIS
Step 1: Initialize cluster centers with the ECM clustering process.
Step 2: Determine the membership matrix U by (2.94).
Step 3: Apply ECMc clustering using the objective function in (2.92) with constraints (2.93) to get new cluster centers.
Step 4: If the objective function J in (2.92) is below a certain value, or a certain number of iterations is exceeded, stop; otherwise return to Step 2.
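A minimal sketch of the alternation in ALGORITHM 2.1 follows, under stated assumptions: the initial centers from ECM are taken as given (here they are simply sampled from the data as a stand-in), the crisp assignment of (2.94) is used, assignments violating the Dthr constraint of (2.93) are flagged, and the center update by cluster means is one plausible reading of Step 3.

import numpy as np

def ecmc_step(X, V, Dthr, n_iter=20):
    # X: (n, nv) data; V: (c, nv) initial centers (assumed to come from ECM).
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)  # (n, c)
        labels = dist.argmin(axis=1)                                  # (2.94)
        violations = dist[np.arange(len(X)), labels] > Dthr           # (2.93)
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else V[i] for i in range(len(V))])
        if np.allclose(V_new, V):          # stop when centers settle
            break
        V = V_new
    return V, labels, violations

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
V0 = X[rng.choice(len(X), size=2, replace=False)]   # stand-in for ECM output
V, labels, violations = ecmc_step(X, V0, Dthr=1.5)
print(V.round(2), int(violations.sum()))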
DENFIS Learning Mechanism
After the ECMc obtains the cluster centers, DENFIS uses a Takagi-Sugeno-Kang type fuzzy inference engine [Takagi and Sugeno, 1985; Sugeno and Kang, 1986]. The antecedent fuzzy sets are triangular fuzzy numbers,

μ(x) = 0,                 x ≤ a
       (x − a)/(b − a),   a ≤ x ≤ b
       (c − x)/(c − b),   b ≤ x ≤ c
       0,                 c ≤ x        (2.95)
where b is the value of the cluster center on the x domain, a = b − d×Dthr and c = b + d×Dthr, and d ∈ [1.2, 2] is defined by the user. The threshold value, Dthr, is determined during the clustering process. The DENFIS offline learning is implemented in the following way:
• The cluster centers obtained from ECMc are used to construct the antecedent part of the rules as in (2.95).
• c datasets are captured, each of which includes one cluster center and the n data pairs that are closest to the center in the input space. Hence, data pairs may belong to several clusters.
• The consequent functions are learned either using the weighted least squares method, with membership values as the weight vector, or by applying an ANN learning method to find the consequent functions.

Only a brief introduction to DENFIS is presented here; more detailed information can be found in [Kasabov and Song, 2002].
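The sketch below illustrates two of these ingredients: the triangular antecedent of (2.95), built from a cluster-center coordinate b and the user constant d, and a weighted least squares consequent fit that uses the membership values as the weight vector. All data and parameter values are synthetic, chosen only for illustration.

import numpy as np

def tri_mf(x, b, d, Dthr):
    # Triangular antecedent of (2.95) with a = b - d*Dthr, c = b + d*Dthr.
    a, c = b - d * Dthr, b + d * Dthr
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def weighted_ls(X, y, w):
    # Weighted least squares consequent; w holds the membership values.
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept
    W = np.diag(w)
    return np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (50, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.05, 50)
b_center, d, Dthr = 0.5, 1.5, 0.2                # hypothetical cluster values
w = tri_mf(X[:, 0], b_center, d, Dthr)
print(weighted_ls(X, y, w).round(2))             # roughly [0.5, 2.0]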
2.8.3 Genetic Fuzzy Systems (GFS)

In recent years, a great number of publications have explored the use of genetic algorithms as a tool for designing fuzzy systems. Genetic fuzzy systems apply genetic algorithm principles to optimize the parameters of fuzzy inference systems; in other words, a fuzzy system generated or adapted by genetic algorithms is called a Genetic Fuzzy System (GFS). A genetic algorithm (GA) is an optimization and search technique based on the principles of genetics and natural selection. (Genetic algorithms are briefly presented in Appendix C.3.)
Genetic algorithms were introduced by John Holland in 1973, and the methodology was later developed further [Goldberg, 1989] and applied to various domains. Genetic algorithms have been used to perform different tasks, such as the generation of fuzzy rule bases, the optimization of fuzzy rule bases, the generation of membership functions, and the tuning of membership functions [Cordon et al., 2001]. All of these tasks can be considered optimization or search processes. The first step in designing a GFS is to decide which parts of the knowledge base are subject to optimization by the GA. As an example, the knowledge base of a descriptive Takagi-Sugeno-Kang (TSK) type fuzzy system [Takagi and Sugeno, 1985; Sugeno and Kang, 1986] is comprised of two components: a database containing the definitions of the membership functions of the fuzzy sets associated with the linguistic labels, and a fuzzy rule base, constituted by the collection of (non)linear functions. The GFS utilized as a benchmarking tool in the experiments chapter of this work is based on a traditional TSK type fuzzy inference system [Cordon et al., 2001]. In GFS, each chromosome encodes a complete fuzzy rule set. The parameters of triangular membership functions are optimized during the GA process. Here we use triangular membership functions defined with two parameters, as follows:

μ(x) = 0,                if x ≤ a − b/2
       1 − 2|x − a|/b,   if a − b/2 < x < a + b/2        (2.96)
       0,                if x ≥ a + b/2

The parameters a and b define the center and spread of the triangular membership function, as shown in Figure 2.16. First-order TSK models, which identify linear local functions for each rule, are constructed. Using the rule base structure shown in (2.83), the TSK fuzzy model is described as follows:
Rule i: IF x1 is Ai^1 and … and xnv is Ai^nv THEN yi* = βi^1 x1 + … + βi^nv xnv + βi^0    (2.97)
where i=1,…,c, c is the number of IF-THEN rules, x = [x1,…,xnv] are the premise input variables, nv is the number of input variables, βi^j, j=1,…,nv, are the consequent parameters, yi* is the output of the ith IF-THEN rule, and Ai^j is a fuzzy variable. Given the input x, the final output of the fuzzy model is inferred by a weighted mean defuzzification as
y* = ( Σ_{i=1}^{c} μi(x) yi* ) / ( Σ_{i=1}^{c} μi(x) )    (2.98)
where the weight μi(x) implies the overall truth value of the premise of the ith implication for the input, and is calculated as
μi(x) = AND_{j=1}^{nv} μi(xj), ∀i = 1,…,c    (2.99)
AND is usually taken as a t-norm function, so (2.99) is calculated using the product t-norm as:
μi(x) = ∏_{j=1}^{nv} μi(xj), ∀i = 1,…,c    (2.100)

where μi(xj) is the grade of the membership function of Ai^j, characterized by a triangular function as in (2.96). Thus, the design of the fuzzy model consists of optimizing the membership function parameters, i.e., a and b in (2.96), which represent the center and spread of the membership function. The effects of changes in the membership function parameters on the membership values are shown in Figure 2.16. The fitness of each chromosome is evaluated by the error between the fuzzy model output and the desired output, which is to be minimized. To use GAs to design a TSK model, one encodes the TSK model parameters into a chromosome. To completely represent a fuzzy system, each chromosome must encode all the needed information about the rule set and membership functions. In order to represent the TSK fuzzy system described above, each chromosome encodes the parameters of each antecedent membership function in each rule. When a scalar-valued gene structure is used, each token, as shown in Figure 2.15, represents the value of one parameter. The genetic algorithm has to be set up with upper and lower bounds for each parameter, within which the algorithm may search. In the general triangular representation there are three vertex positions to optimize and, for simplicity, each corner is given the same amount of freedom. Here, however, each triangular membership function of each antecedent fuzzy set Ai^j is defined with two parameters, i.e., center and spread, so for a TSK model with c rules and nv input variables there will be c×nv×2 tokens in each chromosome. The tokens ai^j and bi^j, i=1,…,c, j=1,…,nv, represent the parameters of the jth antecedent membership function in the ith rule.
Fig. 2.15 Chromosome structure of a Genetic Fuzzy System. Each ai^j and bi^j indicates the triangular membership function parameters
The freedom afforded to each corner position of the triangular membership function can be calculated as follows:

Corner variation = ± abs(data_range) / (2 × (number of membership functions))

This relationship ensures that each membership function always maintains its triangular shape and that conflicts between membership functions never arise. Moreover, each membership function is independent of every other membership function, so the optimum parameters for each MF will prevail as the algorithm evolves.
Fig. 2.16 The effect of changing parameters on membership function
Since the consequent parameters βi^j, j=1,…,nv, can be obtained by the least squares optimization method, explicit inclusion of the consequent parameters in a chromosome is unnecessary. It is only necessary to encode into the chromosome the parameters of the antecedent membership functions, i.e., the centers and widths that represent a triangular membership function. The initialization of a chromosome is usually done in the same way as in standard GAs, using randomly generated numbers. The encoded chromosome is evolved by the conventional evolutionary operations (reproduction (crossover) and mutation). After each evolutionary step, the antecedent membership values of each data point are calculated using the parameters of each antecedent membership function encoded in each chromosome. Then the consequent parameters βi^j are calculated using the least squares algorithm. The mean square error (MSE) on the training data or the validation data is regarded as the objective function (fitness function) of each chromosome and can be calculated as

J_mse = (1/n) Σ_{k=1}^{n} (yk − yk*)²    (2.101)
where n is the data length, yk* is the predicted output, and yk is the target output. Based on the fitness obtained with (2.101), the offspring are chosen for the next crossover and mutation steps using the tournament selection approach⁴. Crossover and mutation operations are then performed on the current chromosomes of the model to generate new chromosomes in the search space. An elitist reinsertion approach guarantees that the best chromosome in the population always survives and is retained in the next generation. These fitness functions are evaluated for each individual until a fixed number of generations is reached or the search process converges to a given accuracy. The best chromosomes are used to determine the optimal TSK fuzzy model.

⁴ Some of the characteristic functions of genetic algorithms are reviewed in Appendix C.3.
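The following sketch shows how the fitness of a single chromosome could be evaluated: the c×nv×2 tokens are decoded into the triangular membership functions of (2.96), the firing strengths are computed with the product t-norm of (2.100), the consequent parameters are fitted by least squares, and the MSE of (2.101) is returned. The per-rule weighted least squares fit is a simplification chosen for brevity, and all data are synthetic.

import numpy as np

def tri_mf(x, a, b):
    # Two-parameter triangular MF of (2.96): center a, spread b.
    return np.maximum(1.0 - 2.0 * np.abs(x - a) / b, 0.0)

def fitness(chrom, X, y, c):
    # MSE fitness (2.101) of one chromosome encoding c*nv*2 tokens.
    n, nv = X.shape
    params = chrom.reshape(c, nv, 2)            # (a, b) per rule and input
    mu = np.ones((n, c))
    for i in range(c):                          # (2.100): product t-norm
        for j in range(nv):
            mu[:, i] *= tri_mf(X[:, j], params[i, j, 0], params[i, j, 1])
    mu = np.clip(mu, 1e-9, None)                # keep every rule minimally active
    Xb = np.column_stack([X, np.ones(n)])
    y_hat = np.zeros(n)
    for i in range(c):                          # per-rule weighted least squares
        sw = np.sqrt(mu[:, i])[:, None]
        beta = np.linalg.lstsq(Xb * sw, y * sw.ravel(), rcond=None)[0]
        y_hat += mu[:, i] * (Xb @ beta)
    y_hat /= mu.sum(axis=1)                     # (2.98): weighted mean
    return np.mean((y - y_hat) ** 2)            # (2.101)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (80, 2))
y = X[:, 0] + 2 * X[:, 1]
chrom = rng.uniform(0.2, 1.0, size=2 * 2 * 2)   # c=2 rules, nv=2 inputs
print(round(fitness(chrom, X, y, c=2), 4))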
2.9 Summary

We briefly summarized the characteristics and operations of Type-1 fuzzy sets, Type-2 fuzzy sets, fuzzy functions, and extensions of well-known fuzzy structure identification and inference systems. This chapter presents the foundations of the algorithms proposed in this work; hence, it can be considered a reference for the reader. The following chapters will present the essentials of this work, including the improved fuzzy clustering algorithm and the structure and mathematical algorithms of the proposed fuzzy system modeling based on novel fuzzy functions.
Chapter 3
Improved Fuzzy Clustering
The new fuzzy system modeling approach based on fuzzy functions implements a fuzzy clustering¹ algorithm during structure identification of the given system. This chapter introduces the foundations of fuzzy clustering algorithms and compares different types of well-known fuzzy clustering approaches. Then, a new improved fuzzy clustering approach is presented, to be used in fuzzy functions approaches to re-shape membership values into powerful predictors. Lastly, two new cluster validity indices² are introduced to validate the results of the improved fuzzy clustering algorithm.

¹ This section of this chapter is an extension of the papers [Celikyilmaz, Turksen, 2007a,b].
² This section of this chapter is an extension of the papers [Celikyilmaz, Turksen, 2007c and 2008c].
Everything is vague to a degree, you do not realize till you have tried to make it precise —Bertrand Russell
3.1 Introduction

In 1965 and later in 1975, Zadeh [1965; 1975a] introduced the concept of mathematical modeling with imprecise propositions: fuzzy sets and fuzzy logic. Since then, fuzzy sets and logic have been applied in many areas to handle uncertain information and to simulate how inferences can be made with uncertain information. It is known that noise and uncertainty can never be totally eliminated from any given database system. A common way to explain these types of uncertainties further is to use fuzzy logic theory in clustering problems. Fuzzy clustering algorithms can identify overlapping clusters in a given dataset while calculating membership values that specify to what degree each object belongs to these clusters. In the next section, we first present the terminology of fuzzy clustering methods and a general classification of fuzzy cluster analysis. The "Fuzzy Functions" system modeling approach of this work (Chapter 4) utilizes the most commonly used fuzzy clustering algorithm, namely the Fuzzy c-Means (FCM) clustering algorithm [Bezdek, 1981a], for structure identification; therefore, the mathematical background of FCM is presented in this chapter. Next, a novel Improved Fuzzy Clustering (IFC) algorithm, to be utilized instead of the FCM clustering algorithm in "Fuzzy Functions" approaches for prediction (regression) problems, is presented. The motivation and theory of the IFC algorithm are presented along with its mathematical transformation. An extension of the novel IFC, entitled Improved Fuzzy Clustering for Classification (IFC-C), is also proposed for pattern recognition problem domains. Later, the performance of the IFC algorithm in comparison to the FCM clustering algorithm is justified with examples. Lastly, two new cluster validity indices (CVI) are presented, to be used to find the optimum number of clusters with the IFC algorithm for classification and regression type system domains. Using artificial datasets, the performance of the new CVIs is discussed and the results are compared to those of other well-known CVIs.
3.2 Fuzzy Clustering Algorithms

System modeling with soft computing can be broadly separated into two main parts: global and local system modeling [Babuška and Verbruggen, 1997]. In global modeling, the overall system is analyzed as a whole to understand the underlying relationships, if any. In local modeling, the system under study is first decomposed into meaningful parts, and sub-models are built using linear or non-linear methods. Hence, the class of fuzzy clustering algorithms is used to identify these local models. There is numerous ongoing research on enhancing fuzzy clustering algorithms, which can be classified as follows based on the clustering structure:

• fuzzy clustering based on fuzzy relations,
• fuzzy clustering based on an objective function and covariance matrix,
• non-parametric classifiers, e.g., the fuzzy generalized k-nearest neighbour rule,
• neuro-fuzzy clustering, e.g., self-organizing maps, fuzzy learning vector quantization, etc.
This work focuses on "objective" based fuzzy clustering algorithms, which assign an error or a quantitative measure to each possible cluster partition using an evaluation function and try to minimize/maximize the total error/quality. The ideal solution is reached when the cluster partitions are assessed to obtain the best evaluation. Objective based clustering algorithms thus try to solve an optimization problem.

Definition 3.1. (Objective Function) The objective function J(f), or J, is an error or quantitative measure, and our aim in fuzzy clustering algorithms is to find the global minimum or maximum of J, depending on the structure of the clustering algorithm. J is used to compare different solutions, usually for the same clustering problem.
Let X = {x1, x2,…, xn} represent a set of n objects, where each kth object, k=1,…,n, is represented by an nv-dimensional vector xk = [x1,k,…, xnv,k]^T ∈ ℜ^nv. A set of n vectors is then represented by the n×nv data matrix

X = [ x1,1   x1,2   …   x1,nv
      …     …     …   …
      xn,1   xn,2   …   xn,nv ]    (3.1)
A fuzzy clustering algorithm partitions the given dataset X into c overlapping clusters, forming a fuzzy partition matrix, U.

Definition 3.2. (Fuzzy Partition Matrix). The fuzzy partition matrix, U, is a matrix of the degrees of membership of every object xk, k=1,…,n, in every cluster i, i=1,…,c. The degree of membership of the kth vector in cluster i is represented by μik ∈ U. The partition matrix is of the form:

U = [ μ1,1   μ2,1   …   μc,1
      …     …     …   …
      μ1,n   μ2,n   …   μc,n ]    (3.2)
In a fuzzy clustering algorithm, each cluster is represented by a vector called the vector of "cluster centers" or "cluster prototypes", with which the cluster structures in the overall X can be represented.

Definition 3.3. (Vector of Cluster Centers/Prototypes). In a dataset X of nv-dimensional vectors, a fuzzy clustering algorithm identifies c cluster center vectors, V = {υ1, υ2,…, υc} ∈ ℜ^{c×nv}, where each cluster center υi ∈ ℜ^nv is also an nv-dimensional vector. Each cluster center (υi) is usually represented as the centroid of its objects, e.g., the average of all the data of the corresponding cluster.

Among the many different types of fuzzy clustering algorithms, this work deals with objective function based and point-wise (distance-based) clustering algorithms. Since an extension of the presented system modeling approach with "Fuzzy Functions", to be discussed in the following chapters, executes the fuzzy c-means (FCM) clustering algorithm for structure identification of the presented fuzzy system modeling, a detailed explanation of this algorithm is presented next.
3.2.1 Fuzzy C-Means Clustering Algorithm

The Fuzzy c-Means (FCM) [Bezdek, 1981a] clustering algorithm is a simple and yet the most commonly used and extended fuzzy clustering method of all fuzzy clustering approaches. In the FCM clustering algorithm, it is assumed that the number of clusters, c, is known or at least fixed, i.e., the FCM algorithm partitions the given dataset X = {x1,…,xn} into c clusters. Since the assumption of a known or previously fixed number of clusters is not realistic for many data analysis problems, there are
techniques, such as cluster validity index (CVI) analysis, to determine the number of clusters for the FCM clustering algorithm. Some well-known CVIs are discussed, and two new cluster validity criteria are introduced, in later sub-sections of this chapter. Let each of the c clusters be represented by a cluster prototype, vi. The FCM clustering algorithm tries to minimize an objective function with two pieces of prior information, the number of clusters, c, and the fuzziness parameter, m, as follows:

min J(X; U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μik)^m d²(xk, υi)    (3.3)
In (3.3), m ∈ (1,∞) represents the "degree of fuzziness"³ or "fuzzifier" of the fuzzy clustering algorithm, and it determines the degree of overlap of the clusters; m=1 would mean no overlap, which represents a crisp clustering structure. Here, d²(xk, υi) is a measure of the distance between the kth object and the ith cluster center. Squared distances ensure that the objective function is non-negative, J > 0. The objective function will be 0 when every data object is itself a cluster center, i.e., c=n. On the other hand, when data objects are farther away from the cluster centers, υi, the objective function gets larger. The location and the number of cluster centers affect the value of the objective function. The criterion of the objective function is minimal at the optimum solution, and one should search for the global minimum. In order to avoid trivial solutions, two constraints are imposed on the partition matrix U, as follows:
Σ_{i=1}^{c} μik = 1, ∀k    (3.4)

0 < Σ_{k=1}^{n} μik < n, ∀i    (3.5)
The constraint in (3.4) implies that each row of the partition matrix in (3.2) adds up to 1⁴. The constraint in (3.5) implies that the column total of the membership values can neither exceed the number of data vectors, n, nor be zero; this indicates that at least one member is assigned to each cluster. However, neither of these constraints forces the membership values of each cluster to have a certain distribution. The general formula of the distance measure is given by:
d²(xk, υi) = (xk − υi)^T Ai (xk − υi) ≥ 0    (3.6)

³ Fuzziness is a type of uncertainty (of imprecision) accepted in uncertainty theory [Zadeh, 1965, 1975a]. Various functions have been proposed to measure the degree of fuzziness. In fuzzy clustering algorithms, the overlapping constant, m, is used as the degree of fuzziness. In later chapters m will be used as a parameter to define uncertainty in the proposed fuzzy functions approach, along with other measures.
⁴ In some research, such as Krishnapuram and Keller (1993), (3.4) is relaxed in the possibilistic approach to clustering.
In (3.6) the norm matrix Ai, i=1,…,c, is a positive definite symmetric matrix. Other distance measures can also be used in fuzzy clustering algorithms; a short list of different distance measures is given in Table 3.1. The FCM clustering algorithm uses the Euclidean distance, so the norm matrix Ai is equal to the identity matrix (the input matrix being scaled to standard deviation 1 and mean 0), i.e., A=I. On the other hand, Gustafson and Kessel [1979] use the Mahalanobis distance, in which case the norm matrix of each cluster is equal to the inverse of the covariance matrix of that cluster, i.e., Ai = Ci^{-1}.

Table 3.1 Distance Measures

Distance Measure        Function
Euclidean Distance      d2(a,b) = [ Σ_{i=1}^{nv} (ai − bi)² ]^{1/2}
Minkowski Distance      dp(a,b) = [ Σ_{i=1}^{s} |ai − bi|^p ]^{1/p}, p > 0
Maximum Distance        d∞(a,b) = max_{i=1,…,nv} |ai − bi|
Mahalanobis Distance    dA(a,b) = (a − b)^T A (a − b)
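For reference, the measures in Table 3.1 translate directly into code. In the sketch below the norm matrix A is a hypothetical inverse covariance matrix; choosing A = I would recover the (squared) Euclidean case.

import numpy as np

def d_euclidean(a, b):
    # Euclidean distance of Table 3.1.
    return np.sqrt(np.sum((a - b) ** 2))

def d_minkowski(a, b, p):
    # Minkowski distance, p > 0.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def d_maximum(a, b):
    # Maximum (Chebyshev) distance.
    return np.max(np.abs(a - b))

def d_mahalanobis_sq(a, b, A):
    # Mahalanobis form (a - b)^T A (a - b) of Table 3.1.
    return (a - b) @ A @ (a - b)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
A = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))  # hypothetical inverse covariance
print(d_euclidean(a, b), d_minkowski(a, b, 3), d_maximum(a, b),
      round(d_mahalanobis_sq(a, b, A), 3))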
From (3.3)-(3.6), one can see that the FCM clustering algorithm is a constrained optimization problem, which should be minimized in order to obtain optimum results. Therefore, the FCM clustering algorithm can be written as a single optimization structure as follows:

min J(X; U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μik)^m d²_A(xk, υi)
s.t.  0 ≤ μik ≤ 1, ∀i,k
      Σ_{i=1}^{c} μik = 1, ∀k
      0 < Σ_{k=1}^{n} μik < n, ∀i    (3.7)
The constrained optimization model in (3.7) [Bezdek, 1981a] can be solved using a well-known method in mathematics, namely the Lagrange multiplier method [Khuri, 2003], by which the model is converted into an unconstrained optimization problem with one objective function. The primal constrained optimization problem is first converted into an equivalent unconstrained problem with the help of unspecified parameters known as Lagrange multipliers, λ:

max W(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μik)^m d²_A(xk, υi) − λ ( Σ_{i=1}^{c} μik − 1 )    (3.8)
According to the Lagrangian method, the Lagrangian function must be minimized with respect to the primal parameters and maximized with respect to the dual parameters, and the derivatives of the Lagrangian function in (3.8) with respect to the original model parameters, U and V, should vanish. Hence, by taking the derivatives of the objective function in (3.8) with respect to the cluster centers, V, and the membership values, U, the optimum membership value calculation equation and cluster center equation are formulated as:
μik^(t) = [ Σ_{j=1}^{c} ( d²(xk, υi^(t−1)) / d²(xk, υj^(t−1)) )^{1/(m−1)} ]^{−1}    (3.9)

υi^(t) = ( Σ_{k=1}^{n} (μik^(t))^m xk ) / ( Σ_{k=1}^{n} (μik^(t))^m ), ∀i = 1,…,c    (3.10)
In (3.9), υi^(t−1) represents the cluster center vector of cluster i obtained at the (t−1)th iteration. Similarly, in (3.9) and (3.10), μik^(t) denotes the optimum membership values calculated at the tth iteration. The derivation of the membership value calculation formula in (3.9) and of the cluster center function in (3.10) can be found in Appendix B.1. These results show that the membership values and cluster centers depend on each other, so Bezdek [1981a] proposed an iterative algorithm to calculate them. The objective function J(t) at each iteration, t, is measured by

J^(t) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μik^(t))^m d²(xk, υi^(t)) > 0    (3.11)
The FCM algorithm stops according to a termination criterion, e.g., after a certain number of iterations, or when the magnitude of the separation of the two nearest clusters is less than a pre-determined value (ε), etc. The iterative FCM clustering algorithm is shown in ALGORITHM 3.1. The effect of the fuzziness value, m, can be analyzed by taking the limit of the membership value calculation equation in (3.9) at the boundaries, as follows:

lim_{m→∞} μik(x) = lim_{m→∞} [ Σ_{j=1}^{c} ( d²(xk, υi) / d²(xk, υj) )^{1/(m−1)} ]^{−1} = 1/c, ∀i, j = 1,…,c    (3.12)
ALGORITHM 3.1 Fuzzy C-Means Clustering Algorithm (FCM)
Given data vectors, X={x1,…,xn}, the number of clusters, c, the degree of fuzziness, m, and the termination constant, ε (maximum iteration number in this case). Initialize the partition matrix, U, randomly.
Step 1: Find the initial cluster centers using (3.10), with the membership values of the initial partition matrix as inputs.
Step 2: Start iteration t=1,…,max-iteration value;
  Step 2.1: Calculate the membership values of each input data object k in cluster i, μik^(t), using the membership value calculation equation in (3.9), where the xk are the input data vectors and the υi^(t−1) are the cluster centers from the (t−1)th iteration.
  Step 2.2: Calculate the cluster center of each cluster i at iteration t, υi^(t), using the cluster center function in (3.10), where the inputs are the input data matrix, xk, and the membership values of iteration t, μik^(t).
  Step 2.3: Stop if the termination condition is satisfied, e.g., |υi^(t) − υi^(t−1)| ≤ ε. Otherwise go to Step 2.1.
Under the assumption that no cluster centers are alike, we also get, at the other boundary,

lim_{m→1} μik = { 1, if d²(xk, υi) < d²(xk, υj), ∀j ≠ i
               { 0, otherwise    (3.13)
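A compact implementation of ALGORITHM 3.1 is sketched below; it vectorizes the membership update (3.9) and the center update (3.10) over all clusters and objects, guards against zero distances, and uses the |υi^(t) − υi^(t−1)| ≤ ε termination rule. The synthetic two-blob dataset is for illustration only.

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    # A minimal sketch of ALGORITHM 3.1, alternating (3.9) and (3.10).
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                       # constraint (3.4)
    V = (U ** m @ X) / (U ** m).sum(axis=1, keepdims=True)   # (3.10)
    for _ in range(max_iter):
        # squared Euclidean distances of all objects to all centers
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                              # avoid division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)                            # (3.9)
        V_new = (U ** m @ X) / (U ** m).sum(axis=1, keepdims=True)
        if np.abs(V_new - V).max() <= eps:                   # termination rule
            V = V_new
            break
        V = V_new
    return V, U

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(3, 0.4, (40, 2))])
V, U = fcm(X, c=2)
print(V.round(2))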
As the value of m is increased, μik converges to 1/c; e.g., when m=6, even a strong membership such as μik=0.85 will be decreased to somewhere close to 1/c (see (3.12)). Since the parameter m represents the degree of overlap of the clusters, the larger m gets, the fuzzier the results will be and the wider the overlap. As m gets smaller, the fuzzy clustering result gets closer to a crisp clustering model; m=1 is the same as crisp clustering, where there is no overlap between clusters and all the membership values are μik ∈ {0,1}. Earlier research [Turksen, 1999] indicates that, as a rule of thumb, m=2 should be used in system modeling analysis. In a more recent study [Ozkan and Turksen, 2007], the maximum and minimum values of m were mathematically derived as m ∈ [1.4, 2.6], based on a Taylor expansion analysis of the membership value calculation function. Also in this work, the fuzziness interval, m ∈ [m-lower, m-upper], will be investigated based on a new fuzzy modeling approach in the uncertainty modeling chapter, Chapter 5, entitled "Modeling uncertainty with Improved Fuzzy Functions". The FCM [Bezdek, 1981a] clustering algorithm can be regarded as the standard fuzzy clustering algorithm, and many extensions of it have been proposed for various purposes. FCM clustering is also the core structure of the proposed improved fuzzy clustering algorithm (IFC) of this work, to be explained in the following sub-sections. In the next sub-section, different classifications of fuzzy clustering algorithms are reviewed.
3.2.2 Classification of Objective Based Fuzzy Clustering Algorithms

Objective function based fuzzy clustering algorithms other than the FCM clustering algorithm can be categorized into three groups based on their purpose in system modeling, as follows:
1. Algorithms that use adaptive distance measures [Gustafson and Kessel, 1979] or fuzzy maximum likelihood estimation methods [Gath and Geva, 1989],
2. Algorithms based on fuzzy linear varieties and fuzzy c-elliptotypes [Bezdek et al., 1981b; Bezdek et al., 1981c],
3. Fuzzy c-regression algorithms using prototypes defined by regression functions [Hathaway and Bezdek, 1993].

Gustafson and Kessel [1979] proposed that for each cluster a different symmetric and positive semi-definite matrix, A, should be used. Gath and Geva's method [1989] is an extension of the Gustafson and Kessel [1979] algorithm that also uses the size and density of the point-wise clusters. On the other hand, the fuzzy c-varieties algorithm of Bezdek [1981a; 1981b; 1981c] was developed for the recognition of lines, planes and hyper-planes. Shell fuzzy clustering algorithms are more recently developed methods for recognizing circular, elliptical and parabolic shapes [Höppner et al., 1999]. The fuzzy c-regression model clustering algorithm, to be discussed in the following, falls into the third category. Since the scope of this work only covers the third category of fuzzy clustering algorithms, we leave out the definitions of the first two.
3.2.3 Fuzzy C-Regression Model (FCRM) Clustering Algorithm

The objective of the fuzzy c-regression model (FCRM) clustering algorithm [Hathaway and Bezdek, 1993], as in all clustering algorithms, is to classify objects into similar groups. The FCRM clustering algorithm yields simultaneous estimates of the parameters of c regression models while fuzzy partitioning a given dataset. A prominent feature of this clustering algorithm, which separates it from point-wise clustering algorithms such as FCM, is that the cluster prototypes are functions instead of geometrical objects. The FCRM clustering algorithm can be used to separate linear patterns, where each pattern can be identified by a linear function. Hathaway and Bezdek [1993] illustrated the domain of the FCRM algorithm with a simple example, as shown in Figure 3.1. The artificial dataset used for this example consists of two regression models, e.g., linear functions. In [Hathaway and Bezdek, 1993] these models are called switching regression models, and the FCRM algorithm tries to identify these structures using fuzzy sets. Although FCRM clustering algorithms are based on the standard FCM [Bezdek, 1981a] clustering algorithm, there are various differences between them, which can be briefly listed as follows:
• In the FCM [Bezdek, 1981a] clustering algorithm, clusters are hyper-sphere shaped, whereas in the FCRM clustering algorithm [Hathaway and Bezdek, 1993], clusters are hyperplane-shaped.
• The representatives of the clusters of FCM are cluster centers, υi, whereas the representatives of the clusters in FCRM are hyper-planes, which, for a multi-input single-output model with nv-dimensional inputs and one output, are represented by

yi = βi^0 + βi^1 x1 + … + βi^nv xnv,    (3.14)

where the βi are the regression coefficients of each function i, i=1,…,c.
• The FCM algorithm calculates cluster centers by averaging the data vectors weighted with their membership values; FCRM calculates the cluster representative functions by a weighted least squares regression algorithm.
In order to demonstrate the FCRM clustering algorithm, the artificial dataset in Figure 3.1 is composed using two different linear functions with random noise, ε, of mean 0 and standard deviation 0.05, using MATLAB's rand(x) function, where x ∈ [0,1], i.e., f1(x) = rand(x) + ε and f2(x) = ε + 0.3.

Fig. 3.1 Scatter plot of artificial data with two different random functions: f1(x) = y + random_noise, f2(x) = 0.3 + random_noise
The FCRM clustering algorithm yields simultaneous estimates of the parameters of c regression models together with a fuzzy c-partitioning of the data [Hathaway and Bezdek, 1993]. Each regression model is defined as

yk = fi(xk, βi)    (3.15)
In (3.15), xk = [x1,k,…,xnv,k]^T ∈ ℜ^nv denotes the kth data object, βi ∈ ℜ^nv, i=1,…,c, are the parameters of the functions fi, and c represents the total number of functions. The performance of the captured functions is generally measured by

Eik(βi) = (yk − fi(xk, βi))²    (3.16)
The objective function to minimize the total error of these approximated functions is calculated by

E(U, βi) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μik)^m Eik(βi)    (3.17)
where m ∈ (1,∞) is the "fuzzifier" exponent, which determines the overlap of the functions, slightly differently than in the FCM [Bezdek, 1981a] algorithm. In the FCRM clustering algorithm, as m approaches ∞, each function will have almost the same parameters, i.e., coefficients (weights) βi; graphically speaking, all the functions will lie on top of each other. On the other hand, as m approaches 1, the functions will be as distinct from each other as possible. Therefore, the choice of the parameter m affects the performance of FCRM clustering models as well. In the FCRM modeling algorithm, the membership values, μik, are interpreted as weights, i.e., coefficients of the linear or polynomial regression functions. They represent to what extent the values predicted by the model, fi(xk, βi), are close to yk. Therefore, based on the membership value calculation equation of FCM in (3.9), the membership value calculation equation of the FCRM algorithm is re-formulated in [Hathaway and Bezdek, 1993] as follows:
μik = [ Σ_{j=1}^{c} ( Eik / Ejk )^{1/(m−1)} ]^{−1}, ∀i, j = 1,…,c    (3.18)
Minimization of the objective function in (3.17) yields the optimization model shown in ALGORITHM 3.2.

ALGORITHM 3.2 Fuzzy C-Regression Models Clustering Algorithm (FCRM)
Given the data vectors, X={x1,…,xn}, choose the number of clusters, c, the degree of fuzziness, m, the termination constant, ε, the maximum iteration number, and the structure of the regression models in (3.15). Initialize the partition matrix, U, randomly.
Start iteration t=1,…,max-iteration value;
Step 1: Calculate the values of the model parameters βi that minimize the cost function in (3.17).
Step 2: Update the partition matrix, μik ∈ U, using (3.18).
Terminate if |U^(t) − U^(t−1)| ≤ ε; otherwise go to Step 1.
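A minimal sketch of ALGORITHM 3.2 with linear prototype functions follows: Step 1 solves the membership-weighted least squares problem for each cluster (equivalent to the normal equations of (3.19), discussed next), and Step 2 updates the memberships from the squared residuals via (3.16) and (3.18). The switching-regression data mimic the spirit of Figure 3.1.

import numpy as np

def fcrm(X, y, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    # Alternates weighted least squares prototypes with the update of (3.18).
    rng = np.random.default_rng(seed)
    n = len(X)
    Xb = np.column_stack([np.ones(n), X])      # intercept + inputs
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        U_old = U.copy()
        # Step 1: prototype functions by weighted least squares
        betas = []
        for i in range(c):
            sw = np.sqrt(U[i])[:, None]
            beta, *_ = np.linalg.lstsq(Xb * sw, y * sw.ravel(), rcond=None)
            betas.append(beta)
        # Step 2: membership update from squared residuals, (3.16)/(3.18)
        E = np.array([(y - Xb @ b) ** 2 for b in betas])
        E = np.fmax(E, 1e-12)
        inv = E ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
        if np.abs(U - U_old).max() <= eps:
            break
    return np.array(betas), U

rng = np.random.default_rng(4)
x = rng.uniform(0, 0.7, 200)
y = np.where(rng.random(200) < 0.5, x, 0.3) + rng.normal(0, 0.05, 200)
betas, U = fcrm(x[:, None], y, c=2)
print(betas.round(2))       # one line near y = x, one near y = 0.3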
Hathaway and Bezdek [1993] search for the optimum parameters of the functions using the weighted least squares method, using the membership degrees of the fuzzy partition matrix U as weights. In this specific situation, the membership degrees Ui, the input matrix Xi, and the output vector y are represented by

Xi = [ xi,1^T        y = [ y1        Ui = [ μi1   0    …   0
       xi,2^T              y2               0    μi2   …   0
       …                   …                …    …     …   …
       xi,n^T ],           yn ],            0    0     …   μi,n ]

The parameters, βi, of each function, fi, are calculated with the weighted least squares algorithm:

βi = [X^T Ui X]^{−1} X^T Ui y    (3.19)
In the FCRM clustering method, linear functions are generally formulated to find hidden structures in a given dataset; possible extensions of FCRM implement non-linear functions to find hidden patterns. The new improved fuzzy clustering algorithm of this work, which utilizes FCRM type clustering, introduces additional non-linearity into the clustering algorithm by using non-linear algorithms to find the functions in a novel way. For this purpose, the FCRM algorithm is one of the fundamental structures of the new clustering algorithm, along with the standard FCM clustering algorithm. Next, we review different clustering algorithms that combine regression and clustering methods in one structure.
3.2.4 Variations of Combined Fuzzy Clustering Algorithms

In 1993, Sugeno and Yasukawa published a paper on an FL-based approach to qualitative modeling in which they proposed achieving structure identification by using a combination of the FCM clustering algorithm [Bezdek, 1981a] and the group method of data handling. In their approach, the number of rules was captured by minimizing the variance within each output cluster and maximizing the variance between clusters. The main advantage of their method is that they were able to separate structure identification from parameter identification. These clusters constitute the building blocks of fuzzy rule based approaches. Despite the popularity of the FCM clustering algorithm, other variations of clustering algorithms are also used in the literature to build fuzzy system models for different purposes. One prominent FCM variation, the fuzzy c-regression method (FCRM) presented in the previous section, is commonly used to fuzzy partition a given dataset for local function approximation. In one of these studies, Chen et al. [1998] present a new fuzzy clustering algorithm combining fuzzy c-functions clustering with an FCM-like clustering algorithm; their fuzzy c-functions clustering is based on linear fuzzy regression. In [Chen et al., 1998], Takagi-Sugeno-Kang (TSK) type system modeling structures [Takagi and Sugeno, 1985; Sugeno and Kang, 1986] are utilized. A non-linear optimization algorithm is applied to identify the parameters of the premises and consequents of each rule. In their approach, the consequent and premise parameters are
optimized in turns, with one fixed while the other is optimized. In sequence, they develop different criteria for the optimization of the premise and consequent model parameters based on the fuzzy c-partition space. Even though their algorithm cannot be considered one unified clustering algorithm, they still manage to combine fuzzy c-varieties and c-means clustering algorithms in one system model and use them interchangeably to find the parameters of the premises and consequents of the fuzzy rules, though again not in one optimization algorithm. On the other hand, Höppner and Klawonn [2003] combine the FCM [Bezdek, 1981a] and FCRM [Hathaway and Bezdek, 1993] algorithms in one clustering schema to build a combined clustering structure. Their main goal was to modify the objective function of the fuzzy clustering algorithm so as to prevent the effect of harmonics. Their aim was to eliminate counterintuitive membership values, such that the membership value of a linguistic term "young", which is high for "17 years", should not be higher for "23 years" than for "21 years". They deal not only with point-wise clustering algorithms such as the fuzzy c-means (FCM) clustering algorithm [Bezdek, 1981a], but also with fuzzy c-regression model clustering algorithms (FCRM) [Hathaway and Bezdek, 1993]. They modified the objective function of the FCM clustering algorithm to yield the following membership value calculation equation, based on their heuristic approaches:
μik = [ Σ_{j=1}^{c} ( d²ik − min_{i=1,…,c} d²ik − η ) / ( d²jk − min_{i=1,…,c} d²ik − η ) ]^{−1}, 0 < η    (3.20)
where η > 0 is a user-defined constant. The objective in [Höppner and Klawonn, 2003] was to find membership values that can eliminate harmonics by pushing larger membership values towards "1" and smaller ones towards "0", based on the distances of the objects. In [Höppner and Klawonn, 2003], each function, ŷi = β̂i^T x̂i, is interpreted as a rule in a T-S model. Combining the FCM and FCRM algorithms in one clustering schema (since both algorithms are objective based), Höppner and Klawonn [2003] introduced a new combined distance function, which is the combination of both methods, as follows:

d²ik((xk, yk), (υi, β̂i)) = ‖xk − υi‖² + (yk − β̂i^T x̂k)²    (3.21)
In (3.21), (xk, yk) is a given input-output data sample, xk ∈ X, yk ∈ Y, k=1,…,n, n is the total number of training data vectors, d² is the distance function, and υi is the cluster prototype of cluster i, i=1,…,c, as in the FCM cluster center function, c being the number of clusters. x̂k represents a user-defined polynomial; for instance, a two-dimensional polynomial can be formed with the following vector,
(3.22)
and the coefficients of the polynomial are represented with β̂i for each cluster i. Hence, the first term of the distance function in (3.21) is the FCM distance measure, and the second term is the FCRM distance measure, which is equal to the error of the estimated functions in (3.16). Based on the distance function in (3.21), they optimize the partition matrix in their new fuzzy clustering algorithm, which uses the membership value calculation equation in (3.20). The coefficients β̂i are obtained in the same way as the cluster centers of FCM clustering are obtained; therefore, the prototype update function of FCM clustering in (3.10) is replaced with
β̂i = [ Σ_{k=1}^{n} (μik)^m (x̂k x̂k^T) ]^{−1} [ Σ_{k=1}^{n} (μik)^m (yk x̂k) ], ∀i = 1,…,c    (3.23)
Their combined clustering algorithm is sketched in ALGORITHM 3.3.

ALGORITHM 3.3 Fuzzy Model Algorithm of Höppner and Klawonn (2003)
Step 1: Choose the number of clusters, c, the termination threshold, ε > 0, and η > 0.
Step 2: Initialize the cluster prototypes, υi.
Step 3: Update the membership matrix using (3.20) and the distances (3.21).
Step 4: Update the prototypes using FCM's prototype function in (3.10), and the coefficients using (3.23).
Step 5: Iterate until |U^(t) − U^(t−1)| ≤ ε.
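The sketch below evaluates the combined distance of (3.21) for a single object, given a hypothetical cluster prototype (a center υi plus polynomial coefficients β̂i) and the two-dimensional polynomial map of (3.22); all numeric values are illustrative.

import numpy as np

def combined_distance(x, y_val, v, beta, poly):
    # Combined distance of (3.21): FCM point distance plus FCRM fit error.
    fcm_part = np.sum((x - v) ** 2)
    fcrm_part = (y_val - beta @ poly(x)) ** 2
    return fcm_part + fcrm_part

# Polynomial map of (3.22)
poly = lambda x: np.array([1.0, x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

# Hypothetical cluster prototype: center v and polynomial coefficients beta
v = np.array([0.5, 0.5])
beta = np.array([0.1, 1.0, -0.5, 0.0, 0.2, 0.0])
x, y_val = np.array([0.7, 0.4]), 0.8
print(round(combined_distance(x, y_val, v, beta, poly), 4))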
It should be noted that the membership value calculation equation in (3.20) is not the result of a mathematical transformation obtained by combining the two clustering algorithms, FCM and FCRM. The authors also do not specifically state how they choose the polynomials to approximate local non-linear functions. Other variations of combined fuzzy clustering methods are also found in the literature. For instance, in [Wang et al., 2002], the FCRM algorithm is replaced with a gravity-based clustering algorithm, which is based on Newton's law of gravity. Their aim was to capture shell/curve-like structures in a given dataset, which is suitable for image analysis; they combine gravity-based clustering and fuzzy clustering into one integrated clustering algorithm. Leski's [2004] combined approach to fuzzy clustering is based on local non-linear function estimation of local fuzzy models using the FCRM [Hathaway and Bezdek, 1993] algorithm. Leski [2004] introduced an ε-insensitive fuzzy c-regression algorithm by converting the optimization algorithm of FCRM clustering into a similar support vector regression optimization algorithm to capture the outliers. Using the fuzzy partition space, Leski [2004] formed TSK type fuzzy system models. One drawback of this approach is that it is computationally more complex than the rest of the fuzzy clustering variations mentioned earlier.
A major disadvantage of most of the combined fuzzy clustering approaches reviewed above is that the algorithms are only defined for training datasets, where the output variable is known. The learning algorithms in these earlier approaches utilize the output values of the training samples during training. Similarly, the distance measures and membership value calculation equations use output values during clustering. In most investigations, it would be preferable if the membership value calculation equations of these fuzzy clustering methods could also be used to find the membership values of validation or testing data objects, whose output variables are not known. Almost none of these approaches explains explicitly how to calculate the membership values of verification or testing samples with the proposed membership value calculation equations, nor do they define an alternative membership value calculation equation. In this work, in Chapter 4, novel training and testing algorithms are explicitly defined to explain how to handle such problems, and a probabilistic case-based inference is proposed for reasoning. Before moving on to the details of the proposed fuzzy clustering approach, it should be pointed out that fuzzy clustering algorithms and their variations are designed for different purposes in system modeling. For instance, some fuzzy clustering methods are designed to capture noise in a given dataset, e.g., [Dave, 1991]; some try to eliminate harmonics [Kilic, 2002]; others search for relationships through linear or non-linear functions, e.g., [Leski, 2004]. In this sense, the new improved fuzzy clustering algorithm has a novel goal. The FCM and FCRM algorithms and their variations are the basis of the proposed improved fuzzy clustering algorithm of this work. The motivation and the underlying background of the new fuzzy clustering algorithm are presented in the next section.
3.3 Improved Fuzzy Clustering Algorithm (IFC)

3.3.1 Motivation

Since multi-input, single-output (MISO) models are the interest of this work, a clustering algorithm is designed to approximate local models that explain the relationship between inputs and outputs. Most of the time, earlier clustering algorithms are designed to find local groupings of data vectors while leaving out any linear or non-linear relationships between them. However, data objects could be grouped by considering not only the local models of the input-output data (clusters) but also the relationships between them. Here a new clustering algorithm is proposed to find local fuzzy partitions in order to estimate local fuzzy models with fuzzy functions [Celikyilmaz and Turksen, 2007b], based on combined clustering structures such as (3.21). The "Fuzzy Functions" approach was initially proposed by Turksen in 2005 [Turksen, 2008]; the first implementation was published in [Celikyilmaz, 2005]. Later, Turksen and Celikyilmaz [Turksen and Celikyilmaz, 2006] published the idea as an alternate reasoning method to Fuzzy Rule Base structures. In simpler terms, instead of using fuzzy rule bases, each rule is converted into a "Fuzzy Function", ŷi = fi(Φ(μi, x)), i=1,…,c, so that it is represented with a linear or a non-linear function.
Feature space Φ(⋅) is similar to a mapping of the original input space, x, onto a user-defined higher-dimensional space. The feature space dimension is determined by the user, and the space is formed by the original input variables, the membership values, μi, and/or their mathematical transformations, for instance power transformations of the membership values. In a sense, membership values are used as additional dimensions in approximating hyper-surfaces for each cluster identified by any type of fuzzy clustering algorithm. In the following chapters, we will analyze fuzzy system modeling with the "Fuzzy Functions" approach in more detail. Here, only a brief introduction is given in order to explain the motivation of the proposed improved fuzzy clustering algorithm. Before introducing the details of the new clustering algorithm, its motivation is discussed in connection with earlier improved and combined fuzzy clustering algorithms such as the combined clustering methods of [Höppner and Klawonn, 2003]. Foundations of the Improved Fuzzy Clustering (IFC) approach of this work were introduced by Höppner & Klawonn [2000, 2003], Chen et al. [1998] and Menard [2001]. The structures of these earlier fuzzy clustering algorithms and of the IFC algorithm are similar in the sense that they share a similar objective function, which combines standard fuzzy clustering and fuzzy c-regression methods. However, there are many structural differences between the IFC of this work and the earlier versions of FCM variations, viz. FCRM or the improved clustering algorithms described in [Höppner & Klawonn, 2003; Chen et al., 1998; Menard, 2001]. These differences can be listed as follows (the earlier clustering types will be denoted as "earlier variations"):
• Earlier variations use polynomials to estimate the parameters of the estimated functions. The novel IFC can use any type of function, including simple functions such as linear regression, or non-linear functions such as support vector regression [Gunn, 1998].
• Earlier versions and the IFC have different representations of the membership values during structure and parameter identification. Earlier FCRM algorithms use membership values as weights in weighted regression algorithms, whereas the IFC method uses membership values as new predictors of the functions, in the sense of [Turksen, 2008], of each substructure (pattern) in the data. This is the unique property of the "Fuzzy Function" methodologies proposed by Turksen in 2005 [2008].
• An earlier combined fuzzy clustering algorithm [Höppner and Klawonn, 2003], as reviewed above, uses a slightly different version of the cluster prototype function, as in (3.23), to update the estimated parameters of the calculated functions, whereas the new IFC updates the coefficients of the Fuzzy Functions using any type of function estimation method, e.g., multiple linear regression, support vector regression, or even neural networks.
• Membership values of data points calculated with the earlier membership value calculation equations are high, i.e., close to 1, when the corresponding data objects can explain the local input-output relationships; similarly, they are low when the corresponding data vectors are outliers of that particular local structure. In the IFC, in contrast, membership values and their transformations are considered additional candidate input variables, not just indicators of data points. The structure of the IFC algorithm forces the membership values to be better predictors of the local models.
• Earlier combined fuzzy clustering versions build Takagi and Sugeno [1985] based fuzzy system models, whereas the IFC is proposed to be used in the novel Fuzzy Functions systems [Celikyilmaz and Turksen, 2007b-k], to be presented in Chapter 4.
• Earlier fuzzy clustering algorithms require the observed output variable, yk, to estimate the error of the function of each local structure (cluster) identified, see (3.16). Given some data points, the algorithm identifies in which part of the input space they fall and uses the local model fi(x) assigned to it to predict the output. The error between the observed output and the estimated output from each local model is used as an input to the membership value calculation equation in (3.18). In short, during training, one needs to know the output value of a particular data point before one can actually estimate its membership values. This could be problematic for real-life datasets where the output values are generally not known. The improved fuzzy clustering algorithm (IFC) also uses a similar structure, where the membership value calculation equations require the output as an input to calculate the membership values of each data vector in each of these structures. But, in this work, a new inference engine that can calculate an approximate output value using case-based methods is proposed in order to calculate the membership values of vectors with unknown output values. Earlier system modeling approaches that utilize the FCM variations stated above [Hathaway and Bezdek, 1993; Leski, 2004; Höppner and Klawonn, 2002] do not explain how the membership values of testing data vectors are calculated. In this sense, IFC clustering is a unique approach and can be applied to any type of data structure, even when there are missing output values in the dataset.

In this work, extending the distance function in (3.21), a new Improved Fuzzy Clustering (IFC) is proposed. It should be noted that the new IFC is implemented in Fuzzy Functions systems [Celikyilmaz and Turksen, 2007b], where membership values and their user-defined transformations are used as additional predictors in approximating the local input-output relationships. Earlier similar research [Höppner and Klawonn, 2003; Chen et al., 1998] uses FRB structures, e.g., T-S models, for system modeling, where membership values are used as weights of each local model. It should be emphasized again, for the sake of the clarity and novelty of the IFC, that the earlier variations of improved clustering methods, e.g., [Höppner and Klawonn, 2003; Chen et al., 1998], approximate a polynomial function for each local partition using the original input variables to estimate the output variable, and append this error term as a second term in the distance function of the clustering, as shown in equation (3.21). The IFC clustering algorithm, in contrast, is designed to find membership values to be used to model the system behavior. At each step of the
IFC optimization, a special regression function is estimated for each cluster. In approximating these functions, only the membership values and their user-defined transformations are used as input variables. These regression functions are called "Interim Fuzzy Functions" and are denoted hi(τi), where τi is the matrix of the ith cluster, consisting of the membership values and their user-defined transformations. One could use a simple model to estimate the parameters of these functions, such as least squares regression (LSE), or build non-linear models such as support vector regression [Gunn, 1998] or neural networks [Kosko, 1992]. The residual of the interim fuzzy functions, i.e., (yk − [ŷk = hik(τi)])², is used as additional similarity information. Hence, the residuals are added to the distance function as a second term and, consequently, appear as an additional term in the objective function of the IFC algorithm. Since objective-function-based fuzzy clustering algorithms include a distance function as a similarity measure, the distance function should affect the behavior of the membership value calculation equation. In [Höppner and Klawonn, 2003], such an effect is not reflected in the membership value calculation equation (3.20); instead, the membership value calculation equation serves their particular purpose of finding crisper membership values to eliminate harmonics. In this work, a new membership value calculation equation is introduced, which is structured as a result of the modified distance function. The earlier fuzzy c-regression models and combined fuzzy clustering algorithms, on the other hand, use the membership values of each cluster as weights in local regression models. Since the unique aim of the IFC algorithm is to find membership values that would be better predictor arguments of the Fuzzy Functions of each local partition (cluster), the interim fuzzy functions of the IFC algorithm use only the calculated membership values and their transformations as input variables. Therefore, in the new IFC clustering optimization approach, we do not include the original scalar inputs while shaping the membership values; we aim only at finding the best membership values that can explain the output. In short, the new IFC clustering introduces a new membership value calculation equation as a result of the addition of the second term to the distance function and, consequently, to the objective function of the IFC algorithm, whereas the earlier approaches use the original input variables to find local linear models. Consequently, in the training algorithm of the Fuzzy Functions approach, after the hidden structures are identified with the IFC, we approximate regression functions for each of the fuzzy partitions (clusters) identified by the IFC, using the membership values from the IFC, their transformations, and the original scalar input variables. We use linear regression methods, e.g., least squares estimation (LSE), as proposed by Turksen [2008], or non-linear regression methods, e.g., support vector machines for regression (SVR) [Gunn, 1998], as proposed by Celikyilmaz [2005]. We call these local regression functions the "Fuzzy Functions". It is hypothesized that the modeling error of the proposed Fuzzy Function methods will be lower when the new IFC of this work is used instead of the standard FCM [Bezdek, 1981a]. It should be emphasized that in the earlier fuzzy system modeling algorithms, e.g., Zadeh's fuzzy rule base structures [1975a], it is assumed that expert
information is available and the linguistic descriptions (the degrees of membership) are determined subjectively. This may cause some discrepancies, since expert knowledge is subjective. Later, various fuzzy models were developed that apply clustering algorithms to input-output data to find the membership values. Among these studies are [Sugeno and Yasukawa, 1993; Delgado et al., 1997; Emami and Turksen, 1998] and others. In [Delgado et al., 1997], different approaches are presented for the identification of fuzzy models, including fuzzy clustering of the input-output (XY) domain, viz. the Z={XY} input data space. They use FCM clustering to generate c fuzzy clusters in the Z domain, i.e., Z = [X,Y], with centers denoted υiZ=(υiX, υiY), i=1,…,c. Hence, the membership value calculation equation of the fuzzy relation associated with the ith cluster is defined as:
$$\mu_{ik}^{(t)}=\left[\sum_{j=1}^{c}\left(\frac{d\big(z_k,\upsilon_i^{Z}\big)}{d\big(z_k,\upsilon_j^{Z}\big)}\right)^{\frac{2}{m-1}}\right]^{-1},\quad m>1\qquad(3.24)$$
Hence, Delgado et al. [1997] identify local relationships between inputs and outputs around the centroids, which are captured by the above fuzzy relation on the XY domain, and then define membership value calculation equations for the individual domains, e.g., X and Y. Identifying the fuzzy cluster structure as two separate components of inputs and outputs does not imply that the membership value calculation equations are projections of the clusters; rather, they are induced in the X and Y spaces by the fuzzy clusters. In this way, c fuzzy sets are obtained in the X and Y spaces, denoted by υiX and υiY, with membership value calculation equations defined as:
$$\mu_{ik}^{X}=\left[\sum_{j=1}^{c}\left(\frac{d\big(x_k,\upsilon_i^{X}\big)}{d\big(x_k,\upsilon_j^{X}\big)}\right)^{\frac{2}{m-1}}\right]^{-1},\quad \mu_{ik}^{Y}=\left[\sum_{j=1}^{c}\left(\frac{d\big(y_k,\upsilon_i^{Y}\big)}{d\big(y_k,\upsilon_j^{Y}\big)}\right)^{\frac{2}{m-1}}\right]^{-1},\quad m>1\qquad(3.25)$$
The assumption here is that the input variables are not independent. Instead of defining separate membership functions for each individual input variable, one multi-dimensional interactive membership function is formulated to represent the whole antecedent part of a fuzzy rule, as defined in [Delgado et al., 1997], by joint (interactive) membership values. This means that the non-interactivity⁶ assumption between the components of the input space, made by earlier fuzzy rule base approaches, is not made here. Some examples of applications of clustering the input-output domain as well as the output domain can be found in [Emami and Turksen, 1998; Kemal, 2002; Uncu, 2003]. The new clustering algorithm is also based on input-output clustering of the given system domain and likewise assumes interactivity between the input variables.

⁵ The superscript t in (3.24) indicates any iteration.
⁶ The non-interactivity assumption indicates that the antecedent-part input variables are assumed to be independent from each other, and hence there would be no interaction whatsoever between their antecedent fuzzy sets.
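As a small numerical illustration of (3.24) and (3.25), the sketch below (Python/numpy; the data and cluster centers are random placeholders, not from the source) computes FCM memberships on the joint Z=[X,Y] domain and the memberships induced in the X and Y domains:

```python
import numpy as np

def fcm_memberships(points, centers, m=2.0):
    """Generic FCM membership equation: mu_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)  # c x n
    d = np.maximum(d, 1e-12)                     # guard against zero distances
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))
    return 1.0 / ratio.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.random((50, 2))                 # inputs
Y = rng.random((50, 1))                 # outputs
Z = np.hstack([X, Y])                   # joint Z = [X, Y] domain
Vz = Z[:5]                              # placeholder centers v_i^Z = (v_i^X, v_i^Y)

U_joint = fcm_memberships(Z, Vz)        # (3.24): memberships on the XY domain
U_x = fcm_memberships(X, Vz[:, :2])     # (3.25): memberships induced in X
U_y = fcm_memberships(Y, Vz[:, 2:])     # (3.25): memberships induced in Y
```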
The IFC algorithm finds membership values that are candidate inputs for each local fuzzy model. These fuzzy models are regression functions, where the output variable, y∈ℜ, has a continuous domain. In this work, we also extend the IFC method to classification domains (IFC-C), where each object in the dataset belongs to one of the predefined classes. In Chapter 4, a new fuzzy classifier design is introduced, which uses IFC-C clustering in fuzzy classifier models to solve classification problems. There, the error, i.e., (yk − ŷik) = (yk − h(τi)), is replaced with the error between the actual binary output and the posterior probability, p̂ik, of each object calculated by a chosen classification function, i.e., (yk − p̂ik). Next, the framework of the new improved fuzzy clustering (IFC) algorithm is presented for regression and classification problem domains.
3.3.2 Improved Fuzzy Clustering Algorithm for Regression Models (IFC)

The standard Fuzzy c-Means (FCM) algorithm [Bezdek, 1981a] is used in fuzzy system models to find the membership values, which are assumed to represent the optimum partitions of the given dataset. In Fuzzy Functions approaches these membership values are used as additional input variables to predict the parameters of the regression models of each cluster. In this work, we propose a new fuzzy clustering method by modifying the standard FCM algorithm. The new IFC is proposed to find membership values that can improve the performance of the local models represented with fuzzy functions. The optimization approach of the new improved fuzzy clustering algorithm not only searches for the best partition of the data, but also aims at increasing the predictive power of the membership values in modeling the output variable with Fuzzy Functions. Hence, we call the new fuzzy clustering algorithm "Improved Fuzzy Clustering (IFC)" [Celikyilmaz and Turksen, 2007b]. For the given multi-input, single-output system, let the data matrix be represented as xy={(x1,y1),…,(xn,yn)}, where xk=(xk,1,…,xk,nv) is the nv-dimensional kth data vector, k=1,…,n, n is the total number of data vectors, and each yk is the corresponding output value. First, we introduce a new objective function, which serves two purposes: (i) to find a good representation of the partition matrix; (ii) to find membership values that minimize the error of the Fuzzy Function models. To optimize the membership values, in the new IFC, we append the error obtained from the regression functions to the objective function as follows:
$$J_m^{IFC}=\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m d_{ik}^2+\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m\big(y_k-h_i(\tau_{ik},\hat w_i)\big)^2\qquad(3.26)$$

In (3.26), μik^imp is the improved membership value of the kth input vector in the ith cluster, i=1,…,c, and m is the degree of fuzziness parameter, which determines the overlapping of the clusters. The objective function to be minimized, JmIFC in (3.26), comprises two separate terms. The first term is the same as in the standard FCM
algorithm. This term controls the proximity of each input-output data vector to its cluster, d² = ||xkyk − υi(xy)||², and vanishes when each data sample coincides with the cluster center. The second term is the total squared error of the Interim Fuzzy Functions hi(τi,ŵi) of each cluster, where τi is called the interim matrix and ŵi is the coefficient vector of cluster i. This term measures the squared error of the approximated user-defined functions built during the optimization of the IFC algorithm, and thereby indicates which membership value transformations should be included in the Fuzzy Functions. The only input variables are the membership values and/or their possible transformations; the original scalar inputs are omitted. Therefore, the input matrix, τi(μi^imp), which is used to estimate these Interim Fuzzy Functions, hi(τi,ŵi), i=1,…,c, utilizes the membership values from the previous iteration step. Let the input matrix of the ith cluster be composed of two-dimensional input vectors of the membership values and their log-odds transformations, τi=[μi^imp log((1−μi^imp)/μi^imp)]. The set of planes in ℜ² of each ith cluster is then defined as hi=ŵ0i+ŵ1iμi^imp+ŵ2i log((1−μi^imp)/μi^imp), or hi=τiᵀŵi, where ŵiᵀ=[ŵ0i ŵ1i ŵ2i] are the coefficients of the Interim Fuzzy Functions. A particular set of fuzzy functions in ℜ² is defined by
$$\hat y_i=h_i(\tau_i,\hat w_i)=\hat w_{0i}+\hat w_{1i}\,\mu_i^{imp}+\hat w_{2i}\log\!\left(\frac{1-\mu_i^{imp}}{\mu_i^{imp}}\right)=\hat w_{0i}+\sum_{j=1}^{2}\hat w_{ji}\,\tau_{ji}\qquad(3.27)$$
ŷi is the estimated output value at the tth iteration in the ith cluster; τi is the input matrix created from the improved membership values, μi^imp, of the ith cluster at the tth iteration, along with their log-odds transformations (the original scalar input variables, x, are not included); and the ŵi's are the parameters of the functions estimated with a linear regression method, e.g., least squares regression. The log-odds are a user-defined transformation of the membership values, commonly used in fuzzy functions because the distribution of the membership values in fuzzy clustering algorithms is mostly Gaussian (bell-shaped). The analysis of the distribution of the membership values is discussed in the next section. It should be emphasized that in (3.27) the only inputs are the membership values and their user-defined transformations. We want to improve the predictive power of the membership values to be used in system modeling as additional inputs. Hence, each tth iteration of the IFC optimization tries to minimize the error between the actual and estimated output values. By excluding the original input variables, we are able to measure and improve the individual effect of the membership values on model performance. The distance function of the IFC algorithm is denoted by
$$\big(d_{ik}^{IFC}\big)^2=\left\|z_k-\upsilon_i(z)\right\|^2+\big(y_k-h_i(\tau_{ik},\hat w_i)\big)^2\qquad(3.28)$$
where zk∈Z={xk,yk}∈XY⊆ℜ^(nv+1) denotes the input-output vector and υi(z) is the ith cluster center. For each cluster, one interim fuzzy function, hi(τi,ŵi), i=1,…,c, is approximated using the membership values from the previous iteration as input variables. The second term vanishes when the estimated interim fuzzy function of each cluster, hi(τi,ŵi), can perfectly explain the observed output variable. The trade-off between the proximity term and the squared error makes it possible to define the optimal IFC model.
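A minimal sketch of these two ingredients follows (Python/numpy; illustrative names, with least squares standing in for the user's choice of estimator): the interim matrix of membership values and their log-odds, the fitted interim fuzzy function of (3.27), and the combined distance of (3.28) for one cluster.

```python
import numpy as np

def interim_matrix(mu):
    """tau_i of (3.27): intercept, memberships, and their log-odds transform."""
    mu = np.clip(mu, 1e-6, 1 - 1e-6)       # keep the log-odds finite
    return np.column_stack([np.ones_like(mu), mu, np.log((1 - mu) / mu)])

def fit_interim_function(mu_i, y):
    """Least-squares estimate of w_i in h_i(tau_i, w_i) = tau_i @ w_i."""
    w_i, *_ = np.linalg.lstsq(interim_matrix(mu_i), y, rcond=None)
    return w_i

def ifc_distance(Z, v_i, y, mu_i, w_i):
    """(3.28): squared distance to the prototype plus the squared residual."""
    return np.sum((Z - v_i)**2, axis=1) + (y - interim_matrix(mu_i) @ w_i)**2
```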
The minimizer of the new objective function, JmIFC, can be found by taking the dual of the model, using the Lagrange transformation of the objective function and converting it into a maximization problem. This is done by introducing a Lagrange multiplier, λ, for the constraint ∑ci=1 μik^imp = 1. We re-write the objective function as follows:
$$L=\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m d_{ik}^2+\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m\big(y_k-h_i(\tau_{ik},\hat w_i)\big)^2-\lambda\left(\sum_{i=1}^{c}\mu_{ik}^{imp}-1\right)\qquad(3.29)$$
The solution of this convex model is found by setting the derivatives of the Lagrange function with respect to the unknown parameters, i.e., the improved membership values, μik^imp, and the cluster centers, υi, to zero. The input variables of the regression models, hi(τi), used to approximate the output variable are the membership values from the (t−1)th iteration. Therefore, the second term, SEik^(t−1)=(yk−hi(τik^(t−1)))², is a known quantity at the tth iteration. Hence, by differentiating the objective function in (3.29) with respect to the cluster centers and the membership values, the optimum membership values of the proposed clustering algorithm are formulated as:
$$\big(\mu_{ik}^{imp}\big)^{(t)}=\left[\sum_{j=1}^{c}\left(\frac{\big(d_{ik}^{(t-1)}\big)^2+\big(y_k-h_i(\tau_{ik}^{(t-1)},\hat w_i)\big)^2}{\big(d_{jk}^{(t-1)}\big)^2+\big(y_k-h_j(\tau_{jk}^{(t-1)},\hat w_j)\big)^2}\right)^{1/(m-1)}\right]^{-1},\quad 1\le i\le c,\ 1\le k\le n\qquad(3.30)$$
Since the second term of the objective function JmIFC in (3.26) does not include the cluster center term, it vanishes when the derivative of the objective function is taken with respect to the cluster center, υi(xy). The objective function then reduces to that of the standard FCM algorithm. Therefore, the cluster center function of standard FCM remains unchanged for the IFC and is given by
$$\upsilon_i^{(t)}=\frac{\sum_{k=1}^{n}\Big(\big(\mu_{ik}^{imp}\big)^{(t)}\Big)^m x_k}{\sum_{k=1}^{n}\Big(\big(\mu_{ik}^{imp}\big)^{(t)}\Big)^m},\quad\forall\,1\le i\le c\qquad(3.31)$$
The extraction of the membership value calculation equation (3.30) using the Lagrange transformation is given in Appendix B.2. The membership value calculation equation (3.30) uses the cluster centers from the previous iteration, (t−1), and the cluster center equation (3.31) uses the membership values from the current iteration. Hence, similar to the FCM algorithm, an iterative algorithm is used to find the optimum membership values of the IFC. The algorithm terminates according to some termination criterion, e.g., when the total number of iterations exceeds a user-defined maximum, or when the separation between two consecutive objective function values falls below a threshold. The generalized framework of the IFC algorithm is shown in ALGORITHM 3.4.
Let the fuzzy partition of each iteration t, t=1,…,max-iter, be denoted by:

$$U^{(t)}=\begin{pmatrix}\big(\mu_{1,1}^{imp}\big)^{(t)}&\cdots&\big(\mu_{c,1}^{imp}\big)^{(t)}\\ \vdots&\ddots&\vdots\\ \big(\mu_{1,n}^{imp}\big)^{(t)}&\cdots&\big(\mu_{c,n}^{imp}\big)^{(t)}\end{pmatrix}$$
ALGORITHM 3.4 Optimization with the Improved Fuzzy Clustering Algorithm (IFC) for Regression Models
Given the training dataset, Z={(x1,y1),…,(xn,yn)}. Set m>1.1, c>1, a termination constant, ε>0, and a maximum number of iterations (max-iter); specify the structure of the regression models, such as in (3.27), for each cluster i, i=1,…,c, k=1,…,n, to create the interim input matrix, τi. Using the FCM clustering algorithm, initialize the partition matrix, U0. Then, for each iteration, t=1,…,max-iter:
(1) Populate c input matrices, τi^(t−1), one for each cluster, using the membership values (U^(t−1)) from the (t−1)th iteration and their selected user-defined transformations.
(2) Approximate c interim fuzzy functions, such as in (3.27), hi(τik^(t−1), ŵi).
(3) Update the membership values for iteration t using (3.30).
(4) Calculate the cluster centers for iteration t using (3.31).
(5) If |obj(t) − obj(t−1)| < ε, terminate; otherwise return to step (1).
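Putting the steps together, the following compact Python sketch (numpy only; names and the random initial partition are ours, whereas the text initializes with FCM) renders ALGORITHM 3.4 with least-squares interim fuzzy functions:

```python
import numpy as np

def _interim(mu):
    """tau = [1, mu, log((1-mu)/mu)] as in (3.27)."""
    mu = np.clip(mu, 1e-6, 1 - 1e-6)
    return np.column_stack([np.ones_like(mu), mu, np.log((1 - mu) / mu)])

def ifc(X, y, c=3, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.column_stack([X, y])                      # cluster in the XY domain
    n = len(Z)
    U = rng.dirichlet(np.ones(c), size=n).T          # random initial partition
    prev_obj = np.inf
    for _ in range(max_iter):
        V = (U**m @ Z) / (U**m).sum(axis=1, keepdims=True)   # prototypes (3.31)
        # steps (1)-(2): fit one interim fuzzy function per cluster by LSE
        W = [np.linalg.lstsq(_interim(U[i]), y, rcond=None)[0] for i in range(c)]
        # combined IFC distance (3.28): prototype distance + squared residual
        d2 = np.array([np.sum((Z - V[i])**2, axis=1)
                       + (y - _interim(U[i]) @ W[i])**2 for i in range(c)])
        obj = np.sum(U**m * d2)                      # objective value (3.26)
        # step (3): membership update (3.30)
        ratio = (d2[:, None, :] / np.maximum(d2[None, :, :], 1e-12))**(1/(m - 1))
        U = 1.0 / ratio.sum(axis=1)
        if abs(prev_obj - obj) < eps:                # step (5)
            break
        prev_obj = obj
    return U, V, W
```

A call such as `U, V, W = ifc(X, y, c=5, m=2.0)` would return the improved partition, the prototypes, and the interim function coefficients under these assumptions.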
The IFC optimization starts with an initial partition matrix, U0, and initial cluster centers, υ0. One may use a crisp clustering method such as k-means, or a fuzzy clustering method such as fuzzy k-means or FCM clustering, to find the initial partition matrix, U0, and cluster centers, υ0; one should take into account the trade-off between choosing a fuzzy clustering method such as FCM and a discrete clustering method like k-means. U0 and υ0 are required inputs at the start of the IFC, since the new membership value calculation equation (3.30) requires the error terms of the regression models, (yk − hi(τi,ŵi))². The model output ŷik = hi(τi,ŵi) of each cluster i is estimated using only the membership values and their mathematical transformations as input variables, as in (3.27). The IFC optimization method searches for the optimum membership values, which are later used as additional predictors to estimate the parameters of the Fuzzy Functions of the given system model. In step 2 of ALGORITHM 3.4, the membership values U^(t−1) calculated at the (t−1)th iteration are used as input variables at the tth iteration to find the parameters of the regression functions, ŵi, of each cluster. Any function approximator can be used to identify the parameters. In the experiments we use least squares estimation (LSE) to identify linear functions and support vector regression (SVR) [Gunn, 1998] to approximate non-linear functions [Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007b]. Since the aim of the IFC is to find membership values which are good predictors of the output, the only
input variables used when finding the interim fuzzy functions of each cluster are the membership values and their user-defined transformations. The optimization algorithm tries to minimize the error of the regression models, hi(τik^(t−1),ŵi). One should choose appropriate membership value transformations to approximate the output. The effect of the improved membership values, as predictors of the Fuzzy Functions, is investigated in the following sub-section, entitled justification of the membership values of the IFC algorithm. The outputs of the IFC algorithm for given m and c values are:
• the parameters of the interim fuzzy functions of each cluster, ŵi, i=1,…,c, captured from the last iteration step,
• the improved membership value matrix, U(z), and the cluster centers, υ(z),
• the interim input matrix structure, τi, composed of the membership values and their transformations for each cluster i.
The fuzzy function systems utilize the membership values from the IFC as additional predictors. Hence, the new IFC algorithm is methodically designed to improve the performance of the membership values as additional predictors in Fuzzy Functions systems.
3.3.3 Improved Fuzzy Clustering Algorithm for Classification Models (IFC-C)

In this section, an extension of the improved fuzzy clustering (IFC) algorithm to classification models is presented. In the previous section the novel IFC algorithm was introduced for fuzzy system models with Fuzzy Functions that solve regression problems. To adapt the IFC algorithm to classification problems (IFC-C), one needs to change the way the decision function, namely the Fuzzy Function parameters, is calculated. It should be noted that for regression problems the Interim Fuzzy Functions, h(τi^(t),ŵi), of each cluster (at each step of the IFC iteration) try to model a continuous output variable. For classification problems, however, the output variable is dichotomous, e.g., y∈{0,1}, y∈{−1,+1}, or an ordinal/discrete variable, y∈{0,1,2,…}. One needs to implement a classifier function, e.g., logistic regression [Allison, 2001], support vector machines for classification [Gunn, 1998], or neural networks for classification [Kosko, 1992], in order to assign class labels to each data point. In this work we deal only with classification problems where the output variable is dichotomous, e.g., y∈{0,1}, y∈{−1,+1}. The choice of classification method is usually a trade-off between the complexity (non-linearity) and the generalization capacity of the classifier. The second term of the IFC objective function in (3.26) is the squared error between the output of an interim fuzzy function and the actual output of each data point. Hence, one first estimates the output values, ŷik, of each data point k
in cluster i using the interim fuzzy functions of the ith cluster. Then the error between ŷik and the observed output values, i.e., SEik=(yk−ŷik)², is measured. For classification functions, the output is a binary variable which only takes the values 0 or 1. One uses a classification method, e.g., logistic regression (LR), support vector machines for classification (SVC) or neural networks for classification (NNC), to calculate a decision function, h(τi), so that the sign of the decision function, sign(hk(τik)), is the predicted output of object k in cluster i. Instead of predicting the label, sign(hk(τik)), which is a binary outcome, one can calculate the posterior probability of the output, p̂ik(yk=1|τik)∈[0,1], of each data point in each cluster, i.e., the probability that the output is (yk=1). Hence, the estimated probability indicates how likely it is that the output of the given observation is 1. It is a scalar quantity, and one can form a performance measure between the actual normalized output values of each object and the estimated posterior probability, such as (yk − p̂ik)². Then, we append the error of these classification functions to the objective function of the standard FCM clustering to formulate the new hybrid structure of IFC-C as follows:
$$J_m^{IFC\text{-}C}=\underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m d_{ik}^2}_{\text{FCM}}+\underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\big(\mu_{ik}^{imp}\big)^m\big(y_k-\hat p_{ik}(y_k=1\,|\,\tau_{ik},\hat w_i)\big)^2}_{\text{SE of Fuzzy Classifier Function}}\qquad(3.32)$$
The objective function to be minimized in (3.32), JmIFC-C, also has a dual structure. The first term is the same as the objective function of the standard FCM clustering. The second term is the total squared error of the "Interim Fuzzy Classification Function", h(τik,ŵi), of each cluster i, where τi is the corresponding cluster's input dataset, i=1,…,c, k=1,…,n, and ŵi are the interim fuzzy classification function parameters. This term measures the squared deviation of the actual class labels, e.g., yk∈{0,1}, from the estimated posterior probabilities, p̂ik(yk=1|h(τi,ŵi)). For any cluster i, the second term takes the form:
$$SE_i=\left\|\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}-\begin{pmatrix}\hat p_{i1}(y_1=1)\\ \vdots\\ \hat p_{i,n}(y_n=1)\end{pmatrix}\right\|^2$$
The IFC-C algorithm tries to minimize the error of the classifier functions, shaping the membership values so that the estimated posterior probability p̂ik(yk=1) of an object whose actual label is 1 is pushed close to 1. Thus, during each iteration step, the membership values are reshaped so that they predict the correct class labels through the Interim Fuzzy Classifier Functions while also representing the optimum fuzzy partition of the given dataset. By excluding the original input variables from the function estimation, we are able to measure and improve the individual effect of the membership values on the performance of each classifier model. If the estimated function can separate the two classes with the given inputs, the error will be smaller and the algorithm will converge faster. Since we want to find membership values which can improve the classification accuracy, during IFC-C we
use only the corresponding cluster's membership values and their transformations as input variables to estimate the classifier functions, h(τi,ŵi). The effect of finding better classifiers by shaping the membership values can be explained with an example. During the tth iteration of the IFC-C optimization, let τik and τik′ represent two vectors, k and k′, randomly selected from the dataset τi, whose actual output values are y(τik)=1 and y(τik′)=0, respectively. Suppose an Interim Fuzzy Classifier Function, h(τi), estimates a posterior probability of p̂ik=0.90 for τik. The squared error for this kth vector would be SEik=(1−0.90)²=0.01. This indicates that the interim fuzzy classifier function, h(τik), predicts that the output value of the kth input vector is 1 with 90% probability, which is quite accurate, so its effect on the objective function will be quite small. Since we are trying to find the global minimum of the objective function, this means fast convergence. Similarly, for the second vector k′, let the fuzzy classifier predict a probability of p̂ik′=0.15, which indicates that it is very unlikely that the label of vector k′ is 1. The SE would be (0−0.15)²=0.0225. In conclusion, the better the classifier functions separate the two classes, the smaller the objective function JIFC-C gets and, as a result, the faster IFC-C converges. Depending on the non-linearity of the given dataset, one could use user-defined transformations of the membership values, such as the exponential transformation, e^(μik^imp), or a power transformation, (μik^imp)^p, p∈Z. One could use suitable statistical learning algorithms, e.g., logistic regression (LR), or a more effective soft computing approach such as support vector classification (SVC), to approximate the fuzzy classifiers. In the LR case, for instance, the posterior probability of each data point in cluster i is calculated by

$$\hat p_{ik}(y_k=1\,|\,\tau_{ik})=\left[1+\exp\!\left(-\begin{bmatrix}\hat w_{i,0}\\ \hat w_{i,1}\\ \vdots\\ \hat w_{i,nm}\end{bmatrix}^{T}\begin{bmatrix}1\\ \mu_{ik}^{imp}\\ \vdots\\ e^{(\mu_{ik}^{imp})}\end{bmatrix}\right)\right]^{-1}=\frac{1}{1+\exp(-w_i\cdot\tau_{ik})},\qquad\tau_{ik}\in\Re^{nm+1}\qquad(3.33)$$
where τik denotes the input vector, the constituents of which are the membership values and their user-defined transformations, and wi are the LR parameters to be approximated. If support vector classification (SVC) is used to estimate the fuzzy classifier function parameters, the output values of the data vectors are calculated by

$$h_i(\tau_i)=\sum_{k}\beta_{ik}\,y_k\,K(\tau_{ik},\tau_i^{s})+b_i,\quad i=1,\dots,c,\ k=1,\dots,n\qquad(3.34)$$
In (3.34), K(⋅) is the kernel function that is used to map the original interim data vectors, τi, onto a higher-dimensional space, either linearly or non-linearly. In this work, we analyzed two different kernel functions: the linear kernel, K(τik,τil)=τikᵀτil, where τik and τil are interim vectors of nm dimensions which hold the membership values and their user-defined transformations, and the probabilistic kernel, i.e., the radial basis function (RBF), K(τik,τil)=exp(−δ‖τik−τil‖²₂), δ>0. βik in (3.34) represents the Lagrange multipliers, one for each interim vector, which are introduced to solve the following SVC optimization problem for each local model i (cluster):
$$\begin{aligned}\max\ Q_i&=\sum_{k=1}^{n}\beta_{ik}-\tfrac12\sum_{k,l=1}^{n}\beta_{ik}\beta_{il}\,y_k y_l\,K(\tau_{ik},\tau_{il})\\ \text{s.t.}\ &\sum_{k=1}^{n}\beta_{ik}y_k=0,\quad 0\le\beta_{ik}\le C_{reg},\quad k,l=1,\dots,n,\ i=1,\dots,c\end{aligned}\qquad(3.35)$$
Creg is the regularization constant, which balances the complexity of the machine and the number of separable data vectors. Interim vectors with non-zero coefficients (Lagrange multipliers), βik>0, are called the "support vectors" of cluster i, τik^s. The fewer the support vectors, the better the generalization capacity of such models. A brief summary of support vector machines for classification (SVC) is given in Appendix C.2. The output value obtained from each SVC fuzzy function in (3.34) is a scalar value that one needs to transform into a posterior probability. We measure the posterior probabilities using the improved Platt's probability method [Platt, 2000; Lin et al., 2003], which is based on the approximation of the model output labels with a sigmoid function as follows:

$$\hat p_{ik}(y_k=1\,|\,h_i(\tau_i))=\big(1+\exp(a_1 h_i(\tau_i)+a_2)\big)^{-1}\qquad(3.36)$$
a1 and a2 are found by minimizing the negative log-likelihood of the training data:

$$\min_{a_1,a_2}\ -\sum_{k=1}^{n}\left(\frac{y_k+1}{2}\log(\hat p_{ik})+\left(1-\frac{y_k+1}{2}\right)\log(1-\hat p_{ik})\right)\qquad(3.37)$$
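A sketch of this calibration step follows (Python; scipy's general-purpose optimizer is our assumption, any unconstrained minimizer would do): a1 and a2 of (3.36) are fitted by minimizing the negative log-likelihood of (3.37).

```python
import numpy as np
from scipy.optimize import minimize

def platt_fit(h, y):
    """h: decision values h_i(tau); y: labels in {-1,+1}. Returns (a1, a2)."""
    t = (y + 1) / 2.0                                   # map {-1,+1} to {0,1}
    def nll(a):
        p = 1.0 / (1.0 + np.exp(a[0] * h + a[1]))       # sigmoid of (3.36)
        p = np.clip(p, 1e-12, 1 - 1e-12)                # avoid log(0)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))  # (3.37)
    return minimize(nll, x0=np.zeros(2), method="Nelder-Mead").x
```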
The output values obtained from the SVC interim fuzzy functions are thus not represented as class labels, i.e., yk∈{−1,+1}, but rather as estimated posterior probabilities. It should be noted that the fuzzy classifier functions estimated at each iteration, h(τi), are only used for estimating and shaping the membership values during the IFC-C optimization algorithm, which excludes the original input variables, x∈ℜ^nv. The membership value calculation equation used to update the membership values in each step is re-formulated for IFC-C according to the latter objectives as follows:

$$\mu_{ik}^{imp}=\left[\sum_{j=1}^{c}\left(\frac{d_{ik}^2(x_k,\upsilon_i)+\big(y_k-\hat p_{ik}\big)^2}{d_{jk}^2(x_k,\upsilon_j)+\big(y_k-\hat p_{jk}\big)^2}\right)^{1/(m-1)}\right]^{-1},\quad 1\le i\le c,\ 1\le k\le n\qquad(3.38)$$
The generalized steps of the Improved Fuzzy Clustering algorithm for classification models (IFC-C) are displayed as follows:
Step 1: Initialize the clustering parameters: m (fuzziness parameter), c (number of clusters); choose a termination threshold, ε>0.
Step 2: Initialize the prototypes, υi∈V, and the membership values, μik^imp∈U⊂ℜ^(n×c).
Repeat
  Update the membership values using (3.38),
  Update the prototypes using (3.31),
Until the change in the objective function (3.32) is less than ε.
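The sketch below (Python; scikit-learn's logistic regression is our stand-in for the interim fuzzy classifier, and the exponential transform is one of the user-defined options mentioned above) shows a single IFC-C update: per-cluster posteriors, the membership update (3.38), and the prototype update (3.31).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ifc_c_step(X, y01, U, V, m=2.0):
    """One IFC-C update; y01 holds binary labels in {0,1}, U is c x n."""
    c, n = U.shape
    p_hat = np.empty((c, n))
    for i in range(c):
        tau = np.column_stack([U[i], np.exp(U[i])])  # memberships + exp transform
        clf = LogisticRegression().fit(tau, y01)     # interim fuzzy classifier
        p_hat[i] = clf.predict_proba(tau)[:, 1]      # posterior P(y = 1 | tau)
    d2 = np.array([np.sum((X - V[i])**2, axis=1) for i in range(c)])
    num = d2 + (y01 - p_hat)**2                      # combined term of (3.38)
    ratio = (num[:, None, :] / np.maximum(num[None, :, :], 1e-12))**(1/(m - 1))
    U_new = 1.0 / ratio.sum(axis=1)
    V_new = (U_new**m @ X) / (U_new**m).sum(axis=1, keepdims=True)  # (3.31)
    return U_new, V_new, p_hat
```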
In Step 2 of the IFC-C algorithm, one may use FCM [Bezdek, 1981a] clustering or a crisp clustering method such as k-means to find the initial membership values and cluster prototypes. The IFC-C optimization method searches for the optimum membership values, which are to be used as additional predictors to estimate the parameters of the Fuzzy Classifier Functions of the given system. Thus, IFC-C is implemented in the structure identification and inference phases of the new Improved Fuzzy Functions method (to be presented in Chapter 4). The outputs of the IFC-C algorithm for any given set of {m,c} values are as follows:
(i) the optimal parameters, ŵi, i=1,…,c, of the fuzzy classifier function of each cluster, h(τi,ŵi), which is used to calculate the posterior probabilities, p̂i(y=1|h(τi,ŵi)), captured from the last iteration step of IFC-C;
(ii) the structure of τi, in other words the list of the membership value transformations that were used to approximate each h(τi,ŵi) in IFC-C.
Depending on the type of function approximation method used, the parameters ŵi represent different quantities. For instance, when support vector classification is used, the parameter list of each cluster i, ŵi, represents the Lagrange multipliers, βik, of each support vector of the cluster and the original support vectors themselves, τiS, as shown in (3.34). On the other hand, when LR is used, the parameter list is just the coefficient (column) vector of the membership values and their transformations that approximate the fuzzy classifier functions, as shown in (3.33). Next, we explain why the IFC can be a better method than standard FCM clustering for fuzzy function systems.
3.3.4 Justification of Membership Values of the IFC Algorithm

"Are the membership values obtained from the IFC algorithm better predictors than the membership values from the FCM clustering algorithm?"
The Improved Fuzzy Clustering (IFC) algorithm, used for regression and classification problems, is a modification of two well-known fuzzy clustering algorithms, i.e., FCM and FCRM. It searches for the optimum membership values, which are later used as additional predictors in the "Fuzzy Functions" systems [Turksen, 2007; Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007b,c].
During the structure identification of the novel fuzzy functions approach, the FCM clustering or IFC algorithm is used to partition the data into clusters and to assign one membership value to every datum for each of these clusters. For each cluster a separate dataset is formed, and using these datasets one Fuzzy Function, i.e., a regression function or a classification function, is identified per cluster. These membership values are then used as additional predictors in the Fuzzy Functions. The assumption is that the membership values obtained from the novel IFC, used as additional predictors of the Fuzzy Functions, can increase the performance of the models more than the membership values obtained from the FCM clustering algorithm. In this section, the strengths of the membership values from the FCM clustering and the IFC algorithm are compared using a simple statistical test. It should be noted that the best comparison is obtained when the two algorithms are applied to real-life or benchmark datasets and the test-case performances of the models are compared. Such comparisons are presented in the analysis of experiments chapter (Chapter 6). Here we demonstrate the performance of the FCM clustering and the IFC models using a small artificial dataset. Generally speaking, membership value calculation equations are used to obtain the membership values of input vectors or output singletons, indicating to what degree an object belongs to a cluster. Membership value calculation equations can be functions of the inputs, the outputs, or both. They can take very different shapes; some of the common shapes are triangular, Gaussian, trapezoidal, or singleton membership functions, as shown in Figure 3.2.
Fig. 3.2 Types of Membership value calculation equations
In our applications, we use fuzzy clustering methods, which are based on distance measures between each data point and each cluster center. Since the algorithm is based on the minimization of an objective function built on a distance
measure, the membership values, when plotted in a scatter diagram and after appropriate curve fitting, can be displayed as bell-shaped or s-shaped functions of the input variables or of the output variable for each cluster. It should be emphasized that, for pictorial representations, idealized (possibly curve-fitted) membership functions are shown to illustrate the membership values; in fact such membership values ought to be represented with a scatter diagram. During the IFC algorithm, on the other hand, the Interim Fuzzy Functions of each cluster, h(τi,ŵi), try to explain the relationship between the membership values (and their transformations) of a particular cluster and the output value, where the membership values become the input variables that explain the output variable in a linear or non-linear function. Therefore, the relationship between the membership values and their transformations and the output variable to be approximated should be the inverse of the known membership shapes, as explained in Table 3.2. In this table, possible examples of relationships between the membership values as input variables and an output variable are shown. To demonstrate the relationships, "bell-shaped" and "s-shaped" distributions are inverted to draw the membership value versus output variable graphs. When one wants to estimate a user-defined linear function for each cluster, one way to obtain the best approximations of the functions is to transform the membership values and then use them as input variables. We assume that the relationship between the membership values and their transformations and the output variable to be estimated is approximately the inverse of the known membership shapes, as shown in Table 3.2.

Table 3.2 Membership values as input variables in Fuzzy Function parameter estimations

Possible function                              | Membership value transformation
S-function-1: f(y) = 1/(1 + e^(−(a+by)))       | Inverse s-function-1: ŷ ≅ f̂⁻¹(u) ≅ −ln((1−u)/u)
S-function-2: f(y) = 1/(1 + e^(a+by))          | Inverse s-function-2: ŷ ≅ f̂⁻¹(u) ≅ ln((1−u)/u)
Power function: f(y) = ay^(−b)                 | Inverse power function: ŷ = f̂⁻¹(u) = (u/a)^(−1/b), a > 0, b > 0

(The third column of the original table plots each membership value u as an input variable against the output variable y.)
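The inverse transformations of Table 3.2 are straightforward to evaluate; a small sketch follows (Python/numpy; names are ours):

```python
import numpy as np

def inv_s1(u):           # inverse s-function-1 of Table 3.2
    return -np.log((1 - u) / u)

def inv_s2(u):           # inverse s-function-2 of Table 3.2
    return np.log((1 - u) / u)

def inv_power(u, a, b):  # inverse power function of Table 3.2, a > 0, b > 0
    return (u / a) ** (-1.0 / b)

u = np.linspace(0.05, 0.95, 5)   # sample membership values
print(inv_s1(u), inv_s2(u), inv_power(u, a=1.0, b=2.0))
```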
Here, we want to discuss and demonstrate the significance of the membership values obtained from the IFC algorithm, viz., the confidence with which the membership values, used as input variables, explain a portion of the unknown output variable when approximating it with a function fi: μi(x) → y. In the experiment presented next, we used statistical significance tests for each function of each cluster to justify the performance of a membership value calculation equation. An artificial dataset of 50 samples with a single input x and a single output y, as shown in Figure 3.3, is used in this experiment. We wanted to determine if there are local linear relationships between the response variable y and the independent variables, viz. the membership values of each data point in each cluster and their possible transformations.

Fig. 3.3 Scatter plot of the artificial dataset (axes: x-input, y-output)
When the standard deviations are not known, the best way is to use the F-statistic to test whether there is a significant relationship between the dependent and independent variables of a system model. The critical Fα,nm,n−2 value for the 1−α confidence level (α∈{1%, 5%, 10%}) is obtained from statistical tables. The null hypothesis, H0: F^clusteri ≤ F^critical, i=1,…,c, states that there is no significant relationship between the membership values and their transformations and the output variable of the specified cluster. Rejection of the null hypothesis, H0, in favor of H1: F^clusteri > F^critical states that at least one of the independent variables contributes significantly to the model (nm: number of variables). An alternative way to test the significance is to use the probability, p, obtained from the regression results. p is compared to the significance level, α: if p<α, the null hypothesis is rejected, and we conclude that the model has significant explanatory power; otherwise we fail to reject the null hypothesis. In this small experiment, we tested these F and p values. Standard FCM clustering and the IFC algorithm are applied to the artificial dataset using five clusters, c=5, and m=2.0. The membership values obtained from the FCM and IFC models are used as independent variables to identify one "Fuzzy Function" for each cluster. We used membership values
and their logistic transformations as independent variables, i.e., f(μi) = ŵ0,i + ŵ1,iμi + ŵ2,i[±ln((1−μi)/μi)], where the ŵij, j=1,…,nm, are the coefficients, to identify linear functions for the FCM clustering and IFC models: five functions, one for each cluster. The critical Fα,nm,n−2 value for the 1−α=99% confidence level (n−2=50−2=48 and nm=2), which is F^critical_0.01,2,48 = 3.20, is used to test the significance of each model. Figure 3.4 shows the five fitted regression functions (black surfaces) as linear hyper-surfaces, one for each cluster. The independent variables, on the x-axis and y-axis, are the membership values from the FCM clustering method and their logistic transformations, respectively; the dependent variable, the output y, is on the z-axis. Figure 3.5 is a similar graph, but this time the membership values are obtained from the IFC method. The white surfaces indicate the actual observed decision surfaces, and the black surfaces are the modeled hyper-surfaces using the linear functions, fi. The two surfaces, white and black, in each figure can be explained as follows. Actual Decision Surfaces (White Surfaces) in Figure 3.4 and Figure 3.5 are plotted using the membership values obtained from FCM and IFC, respectively, and the actual output variable. They represent the actual decision surfaces (the actual relationship between the membership values and the output variable). Since FCM and IFC have different membership value calculation equations, we expect the actual decision surfaces of FCM in Figure 3.4 to differ from those in Figure 3.5. It should be remembered that this difference between FCM clustering and IFC emerges from the fact that, during the IFC optimization, the shapes of the membership values are forced to explain the output variable, e.g., in this experiment using interim linear functions of the type f(μi) = ŵ0,i + ŵ1,iμi + ŵ2,i[±ln((1−μi)/μi)], ŵij, j=1,…,nm, i=1,…,c. Since we chose a linear function to represent them, we expect the white decision surfaces of IFC to be flatter (more linear) than the decision surfaces of the FCM clustering models. One can observe from Figure 3.4 and Figure 3.5 that the actual decision surfaces (white surfaces) of the IFC models are flatter than those of the FCM clustering models in most clusters, i.e., clusters 1, 2 and 5. We can also measure the linearity of the inputs and the outputs (actual decision surfaces) using correlation analysis. In Table 3.3, x1 represents the first input variable, that is, the membership values, μi, and x2 represents the second input variable, that is, the logistic transformation of the membership values, ±ln((1−μi)/μi).

Table 3.3 Correlation Analysis of FCM clustering and IFC membership values with the output variable

      | x1 vs. Output | x2 vs. Output
FCM   | 0.18          | 0.21
IFC   | 0.41          | 0.32
Each cell in the table represents the root-mean-square correlation of the corresponding model:

$$c^{FCM}(x_1,y)=\left(\frac1c\sum_{i=1}^{c}corr^{FCM}(x_{1,i},y)^2\right)^{1/2}=0.18,\quad x_{1,i}:\mu_i,\ y:\text{output variable}$$

$$c^{FCM}(x_2,y)=\left(\frac1c\sum_{i=1}^{c}corr^{FCM}(x_{2,i},y)^2\right)^{1/2}=0.21,\quad x_{2,i}:\ln\big((1-\mu_i)/\mu_i\big)$$

$$c^{IFC}(x_1,y)=\left(\frac1c\sum_{i=1}^{c}corr^{IFC}(x_{1,i},y)^2\right)^{1/2}=0.41,\quad x_{1,i}:\mu_i^{imp}$$

$$c^{IFC}(x_2,y)=\left(\frac1c\sum_{i=1}^{c}corr^{IFC}(x_{2,i},y)^2\right)^{1/2}=0.32,\quad x_{2,i}:\ln\big((1-\mu_i^{imp})/\mu_i^{imp}\big)$$

where μi^imp represents the improved membership values obtained from the IFC, and μi represents the membership values obtained from the FCM clustering. For instance, c^FCM(x1,y) represents the overall correlation between the membership values, x1,i=μi, and the output, y, over the five clusters obtained from the FCM clustering algorithm, and c^FCM(x2,y) is the overall correlation between the transformed membership values and the output, y, over the five clusters obtained from the FCM clustering algorithm. Similarly, c^IFC(x1,y) represents the overall correlation between the membership values and the output, y, over the five clusters obtained from the IFC algorithm, and c^IFC(x2,y), x2,i: ln((1−μi^imp)/μi^imp), is the overall correlation between the transformed membership values and the output, y, over the five clusters obtained from the IFC algorithm. It can be observed from Table 3.3 that the correlations between the output variable and the membership values and their transformations are higher for the IFC than for FCM clustering. So we expect that, even with simple methods such as LSE, we can model the output better with the IFC.

Model (Predicted) Decision Surfaces (Black Surfaces), as shown in Figure 3.4 and Figure 3.5, are plotted using the membership values obtained from the FCM and IFC, respectively, and the output variable predicted by each function. They represent the predicted (model) decision surfaces. The F-test and the probability test on the predicted functions are discussed below. In these five experimental trials, when the FCM clustering algorithm is used to find the membership values, only one out of five fuzzy functions (estimated decision surfaces) reveals an F-test value greater than the critical value, see Table 3.4. So, we fail to reject the null hypothesis, and we conclude that, for this data, the membership values from standard FCM clustering cannot explain the output variable.
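The root-mean-square correlations of Table 3.3 can be reproduced as follows (a Python/numpy sketch; mu is the c×n membership matrix of a model and y the output vector):

```python
import numpy as np

def rms_correlation(mu, y):
    """Square root of the mean, over clusters, of squared corr(mu_i, y)."""
    corrs = [np.corrcoef(mu_i, y)[0, 1] for mu_i in mu]
    return np.sqrt(np.mean(np.square(corrs)))
```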
Fig. 3.4 (White surface) Actual decision surface using membership values from the FCM clustering; (black linear surface) the estimated linear decision surface; 'u' indicates the membership values. Panels: FCM-cluster 1 (F=2.24, p=0.12), FCM-cluster 2 (F=0.59, p=0.56), FCM-cluster 3 (F=3.55, p=0.037), FCM-cluster 4 (F=1.41, p=0.25), FCM-cluster 5 (F=1.69, p=0.19); axes: u, log((1−u)/u), output y.
In the same experimental trial, on the other hand, when we analyze the membership values obtained from the IFC algorithm (Figure 3.5), in 3 out of 5 clusters the membership values explain the output variable much better, since the F-test values are greater than the critical value, see Table 3.4. The majority of the IFC models (3 out of 5) have passed the F-significance test. Therefore, we reject the null hypothesis and conclude that the IFC models can model the output variable better than the FCM clustering algorithm for this dataset.
Fig. 3.5 (White surface) Actual decision surface using membership values from the IFC; (black linear surface) the estimated linear decision surface; 'u' indicates the improved membership values. Panels: IFC-cluster 1 (F=16.68, p<0.0001), IFC-cluster 2 (F=4.61, p=0.01), IFC-cluster 3 (F=0.37, p=0.69), IFC-cluster 4 (F=2.39, p=0.1), IFC-cluster 5 (F=13.94, p<0.00001); axes: u, log((1−u)/u), output y.
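For reference, a sketch of the per-cluster significance test follows (Python, with scipy assumed for the F distribution; names are ours). It uses the textbook (nm, n−nm−1) degrees of freedom, while the chapter tabulates the critical value with (nm, n−2); degrees-of-freedom conventions vary slightly across texts.

```python
import numpy as np
from scipy import stats

def regression_f_test(tau, y):
    """Overall F-test of a fitted linear model; tau: n x nm regressor matrix."""
    n, nm = tau.shape
    A = np.column_stack([np.ones(n), tau])        # add an intercept
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ w
    ssr = np.sum((y_hat - y.mean())**2)           # explained sum of squares
    sse = np.sum((y - y_hat)**2)                  # residual sum of squares
    F = (ssr / nm) / (sse / (n - nm - 1))
    p = stats.f.sf(F, nm, n - nm - 1)             # upper-tail p-value
    return F, p
```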
The results from this artificial dataset show that the IFC models, in most cases (three out of five clusters), can find the optimum fuzzy partitions of the input space. Using the membership values from the IFC as predictors, one can in general approximate the output variable better than with models constructed from
membership values obtained from standard FCM clustering models. One should run further experiments using non-linear models, different membership value transformations, or different parameters for m and c before coming to a more definitive conclusion. Even though such an exhaustive search may take time, it may yield the optimum solution. Nonetheless, here we wanted to show that, even with simple regression models, the IFC models can be more powerful estimators than the FCM clustering results in function estimation problems. In this example, it is shown that the membership values cannot display a significant relationship with the output when FCM clustering is used to obtain them.

Table 3.4 Significance test results of fuzzy functions using membership values obtained from FCM clustering and from IFC

             FCM                     IFC
             F-value*   p-value      F-value    p-value
Cluster-1    2.24       0.12         16.68      0.0001
Cluster-2    0.59       0.56         4.61       0.01
Cluster-3    3.55       0.037        0.37       0.69
Cluster-4    1.41       0.25         2.39       0.1
Cluster-5    1.69       0.19         13.94      0.0001

* F_{0.05,2,48} = 3.20 is the critical value.
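The following sketch illustrates the significance test behind Table 3.4, assuming a least-squares fuzzy function per cluster with u and log((1−u)/u) as the two predictors; the two-predictor form is consistent with the F(2, 48) degrees of freedom in the table (51 samples per cluster is an assumption), and the helper name is illustrative:

```python
import numpy as np
from scipy import stats

def regression_f_test(u, y):
    """F-test for a least-squares fit y ~ [1, u, log((1-u)/u)]."""
    u = np.clip(u, 1e-12, 1 - 1e-12)
    X = np.column_stack([np.ones_like(u), u, np.log((1 - u) / u)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = len(y), X.shape[1] - 1          # p = 2 predictors
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    F = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    p_value = stats.f.sf(F, p, n - p - 1)
    return F, p_value

# For each cluster i: F, pv = regression_f_test(U[:, i], y)
# Compare F against the critical value, e.g. stats.f.ppf(0.95, 2, 48).
```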
3.4 Two New Cluster Validity Indices for IFC and IFC-C

Fuzzy clustering methods, including the Improved Fuzzy Clustering (IFC) algorithm [Celikyilmaz and Turksen, 2007b], assume that certain initialization parameters are known prior to model execution. This is usually problematic, since different parameters may produce different results, which can eventually affect system performance. The literature indicates that many cluster validity functions have been proposed to validate the assumed number of clusters, especially for the FCM clustering approach [Bezdek, 1981a]. Among the well-known validity functions, those proposed by [Fukuyama and Sugeno, 1989; Xie and Beni, 1991; Pal and Bezdek, 1995; Bezdek, 1976] are the most commonly used FCM clustering validation measures. In later years, many variations of these functions, e.g., [Bouguessa et al., 2006; Dave, 1996; Kim et al., 2003; Kim and Ramakrishna, 2005], were presented by modifying or extending these earlier validity functions.

These validity functions measure characteristics of a point-wise fuzzy clustering method, i.e., the FCM clustering algorithm. They are limited in determining the best clustering structure, but they can provide some information about the underlying structure of the membership values. The main characteristic of these validity functions is that they all use within-cluster distances, between-cluster distances, or both as a way of assessing the clustering schema. The value of the within-cluster distances is interpreted as the compactness, and the value of the between-cluster distances as the separability, of the clustering structure [Kim et al., 2003]. These validity functions, to be summarized in the
following sections, can be categorized into two groups: ratio type and summation type [Kim et al., 2003]. The type of a validity function is determined by the way it combines the within-cluster and between-cluster distances.

As stated earlier, most validity indices are designed to validate the FCM clustering algorithm [Bezdek, 1981a]. They use characteristics of the FCM to indicate the optimum number of clusters. In this sense, earlier validity indices designed for the FCM clustering method may not be suitable for other variations of fuzzy clustering algorithms that are designed for different purposes, e.g., the Fuzzy C-Regression (switching regression) algorithm (FCRM) [Hathaway and Bezdek, 1993]. For these variations, different validity measures have been introduced. For instance, in [Kung and Lin, 2004] a new validity index is proposed to identify the optimum number of clusters in FCRM applications. Their validity function is a modification of the Xie-Beni [1991] ratio-type validity function. It accounts for the similarity between regression models using the standard inner product of the unit normal vectors of the estimated regression equations of each cluster, instead of the distance between cluster centers.

Validity functions should be designed based on the objectives and structure of the corresponding fuzzy clustering methods. In this work, two new fuzzy clustering methods are proposed, i.e., IFC and IFC-C. Therefore, in this section we investigate two new cluster validity functions to determine the optimum number of clusters for models built with these new clustering algorithms. The validity functions, to be discussed below, are ratio-type indices, which measure the ratio between compactness and separability. Since the IFC and IFC-C algorithms are new types of clustering methods, which combine two different fuzzy clustering approaches, i.e., FCM clustering [Bezdek, 1981a] and FCRM [Hathaway and Bezdek, 1993], in a novel way and utilize "Fuzzy Functions", the new validity indices are designed to validate two different concepts in the following way. The compactness combines within-cluster distances with the errors between the actual and estimated outputs obtained from the c regression functions. The separability, on the other hand, determines the structure of the clusters by measuring the ratio of cluster center distances to the angle between their regression functions. If the functions of two different clusters happen to be parallel to each other, only the cluster center distances are used as a separability measure.

In the next section, well-known cluster validity indices that are closely related to the proposed validity measures are reviewed, and the new cluster validity measures are introduced. Then, using four different artificial datasets and two real-life benchmark datasets, the performance of the new validity function for IFC is compared to three other well-known cluster validity measures that are closely related to the new validity measure.
3.4.1 Overview of Well-Known Cluster Validity Indices

The literature contains numerous cluster validity indices designed for different clustering methods. This section presents validity indices designed for two different types of fuzzy clustering algorithms: point-wise clustering, e.g., FCM clustering [Bezdek, 1981a], and regression-type clustering, e.g., fuzzy c-regression model (FCRM) type algorithms [Hathaway and Bezdek,
1993]. The research indicates that most of the prominent cluster validity indices are designed to find the optimum number of clusters of FCM algorithms. In a more recent study, Kung and Lin [2004] proposed a new validity index for identifying the optimum number of clusters of the fuzzy c-regression model (FCRM) clustering algorithm. Since the new IFC algorithm [Celikyilmaz and Turksen, 2007b] combines regression and clustering concepts, we hypothesize that the new validity index should measure concepts from both types of cluster validity indices when indicating the optimum number of clusters. Next, we investigate both types of validity measures from the literature before presenting the new cluster validity indices.

Well-Known CVIs for Point-wise and Regression-Type Clustering Algorithms
Most cluster validity functions are structured by combining two different clustering concepts [Kim and Ramakrishna, 2005]:
• Compactness: measures the similarity between cluster elements within each cluster. Most validity indices use within-cluster distances to measure compactness.
• Separability: measures the dissimilarity between individual clusters. Most validity measures use between-cluster center distances to measure separability.
It has been shown that a clustering algorithm is effective when compactness is small and separability is large [Kim and Ramakrishna, 2005]. Based on the way these two concepts are combined, cluster validity indices can be categorized into two different types: ratio-type and summation-type. Ratio-type validity indices measure the ratio of compactness to separability, e.g., the Xie-Beni [1991] index. Summation-type validity indices combine the compactness and separability concepts by adding them in various ways, e.g., the Fukuyama-Sugeno [1989] index. Since the new validity index proposed in this work is a ratio-type validity measure, we give details of prominent ratio-type validity indices from the literature. A well-known ratio-type cluster validity index (compactness/separability) is the XB cluster validity index [Xie and Beni, 1991], which is formulated as:
$$XB(c) = \frac{\left(\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^{2}\,d^{2}(x_k,v_i)\right)/n}{\min_{i,j\neq i} d^{2}(v_i,v_j)}, \qquad d(x_k,v_i)=\|x_k-v_i\| \qquad (3.39)$$
where x_k ∈ ℜ^nv represents the kth input vector, k=1,…,n, and υ_i ∈ ℜ^nv, i,j=1,…,c, represents a cluster center as a vector of nv dimensions. XB decreases monotonically as c approaches the total number of data samples, n.
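A minimal NumPy sketch of (3.39), assuming U holds the memberships μ_ik row-wise per sample:

```python
import numpy as np

def xie_beni(X, V, U):
    """XB index (3.39). X: (n, nv) data, V: (c, nv) centers,
    U: (n, c) memberships mu_ik."""
    n, c = U.shape
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # ||x_k - v_i||^2
    compactness = (U ** 2 * d2).sum() / n                    # averaged over clusters
    vd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = vd2[~np.eye(c, dtype=bool)].min()           # min_{i != j}
    return compactness / separation
```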
Kim and Ramakrishna [2005] discuss the relationship and behavior of compactness and separability between clusters obtained from the FCM clustering method for changing numbers of clusters. In [Kim et al., 2003; Kim and Ramakrishna, 2005], the relationship between compactness and separability is demonstrated using graphs; Figure 3.6 is adapted from [Kim and Ramakrishna, 2005].
According to their generalization, compactness increases sharply as c decreases from c_optimal to c_optimal−1. This means that for c ≥ c_optimal the compactness will be small. Compactness becomes zero when each object is a cluster by itself, i.e., c=n. A sudden drop in compactness is therefore the indicator of c_optimal.
[Figure 3.6 plots compactness and separability against the number of clusters, with the knee at c-optimal.]

Fig. 3.6 Compactness and separability concepts of a ratio-type CVI index
From the analysis of FCM clustering results, one observes that each cluster may have a different compactness value, because some clusters are more or less dense than others. As the number of clusters is increased or decreased, the change in compactness of those clusters will differ (larger or smaller) from the rest of the clusters. In the XB validity index, the compactness of the overall clustering structure (the numerator in (3.39)) is determined by averaging the compactness of every cluster. However, averaging might suppress the effect of large changes in the compactness values of some clusters. These changes are usually caused by FCM clustering models with an undersized (or oversized) number of clusters. Therefore, to exploit these large compactness shifts when determining the optimum number of clusters, it is better to characterize the compactness of a model by the maximum compactness over the clusters. On the other hand, from Figure 3.6 it can be observed that the relative changes of compactness and separability are somewhat similar when c ≠ c_optimal. Therefore, their effects on ratio-type validity indices should also be similar, i.e., they should both show increasing/decreasing behavior at the same c values. Based on this discussion, an improved version of the XB validity index was proposed in [Kim and Ramakrishna, 2005]:
$$XB^{*}(c) = \frac{\max_{i=1,\dots,c}\left\{\sum_{k=1}^{n}\mu_{ik}^{2}\,\|x_k-\upsilon_i\|^{2}\,/\,n\right\}}{\min_{i,j\neq i}\|\upsilon_i-\upsilon_j\|^{2}} \qquad (3.40)$$
It was shown in [Kim and Ramakrishna, 2005] that the XB* index is more effective than the XB index, because with XB* one can detect clusters that have large compactness values. With this information one can determine the optimum number of clusters by observing these ambiguities in the clustering structure. Hence, XB* will be the starting point of our new cluster validity formula for the IFC algorithm.
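A minimal sketch of (3.40), differing from the XB sketch above only in taking the maximum per-cluster compactness instead of the average:

```python
import numpy as np

def xie_beni_star(X, V, U):
    """XB* index (3.40): maximum per-cluster compactness over
    minimum squared distance between centers."""
    n, c = U.shape
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    per_cluster = (U ** 2 * d2).sum(axis=0) / n              # one value per cluster
    vd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = vd2[~np.eye(c, dtype=bool)].min()
    return per_cluster.max() / separation
```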
On the other hand, Kung and Lin [2004] formulated a validity index to validate fuzzy c-regression (FCRM) [Hathaway and Bezdek, 1993] type clustering approaches, i.e., fuzzy adaptations of switching regression problems. FCRM clustering algorithms identify c regression models' parameters and a partition (membership value) matrix, which is interpreted as the importance or weight attached to the error between the actual output and each regression model's output. The validity measure in [Kung and Lin, 2004] is based on the XB index; however, compactness is measured by the error between the observed output and the output obtained from a linear or polynomial regression function of each cluster. The separability is measured by the inverse dissimilarity between clusters, defined by the absolute value of the standard inner product of the unit normal vectors representing the c hyper-planes. Kung-Lin's cluster validity index is formulated as follows:

$$\text{Kung-Lin}(c)=\frac{\left(\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^{2}\left(x_k^{T}\beta_i-y_k\right)^{2}\right)/n}{1\,\Big/\left(\max_{i\neq j}\left|\langle u_i,u_j\rangle\right|+\kappa\right)},\qquad \beta_i=\left[x^{T}\mu_i x\right]^{-1}x^{T}\mu_i\, y \qquad (3.41)$$

The numerator in (3.41) is a compactness measure, and the denominator represents separability. The u_i represent the unit normal vectors of the c regression functions. FCRM models [Hathaway and Bezdek, 1993] are represented by regression equations, and therefore their corresponding unit vectors are defined by

$$u_i = \frac{n_i}{\|n_i\|}, \qquad n_i = \left[\beta_{i1} \cdots \beta_{i,nv}\;\; {-1}\right] \in \Re^{nv+1} \qquad (3.42)$$
Here n_i denotes the regression function parameters, β_{i,nv}, in vector form, ||·|| is the Euclidean norm, and nv denotes the number of variables of the input dataset, x = [x_1,…,x_nv]. The inner product of the unit vectors of two clusters equals the cosine of the angle between them. This value is used to measure the separability of the c regression functions in Kung and Lin's [2004] validity formula. When the functions are orthogonal, separability is maximized. They have shown that their validity function is a good indicator for FCRM models.
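A hedged sketch of (3.41)-(3.42) follows; that the regression includes an intercept, and that the intercept is excluded from the normal vector, are assumptions, and kappa is a small constant guarding the division:

```python
import numpy as np

def kung_lin(X, y, U, kappa=1e-6):
    """Sketch of the Kung-Lin index (3.41)-(3.42). X: (n, nv) inputs,
    y: (n,) outputs, U: (n, c) memberships."""
    n, c = U.shape
    Xa = np.column_stack([X, np.ones(n)])        # affine regression term
    betas, normals = [], []
    for i in range(c):
        W = np.diag(U[:, i])                     # weighted least squares (3.41)
        beta = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ y)
        betas.append(beta)
        ni = np.append(beta[:-1], -1.0)          # hyper-plane normal vector (3.42)
        normals.append(ni / np.linalg.norm(ni))
    # compactness: membership-weighted squared regression errors
    compact = sum(
        (U[:, i] ** 2 * (Xa @ betas[i] - y) ** 2).sum() for i in range(c)
    ) / n
    # separability: 1 / (max |<u_i, u_j>| + kappa)
    max_ip = max(
        abs(normals[i] @ normals[j])
        for i in range(c) for j in range(c) if i != j
    )
    return compact / (1.0 / (max_ip + kappa))
```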
3.4.2 The New Cluster Validity Indices

In this work, two new ratio-type cluster validity indices, cviIFC and cviIFC-C, are presented to validate the novel Improved Fuzzy Clustering (IFC) [Celikyilmaz and Turksen, 2007b] and the IFC for classification problems, IFC-C [Celikyilmaz and Turksen, 2007i]. Firstly, we focus on IFC for regression problems.

In the IFC algorithm, clusters are identified by cluster prototypes (centers) and their corresponding regression functions. Membership values calculated by the IFC algorithm represent the degree to which each object belongs to each prototype. They are also used as candidate input variables, which can help to identify the relationships between input and output variables through regression functions. Recall that each input vector is mapped onto a new feature space using membership values and/or their user-defined transformations as additional inputs. A new dataset is formed for each cluster in the feature space, and then one "Fuzzy Function" is estimated using these datasets. It is therefore expected that membership values obtained from the IFC algorithm can explain the output variable, in other words be "good" predictors of the regression functions, as well as represent better fuzzy partitions in the feature space.

When validating the IFC algorithm, one needs to relate the compactness and separability of clusters by measuring the clustering structure as well as the relationships between their representative regression functions. The new validity measure should include both of these concepts when validating the number of clusters of IFC models. The compactness of the new validity measure combines two terms: (1) the XB* compactness (the numerator in (3.40)) as the first term, and (2) a modified version of the compactness of the Kung-Lin index (the numerator in (3.41)) as the second term. The second compactness term of cviIFC represents the error between the regression model and the actual output. The regression models are the "Fuzzy Functions", f(Φ_i, Ŵ_i) → y, where Φ_i is a matrix of the input variables together with their membership values in the corresponding cluster and their transformations, i.e., the mapping of the input space onto a new space using membership values, and Ŵ_i are the regression coefficients. On the other hand, the separability of cviIFC couples the angle between the regression functions with the distance between the cluster center prototypes.

In the new validity function, the original scalar inputs also enter as predictors of the functions when measuring compactness. This can be explained as follows. The IFC algorithm finds membership values, μ_i^imp, i=1,…,c, that can predict the local models of a given system, where (imp) indicates that the membership values come from the improved clustering (IFC). These membership values and/or their transformations are also used together with the original input variables to determine the "Fuzzy Functions" of each cluster using a suitable function approximation method, e.g., least squares, support vector machines, ridge regression, etc., in system modeling with fuzzy functions (to be discussed in Chapter 4). The novel IFC algorithm introduces membership values and their transformations as additional predictors along with the original input variables in order to minimize the error of the local models of each cluster. For this reason, here the optimum number of clusters of the
new IFC algorithm is validated by analyzing the behavior of the new membership values, in addition to the original input variables, in the regression functions. The following steps describe the configuration of the new validity analysis:

(i) A different dataset is structured for each cluster i by using the membership values (μ_i^imp) and/or their transformations as additional predictors. This is the same as mapping the original input space, x ∈ ℜ^nv of nv input variables, onto a higher dimensional feature space ℜ^(nv+nm), i.e., x → Φ_i(x, μ_i^imp) ∈ ℜ^(nv+nm), for each cluster i, i=1,…,c. Hence, each data vector is represented in an (nv+nm)-dimensional feature space, where nm is the number of membership value columns and their potential transformations appended as new predictors to the original input space. Below, a special feature space is shown, formed by mapping the original matrix of one-dimensional inputs onto a new space of (nv+1+1) dimensions (input, membership value, and a single output), using only the membership values:
$$\Phi_i(x,\mu_i^{imp}) = \begin{bmatrix} \mu_{i1}^{imp} & x_1 \\ \vdots & \vdots \\ \mu_{in}^{imp} & x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$
(nv=1, nm=1). If needed, we use mathematical transformations of the membership values such as (μ_ik^imp)², (μ_ik^imp)^m, exp(μ_ik^imp), ln((1−μ_ik^imp)/μ_ik^imp), etc., where m represents the degree of fuzziness of the IFC.

(ii) We then fit a regression function for each cluster using its corresponding input dataset, Φ_i(x, μ_i^imp).

The new cluster validity index designed for IFC clustering is therefore formulated as:

$$vc^{*} = \max_{i=1,\dots,c}\left\{\frac{1}{n}\sum_{k=1}^{n}\mu_{ik}^{m}\left(\left\|(x_k,y_k)-v_i\right\|^{2} + \left(y_k - f_i(\Phi_i,\hat{W}_i)\right)^{2}\right)\right\}$$

$$vs^{*} = \begin{cases} \min_{i,j\neq i} \dfrac{\|v_i-v_j\|^{2}}{\left|\langle\alpha_i,\alpha_j\rangle\right|}, & \text{if } \langle\alpha_i,\alpha_j\rangle \neq 0,\; i,j=1,\dots,c,\; j\neq i \\[6pt] \min_{i,j\neq i} \|v_i-v_j\|^{2}, & \text{otherwise} \end{cases}$$

$$cvi_{IFC} = \frac{vc^{*}}{(c \cdot vs^{*}) + 1} \qquad (3.43)$$

where vc* represents the compactness and vs* the separability of the new validity measure.
Let n_i^Φ = [Ŵ_i1, Ŵ_i2, …, Ŵ_i,nm, Ŵ_i(nm+1), …, Ŵ_i(nm+nv)] ∈ ℜ^(nv+nm) represent the normal vector, i.e., the vector orthogonal to the hyper-plane, of the fuzzy function obtained from the given dataset in the feature space Φ_i(x, μ_i^imp) ∈ ℜ^(nv+nm). The α_i in |⟨α_i, α_j⟩| ∈ [0,1] represents the unit normal vector of each "Fuzzy Function" i, α_i = n_i^Φ / ‖n_i^Φ‖, where ‖n_i^Φ‖ is the length of the vector. The absolute value of the inner product of the unit vectors of the fuzzy functions of two clusters, i,j=1,…,c, i≠j, equals the cosine of the angle between them:
$$\cos\theta_{i,j} = \left|\langle\alpha_i,\alpha_j\rangle\right| = \frac{\left|\langle n_i^{\Phi}, n_j^{\Phi}\rangle\right|}{\|n_i^{\Phi}\|\,\|n_j^{\Phi}\|} = \frac{\left|\hat{W}_{i1}\hat{W}_{j1} + \cdots + \hat{W}_{i(nm+nv)}\hat{W}_{j(nm+nv)}\right|}{\sqrt{\hat{W}_{i1}^{2}+\cdots+\hat{W}_{i(nm+nv)}^{2}}\;\sqrt{\hat{W}_{j1}^{2}+\cdots+\hat{W}_{j(nm+nv)}^{2}}} \qquad (3.44)$$
When two cluster centers are too close to each other, due to an oversized number of clusters, the distance between them becomes almost zero (≅0), and the validity measure goes to infinity. To prevent this, the denominator of cviIFC in (3.43) is increased by 1. In cviIFC-C for IFC-C models, the "Fuzzy Functions" in vc*, f_i(Φ_i(x, μ_i^imp), Ŵ_i), are replaced with posterior probabilities obtained from classifier models as follows:

$$vc^{*} = \max_{i=1,\dots,c}\left\{\frac{1}{n}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}\left(\left\|(x_k,y_k)-v_i\right\|^{2} + \left(y_k - \hat{P}_i\!\left(y=1 \mid f(\Phi_i(x,\mu_i^{imp}),\hat{W}_i)\right)\right)^{2}\right)\right\} \qquad (3.45)$$
The compactness measure vc* (the numerator of cviIFC) of the new validity function, which differs from the XB* compactness criterion, includes an additional term: the squared error of the Fuzzy Functions. Nevertheless, vc* still has an effect on the outcome similar to the compactness of the XB* index, as shown in Figure 3.6, for the following reasons. When c > c-optimum, the compactness of the clusters will be small. This is due to the fact that, as the number of clusters is increased, within-cluster distances decrease, since clusters include more similar objects. In addition, one regression function is estimated for each cluster; as the number of functions increases, the error of the "Fuzzy Functions" decreases, because the regression model outputs approach the actual outputs. Compactness will be zero when c=n, i.e., when every object becomes its own cluster center and one function passes through every object. On the other hand, when c = c-optimum, the compactness of each individual cluster will be very small, so that the difference between the maximum and minimum compactness of the clusters is negligible. Hence, we can identify sudden changes by analyzing the maximum compactness values instead of the average compactness values. These are assumptions under optimum conditions, and the new cluster validity index combines them with a measure of the separability of the clusters to increase the precision of the approximation of the optimum number of clusters.
The separability of the new validity index also combines separability measures obtained from two different structures, i.e., regression and clustering. Between-cluster distances are represented as Euclidean distances between cluster centers in (3.43). Additionally, the absolute value of the cosine of the angle between each pair of "Fuzzy Functions", |⟨α_i,α_j⟩| ∈ [0,1], is used as an additional separability criterion. If the functions are orthogonal, then they are the most dissimilar functions, which is an attribute of an optimum model. The separability, i.e., the denominator of cviIFC in (3.43), conditionally combines between-cluster distances and angles by taking the ratio between them. If the angle between two functions is zero, they are parallel to each other, so we use the minimum distance between their cluster centers to represent the separability. Better separability then yields better clustering results.

Figure 3.7 is used to explain the changes in separability for different patterns in the dataset. When two clusters are far apart, i.e., the distance between their cluster centers is larger than for the rest of the clusters, e.g., cluster 1 (Δ's) and cluster 2 (•'s) in Figure 3.7, their separability will be large no matter what the angle between their regression functions is. Separability gets larger still if the two vector lines are orthogonal, i.e., the cosine of the angle approaches zero as the angle approaches 90°. On the other hand, when clusters are very close to each other, e.g., cluster 2 (•'s), cluster 3 (*'s) and cluster 4 (o's) in Figure 3.7, the angle between their functions becomes the dominant separability identifier. If the vectors are close to orthogonal, i.e., the cosine of the angle between them is very close to 0, as for cluster 3 and cluster 4, then the minimum separability will be very large even if the distance between their centers is close to zero.
[Figure 3.7 plots four local linear functions, labeled 1-4, over the input range [-5, 5], with varying center distances and angles between them.]

Fig. 3.7 Four Different Functions with Varying Separability Relations
When two cluster centers are close to each other and the functions are almost parallel, i.e., the cosine of the angle is close to 1, e.g., cluster 2 (•'s) and cluster 3 (*'s), then the separability will be small. Using the artificially created datasets, to be presented next, we will justify that the presented cviIFC and cviIFC-C validity measures are good indicators for the new clustering schemas, i.e., IFC and IFC-C respectively. The performance of the new validity measures will be investigated in comparison to the three other well-known validity criteria mentioned above.
3.4.3 Simulation Experiments [Celikyilmaz and Turksen, 2007i; 2008c]

In this section we present simulation experiments to measure the performance of the cviIFC and cviIFC-C methods by applying the IFC and IFC-C methods to artificially created datasets of different structures as well as to real datasets. We also apply three different well-known cluster validity indices to justify the strength of the new cluster validity measures. Afterwards, we discuss the results obtained from the validity experiments.

Experiment 1
In order to demonstrate the performance of the new validity indices, we constructed tests on datasets with known structures (number of clusters, patterns of functions, etc.). We use a dataset structure similar to [Kung and Lin, 2004], but we formed 4 different datasets containing different numbers of linear models. Each dataset is created using a different number of functions with Gaussian noise, ε_{l,m} (l=1,…,4, m=1,…,nf), having zero mean and a variance of 0.9 in each function; m indexes the linear functions in Table 3.5 used to generate each dataset l, l=1,…,4.

Firstly, 400 training input values, x, uniformly distributed in the range [−5, +5], are generated randomly to form the first dataset. The dataset is then split into 4 separate groups of 100 observations, and each function from Table 3.5(A) is applied to one group to obtain the output values; e.g., the 4-cluster dataset is formed using four separate local models. Following the same convention, we generated 500, 490 and 450 more training values, x, separately, also uniformly distributed in [−5, +5], to generate the datasets with 5, 7 and 9 patterns using the corresponding functions in Table 3.5 (B), (C), and (D), respectively. From these 500, 490 and 450 training values, we applied each group of 100, 70 and 50 observations to the 5, 7, and 9 functions correspondingly to form 3 more single-input single-output datasets: dataset2, dataset3, and dataset4.

We applied the IFC algorithm [Celikyilmaz and Turksen, 2007b] to these 4 datasets separately using 2 different degrees of fuzziness, m=1.3 and m=2.0, and 14 different numbers of clusters, c=2,…,15. We measured the new cluster validity index, cviIFC, as well as the XB, XB*, and Kung-Lin validity measures, using the membership values obtained from these IFC models.
Table 3.5 Functions used to generate Artificial Datasets

(A) 4-cluster (dataset1):
y_1 = x_1^T β_1 = 2x + 5 + ε_{1,1}
y_2 = x_2^T β_2 = 0.2x − 5 + ε_{1,2}
y_3 = x_3^T β_3 = −x + 1 + ε_{1,3}
y_4 = x_4^T β_4 = x − 8 + ε_{1,4}

(B) 5-cluster (dataset2):
y_1 = x_1^T β_1 = 2x + ε_{2,1}
y_2 = x_2^T β_2 = 7x − 5 + ε_{2,2}
y_3 = x_3^T β_3 = −x + 1 + ε_{2,3}
y_4 = x_4^T β_4 = x + 14 + ε_{2,4}
y_5 = x_5^T β_5 = −5x − 6 + ε_{2,5}

(C) 7-cluster (dataset3):
y_1 = x_1^T β_1 = 2x + ε_{3,1}
y_2 = x_2^T β_2 = 7x − 5 + ε_{3,2}
y_3 = x_3^T β_3 = −x + 1 + ε_{3,3}
y_4 = x_4^T β_4 = x − 14 + ε_{3,4}
y_5 = x_5^T β_5 = −5x − 6 + ε_{3,5}
y_6 = x_6^T β_6 = 6ε_{3,6}
y_7 = x_7^T β_7 = 3x − 25 + ε_{3,7}

(D) 9-cluster (dataset4):
y_1 = x_1^T β_1 = x + ε_{4,1}
y_2 = x_2^T β_2 = 3x + 2 + ε_{4,2}
y_3 = x_3^T β_3 = 0.5x + 1 + ε_{4,3}
y_4 = x_4^T β_4 = 3x − 3 + ε_{4,4}
y_5 = x_5^T β_5 = −x + 2 + ε_{4,5}
y_6 = x_6^T β_6 = −2x − 9 + ε_{4,6}
y_7 = x_7^T β_7 = −2x + ε_{4,7}
y_8 = x_8^T β_8 = x − 0.5 + ε_{4,8}
y_9 = x_9^T β_9 = −x + 0.2 + ε_{4,9}
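As a hedged illustration, dataset1 of Table 3.5(A) could be generated along the following lines; the seed and helper name are illustrative, and the noise standard deviation is √0.9 to match the stated variance of 0.9:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(coeffs, n_per_group, noise_std=np.sqrt(0.9)):
    """Generate a switching-regression dataset as in Table 3.5:
    each (a, b) pair defines one local line y = a*x + b + noise."""
    xs, ys = [], []
    for a, b in coeffs:
        x = rng.uniform(-5.0, 5.0, n_per_group)
        y = a * x + b + rng.normal(0.0, noise_std, n_per_group)
        xs.append(x)
        ys.append(y)
    return np.concatenate(xs), np.concatenate(ys)

# dataset1: four local models from Table 3.5(A), 100 observations each
x1, y1 = make_dataset([(2, 5), (0.2, -5), (-1, 1), (1, -8)], 100)
```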
Figure 3.8 illustrates the values obtained from the four cluster validity functions using the membership values of the IFC models on dataset1, for changing numbers of clusters, c, and two fuzziness values, m=1.3 and m=2.0. m=1.3 represents a more crisp model where the overlap of the clusters is negligible, while m=2.0 is a model where the local fuzzy clusters (models) overlap to a degree of 2.0. Analogously, Figure 3.9, Figure 3.10 and Figure 3.11 compare the values obtained from these four cluster validity measures using the membership values of the IFC models on dataset2, dataset3 and dataset4, respectively. It is expected that c* is 4, 5, 7, and 9 for dataset1, dataset2, dataset3 and dataset4, respectively.
Fig. 3.8 Dataset 1 - Cluster validity measures XB, XB*, Kung-Lin and cviIFC versus c for two m values, m=1.3 (.-) and m=2.0 (*-), using the 4-patterned dataset
Fig. 3.9 Dataset 2 - Cluster validity measures XB, XB*, Kung-Lin and cviIFC versus c for two m values, m=1.3 (.-) and m=2.0 (*-), using the 5-patterned dataset
Fig. 3.10 Dataset 3 - Cluster validity measures XB, XB*, Kung-Lin and cviIFC versus c for two m values, m=1.3 (.-) and m=2.0 (*-), using the 7-patterned dataset
Experiment 2
In order to demonstrate the strength of the proposed validity measure on real datasets, we applied the proposed IFC to a dataset of historical stock prices of a major Canadian financial institution. The aim of this experiment is to predict the last two months' stock prices using models estimated from the previous ten months of stock prices.
Fig. 3.11 Dataset 4 - Cluster validity measures XB, XB*, Kung-Lin and cviIFC versus c for two m values, m=1.3 (.-) and m=2.0 (*-), using the 9-patterned dataset
[Figure 3.12 shows a density surface over two financial indicators, the 20-day Bollinger Band and the 50-day EMA, with two separated peaks marked as Potential Cluster #1 and Potential Cluster #2.]

Fig. 3.12 Density graph of the stock price dataset using two financial indicators. Two components that are well separated are indicated.
In real datasets, just as in artificial datasets, there are hidden components, viz. clusters, within which input-output relations can be identified locally. Since we do not have a prior conception of the actual number of clusters in real datasets, we can conduct an exhaustive search by changing the number of clusters, keeping the other parameters constant, and investigating the
performance of each model. The optimum model is the one with the best performance, e.g., the least model error. We identify the optimum number of clusters as the number of clusters of the optimum model. Here, we used the proposed fuzzy system modeling tools based on Type-1 Improved Fuzzy Functions (T1IFF), to be discussed in Chapter 4. The T1IFF system modeling tools implement the novel IFC method to find hidden structures in a given dataset. Improved membership values obtained from the IFC are used as additional predictors to identify the local fuzzy functions. We iterated the T1IFF for different numbers of clusters, keeping the rest of the parameters constant, and captured the optimum number of clusters, c*, based on the minimum error criterion.

At this point we should pause and explain the multi-modal structure of real systems. One can observe hidden structures in a given system simply by analyzing the probability distributions of the input variables. For instance, Figure 3.12 illustrates the Gaussian density graph of the given stock price dataset using two financial indicators. Two separate components can easily be noticed in the graph. The degree of overlap between these components might affect the number of clusters perceived by a human operator. We expect to distinguish any hidden overlapping clusters in real datasets using the cviIFC, based on the structure of the system.

In this experiment, stock prices collected over 12 months are divided into two parts. Indicator values from the first ten months, i.e., from 27 July 2005 to 11 May 2006, are used to train the models and to optimize the model parameters. The last two months, i.e., from 11 May to 21 July 2006, are held out for testing the models' performance. The experiments were repeated with 20 random subsets of the above sizes. Model performance on the holdout datasets is measured using the root-mean-square error (RMSE) and averaged over the 20 repetitions. Financial indicators such as the moving average and the exponential moving average are used as predictors. Detailed explanations of the financial indicators used in such experiments are given in the chapter on experiments.

We applied an exhaustive search to identify the optimum T1IFF model by changing the number of clusters, bounded by c ≤ n/10, where n is the number of training samples. Improved membership values, μ^imp, their logit transformations, i.e., log((1−μ^imp)/μ^imp), and their exponential transformations, i.e., exp(μ^imp), are used as additional dimensions (input variables) to approximate the fuzzy functions in a new feature space. Based on the minimum RMSE values, the average optimum c, c*, of the best T1IFF models over the 20 repetitions is identified as c*=4.9±1.77. In short, we found that this stock price dataset may consist of c*∈[3, 6] structures.

We now want to test whether the new cluster validity index could have found the optimum c* by just applying IFC clustering, without executing the exhaustive search with the T1IFF strategy. To validate the latter statement, we calculated the values of the proposed validity index, cviIFC, as well as the XB, XB* and Kung-Lin indices, using the membership values obtained from the proposed IFC models. We plotted the validity index values measured for different c values and two fuzziness values, m=1.3 and m=2.0.
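The exhaustive search described above can be sketched as follows; t1iff_fit_predict is a hypothetical stand-in for the full T1IFF training and prediction pipeline of Chapter 4:

```python
import numpy as np

def select_optimal_c(fit_predict, X_train, y_train, X_val, y_val, c_max):
    """Exhaustive search over the number of clusters: train a model for
    each c, keep the other parameters fixed, and pick c* by validation RMSE.
    `fit_predict(X_tr, y_tr, X_va, c)` stands in for the T1IFF pipeline."""
    best_c, best_rmse = None, np.inf
    for c in range(2, c_max + 1):
        y_pred = fit_predict(X_train, y_train, X_val, c)
        rmse = np.sqrt(np.mean((y_val - y_pred) ** 2))
        if rmse < best_rmse:
            best_c, best_rmse = c, rmse
    return best_c, best_rmse

# As in the text, c is bounded by one tenth of the training size:
# c_star, rmse = select_optimal_c(t1iff_fit_predict, Xtr, ytr, Xva, yva,
#                                 c_max=len(Xtr) // 10)
```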
Fig. 3.13 Stock price dataset - Cluster validity measures XB, XB*, Kung-Lin and cviIFC versus c for two m values, m=1.3 (.-) and m=2.0 (*-)
The CVI graphs in Figure 3.13 show the values of the different CVI measures using membership values from the IFC algorithm applied to the stock price dataset.

Experiment 3
In addition to the first two experiments above, we wanted to validate the performance of the novel Improved Fuzzy Clustering for classification (IFC-C) algorithm using the new cviIFC-C measure. Since the proposed IFC-C is specifically designed for binary classification problems, we chose the Ionosphere classification dataset from the UCI repository [Newman et al., 1998] to demonstrate the performance of the new cviIFC-C. The targets of the ionosphere dataset take on the values "good" or "bad", indicating the free electrons in the ionosphere: "good" radar returns are those showing evidence of some type of structure in the ionosphere, while "bad" returns are those that do not; their signals pass through the ionosphere. The details of this dataset are given in the chapter on experiments (Chapter 6).

In this experiment, the actual number of clusters that the Ionosphere dataset might hold, viz. overlapping or discrete structures that represent multiple classification surfaces, is again an unknown parameter. Therefore we applied our proposed Improved Fuzzy Functions approach for classification problems (T1IFF-C, to be discussed in Chapter 4) to the ionosphere dataset. We only changed the number of clusters, keeping the rest of the parameters constant. The list of parameters of the previous T1IFF models of experiment 2 that
were used to model the stock prices is also used to build the different T1IFF-C models of this experiment. The experiment is repeated 10 times. The results yielded an average c* of around c*=3.5±0.92, viz. c*∈[3, 5], based on the average highest classification accuracy over the 10 iterations. To validate the optimum number of clusters, c*, we applied the proposed cviIFC-C as well as the XB, XB* and Kung-Lin indices to the outcomes of the 10 different IFC-C models and took the average of each individual validity measure. Figure 3.14 demonstrates the results for two fuzziness values, m=1.3 and m=2.0.
Fig. 3.14 Ionosphere dataset - Cluster validity measures XB, XB*, Kung-Lin and cviIFC-C versus c for two m values, m=1.3 (.-) and m=2.0 (*-)
3.4.4 Discussions on Performances of New Cluster Validity Indices Using Simulation Experiments

We have reached the following conclusions based on the analysis of the results of the experiments on 4 different artificial datasets and a real-life prediction dataset (the stock price dataset). We analyzed the outcomes of the IFC to measure the values of cviIFC along with the values of the known cluster validity measures XB, XB*, and Kung-Lin. The discussion also includes the analysis of IFC-C models on a binary classification dataset, i.e., the Ionosphere dataset. For these experiments we measured cviIFC-C as well as XB, XB*, and Kung-Lin to validate the IFC-C model results.
Analysis of Experiment 1. Table 3.6 shows the predicted optimum number of clusters obtained from the four different validity measures on the 4 artificial datasets for two fuzziness levels, m=1.3 and m=2.0.
Table 3.6 Optimum number of clusters of artificial datasets for m∈{1.3, 2.0}

                      Dataset 1   Dataset 2   Dataset 3   Dataset 4
Actual # of Clusters  c*=4        c*=5        c*=7        c*=9
m=1.3
  XB                  2-6         2-6, 11     5           2, 8
  XB*                 2-9         2-7, 11     10          4
  Kung-Lin            9           10          8           9, 10
  Proposed cviIFC     4, 5        4-5         7           9
m=2.0
  XB                  2-6         2-6, 11     3, 5, 6     4, 7, 9
  XB*                 2-8         2-9         8           9-10
  Kung-Lin            9, 13       10          6           9, 10
  Proposed cviIFC     4           5, 6        6, 7        9
The results shown in Table 3.6, obtained from the application of the proposed cviIFC to the 4 different datasets (Figure 3.8 - Figure 3.11), indicate that cviIFC can successfully suggest the optimum number of clusters. It is also not affected by changing levels of fuzziness, i.e., m=1.3 and m=2.0. It converges at the optimum number of clusters and then asymptotes to zero as the number of clusters grows large, e.g., c=15 in this case. In all four datasets, the new validity criterion captures c* at the actual value and then converges to zero as c > c*.

The difference between cviIFC and XB* is that the proposed cviIFC is expected to show asymptotic behavior towards larger numbers of clusters, whereas the XB* validity index can increase or decrease, with c* indicated where the index attains its minimum value. When the actual number of clusters is small, as in dataset1 and dataset2, XB* cannot directly identify c* for either level of fuzziness. For larger c values, e.g., dataset3 in Figure 3.10, XB* can identify the actual number of clusters more precisely. We conclude that when the number of different components (clusters, patterns) in a given dataset is large, XB* can validate IFC clustering applications to a certain degree; when the system has fewer models, XB* is not the optimum validity measure of the number of clusters for the IFC method.

Since the Kung-Lin [2004] cluster validity index asymptotes to zero, its knee-point is the indicator of the optimum number of clusters, c*. From the cluster validity graphs of the four artificial datasets in Figure 3.8-Figure 3.11, the Kung-Lin index is unable to identify the optimum number of clusters for datasets with few clusters, and is capable of identifying c* for datasets with larger numbers of clusters only to a certain extent.

In [Kim and Ramakrishna, 2005], it was shown that the XB index is inefficient at identifying the optimum number of clusters, c*, of FCM clustering models. We wanted to test whether it can identify the c* of the IFC models. The minimum XB values
indicate the predicted c*. The XB index was unable to identify an exact c* in most of the datasets. It can, however, give a wider range which includes the optimum c*; this information is not enough to determine a precise value for the optimum number of clusters.
Analysis of Experiment 2. The optimum number of clusters was estimated to be c*∈[3, 6] based on the T1IFF fuzzy system models, where the proposed IFC is used for structure identification on the real stock price dataset [Celikyilmaz and Turksen, 2007b]. Hence, we applied the proposed cviIFC function and averaged over 20 repetitions. Figure 3.13 shows the values obtained from the four different validity measures using two levels of fuzziness. The results are summarized in Table 3.7. The values of the proposed cviIFC indicate that c* should be around 4 or 6 for the two different m values. The Kung-Lin index can also correctly validate c*. In addition, the XB index can somewhat identify c* within the interval c*∈[4, 8]. Among the 4 different cluster validity measures, the cviIFC still has the closest estimates to the actual c*.

Table 3.7 Optimum number of clusters of IFC models of the stock price dataset identified by different validity indices

Actual # of Clusters → [3, 6]

         XB      XB*       Kung-Lin   Proposed cviIFC
m=1.3    6-8     4, 6-8    4          4, 6
m=2.0    5, 7    5, 7      4          4, 6
Analysis of Experiment 3. In experiment 3, we wanted to demonstrate that the new cviIFC-C, specifically designed for classification-type datasets, can actually identify the optimum number of clusters of a real dataset, viz. the ionosphere dataset from the UCI repository. From the exhaustive analysis based on T1IFF-C models for changing c values and two different levels of fuzziness, we obtained that c* should be c*∈[3, 5]. From the validity graphs shown in Figure 3.14, we obtained the optimum number of clusters indicated by each validity measure for the two fuzziness values, listed in Table 3.8. It is to be noted that cviIFC-C is able to identify overlapping clusters, and the elbow indicates that the optimum c* should be 4 or 6. The Kung-Lin index could also identify the approximate c*. None of the other CVI measures was able to confidently identify c* for changing values of fuzziness.
Table 3.8 Optimum number of clusters of IFC-C models of the Ionosphere dataset indicated by different validity indices

Actual # of Clusters → [3, 6]

         XB        XB*    Kung-Lin   Proposed cviIFC-C
m=1.3    6-9       8-9    6          6
m=2.0    4, 7, 8   7, 8   6          4
3.5 Summary

In data mining practice, one of the challenges is that the datasets under study may have different properties, e.g., number of variables, number of data vectors, noise level, number of different model structures, characteristics of the dependent variable, etc., such that different approaches should be followed to obtain better representative models of a system. Based on these dataset characteristics, different fuzzy clustering algorithms have been introduced in the literature to find the hidden structures in them. If one designs a fuzzy clustering algorithm for an image processing dataset, which is composed of pixel intensities, the neighborhood pixel intensities might be one of the important measures that the clustering algorithm should utilize. Similarly, for a web log dataset, similarity based on recursive occurrences of corresponding words will be the most discriminative property. The fuzzy clustering should include the corresponding similarity measures based on the type of the dataset under study.

In this chapter, different fuzzy clustering methods are reviewed and the structural differences between them are discussed. Most importantly, a new improved clustering algorithm is proposed to be used in a new fuzzy system modeling approach, namely the "Improved Fuzzy Functions" approaches. The logic behind the proposed improved fuzzy clustering algorithm is that feature spaces are related not only to the input space but also to the input-output mapping space, so the given data objects can be grouped by considering the local models of the dataset as well as the linear relationships between input and output variables. Therefore, the new improved clustering algorithm incorporates regression-type clustering to find membership values, and transformations of them, that help to explain the local model relationships while at the same time preserving the basic clustering structure that separates the dataset into meaningful clusters. We also present an extension of the new fuzzy clustering method for classification problem domains.
This chapter also deals with a problem common to almost all clustering algorithms, namely the determination of the optimum number of clusters. In the current literature, since there are numerous fuzzy clustering methods, many different cluster validity measures have been proposed; in particular, there are numerous validity measures just to identify the optimum number of clusters of the well-known Fuzzy C-Means clustering algorithm. In this chapter, two new cluster validity indices are introduced, corresponding to the two new fuzzy clustering methods, in order to validate their structures. Experimental analysis on artificial and real-life datasets showed that the new cluster validity indices can help to identify the optimum number of clusters of models built with the new improved fuzzy clustering methods.

We preferred to present the experimental analysis of the new validity indices in this chapter instead of in the chapter on experiments. The main reason is that in Chapter 4 and Chapter 5, a new evolutionary system modeling structure based on the proposed fuzzy functions approaches will be introduced to dynamically optimize the number of clusters based on stochastic search. Hence, the approaches proposed in Chapter 4 and Chapter 5 will be used to optimize the number of clusters with genetic algorithms.
Chapter 4
Fuzzy Functions Approach
This chapter presents the mathematical theory of the new "Fuzzy Functions" and "Improved Fuzzy Functions" approaches to build fuzzy system models for regression and classification problems using type-1 fuzzy sets.¹ Additionally, the Evolutionary Type-1 Fuzzy Functions and "Improved Fuzzy Functions" approach² are presented.

Mathematics seems to endow one with something like a new sense.
Charles Robert Darwin (1809-1882)
4.1 Introduction

This chapter introduces the theory of the novel Type-1 Fuzzy System Modeling based on the "Fuzzy Functions" approach [Turksen, 2008; Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007a]. Type-1 fuzzy systems utilize "first order" fuzzy sets. System modeling with higher order fuzzy sets, such as type-2 fuzzy sets, will be investigated in the next chapter.

Fuzzy systems based on Fuzzy Rule Bases (FRB) have been successfully used to model human problem-solving activities. A classical way to represent human knowledge is with 'IF…THEN' fuzzy rules. The 'IF' part represents the antecedents (input fuzzy sets), and the 'THEN' part represents the consequents (output fuzzy sets). There are various FRB systems, depending on the way the antecedent and consequent parameters are formed. These FRB system modeling strategies were reviewed in the previous chapter (Chapter 2); hence we will not go into detail about the underlying methods here. One should note that in these systems, membership values of fuzzy sets represent different concepts, such as the "degree of belongingness", the "degree of fire", the "degree of compatibility", or the "weight or strength of local functions or individual objects". For instance, in Fuzzy c-Regression methods [Hathaway and Bezdek, 1993], membership values are used as weights (strengths) assigned to each local function identified by the system
¹ The first part of this chapter is an extension of [Turksen and Celikyilmaz, 2006; Celikyilmaz and Turksen, 2007a,b,e,i,k; 2008e,f,g].
² The last section of this chapter is an extension of [Celikyilmaz and Turksen, 2007g].
model. Nonetheless, in this work a new representation of membership values is introduced to formulate the structure of the "Fuzzy Functions" approaches.

The novel fuzzy functions approach was initially introduced by Turksen in 2005 [Turksen, 2008]. Recently, the algorithm was extended and combined with other soft computing approaches for performance improvement and automatic parameter identification [Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007b,d,g,k]. The standard fuzzy functions approaches are multi-variable crisp-valued functions which implement type-1 fuzzy sets. This chapter will explore these methods. In the next chapter, "Interval Type-2 Fuzzy Functions" will be presented along with the new improvements to the "Fuzzy Functions" theory.

"Fuzzy Functions" approaches emerged from the idea of representing each unique rule of an FRB system in terms of 'Fuzzy Functions'. One of their prominent features is that the degree of belongingness of each sample vector in a fuzzy set has a direct effect on how the local fuzzy functions of that set are defined. Domain experts who are not familiar with the theory of fuzzy logic can build various fuzzy system models with "Fuzzy Functions" approaches instead of FRB systems, since "Fuzzy Functions" strategies require fewer steps for the identification of each rule and for reasoning with them. Domain experts do not need to know fuzzy set operations, e.g., fuzzification or defuzzification, or fuzzy set operators such as t-norms and co-norms, modus ponens, etc.

With the new "Fuzzy Functions" approach, one can build models for various system structures, as with other fuzzy system modeling tools. The goal of general system modeling depends on the type of the system under study. If the aim is to assign class labels to objects, as in classification problems, the goal of system modeling is to reduce the number of misclassified cases. On the other hand, if the problem involves the estimation of a relationship between given independent variables and the dependent variable using functions, then the goal of system modeling is to find a representation function that minimizes the prediction error. In this chapter, both of these system modeling techniques will be investigated using the "Fuzzy Functions" approaches.

The constituents of the general system modeling framework [Uncu, 2003] are summarized as follows (a schematic sketch of this selection loop follows the list):
• Data Pre-processing: This step ensures that the raw data extracted from the data pool, e.g., a data warehouse, is free from noise and missing values.
• Structure Identification: The training dataset is used to approximate the learning parameters of a mathematical model of the system to predict an output value or a label for a given object.
• Structure Tuning and Model Selection: The parameters of the mathematical model are tuned by applying it to a validation dataset to measure its performance. This process is usually called the validation mechanism. Training and validation are repeated several times for different sets of parameters, which, in turn, tunes those parameters. The final model is selected when the desired error rate or computation time is reached.
• Structure Verification: This process verifies the performance of the final model using a testing dataset which has not been used during the previous stages of system modeling.
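As a schematic sketch of this framework, assuming a generic fit/predict model API; all names are illustrative:

```python
import numpy as np

def model_selection_loop(build_model, param_grid, splits):
    """Structure identification, tuning, and verification in one loop:
    fit on train, tune on validation, verify the chosen model on test.
    `build_model(params)` is assumed to return an object with
    sklearn-style fit/predict methods."""
    (Xtr, ytr), (Xva, yva), (Xte, yte) = splits
    best, best_err = None, np.inf
    for params in param_grid:                       # structure tuning
        model = build_model(params).fit(Xtr, ytr)   # structure identification
        err = np.mean((model.predict(Xva) - yva) ** 2)
        if err < best_err:
            best, best_err = model, err
    test_err = np.mean((best.predict(Xte) - yte) ** 2)  # structure verification
    return best, test_err
```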
In every system modeling technique, one or more of these steps are implemented; however, the underlying theories of each step may vary from one strategy to another. Hence, in this chapter we introduce novel structure identification, tuning and reasoning algorithms for the "Type-1 Fuzzy Functions" approach. In this work, the data pre-processing steps, e.g., data cleaning, selection of input variables, etc., are assumed to be completed in the first step of the fuzzy system modeling. Structure identification and parameter tuning together are called the "training algorithm", and the structure verification is called the "inference algorithm" (for testing data). We therefore divide system modeling with the fuzzy functions method into two parts, based on the two concepts of 'training' and 'inference' (or 'testing').

The fuzzy clustering methods form the basis of the fuzzy functions strategies. The Fuzzy c-Means (FCM) clustering algorithm [Bezdek, 1981a] is generally used in fuzzy functions approaches to obtain membership values. In Chapter 3, the new Improved Fuzzy Clustering (IFC) [Celikyilmaz and Turksen, 2007b] was introduced, and it was hypothesized that it would yield better performance than the FCM clustering algorithm when implemented in the fuzzy functions system. Since FCM clustering and IFC require different initial parameters and processing steps, in this chapter we investigate Fuzzy Functions using FCM clustering and using Improved Fuzzy Clustering (IFC) separately. Next, we present our motivation for the fuzzy functions systems in comparison to earlier FRB systems.
4.2 Motivation

In the earlier traditional fuzzy system models, expert knowledge was encoded to define linguistic properties using fuzzy sets. However, these methods have the major drawbacks of being subjective and of lacking generalization. To reduce expert-knowledge intervention and instead build self-learning systems, more objective fuzzy system models were developed, e.g., [Babuška and Verbruggen, 1997; Emami et al., 1998; Kasabov and Song, 2002; Kilic et al., 2001; Zarandi et al., 2004]. Fuzzy sets in these methods are learned from the data under study through optimization algorithms, such as fuzzy clustering. In more recent studies, more sophisticated optimization algorithms for building hybrid fuzzy systems, e.g., neuro-fuzzy algorithms [Jang, 1993; Kasabov and Quin, 2002; Pedrycz, 2004] and genetic fuzzy systems [Pedrycz and Reformat, 2003; Bodur et al., 2006; Cococcioni et al., 2007], are utilized for performance enhancement.

Although the aforementioned approaches improve the efficiency of fuzzy systems, they have various challenges that should not be neglected [Turksen and Celikyilmaz, 2006]. Let us next examine the standard fuzzy systems based on fuzzy rule bases. Among their challenges are:
• identification of the types of antecedent and consequent membership functions, and their parameters;
• identification of the most suitable combination operators (t-norm, t-conorm, etc.) and conjunction operators during aggregation of antecedents and consequents;
• identification of the implication operator types to capture the uncertainty associated with the linguistic "AND", "OR", "IMP" for the representation of the rules, and for reasoning with them;
• identification of the type of defuzzification method.
Over the course of many years these challenges have been investigated, and many different methods have been used to optimize the parameters of these system models in order to reduce expert intervention [Vergara and Moraga, 1997; Wang and Mendel, 1992; Cococcioni et al., 2007]. Thus, in this chapter we introduce a unique approach, entitled the "Fuzzy Functions" approaches, originally proposed by Turksen [2008], instead of the FRB approaches. The aim is to reduce the number of fuzzy operators, e.g., fuzzification, aggregation of antecedents, implication, aggregation of consequents, optimization of antecedent and consequent membership functions, etc. The "Fuzzy Functions" strategy is designed to be less complicated than FRB systems, since it eliminates most of the aforementioned operations. In a somewhat simplified view, the type-1 fuzzy functions approach works as follows (a minimal sketch of this inference follows the list):

• The domain x ∈ X ⊆ ℜ^nv, an nv-dimensional input space where X is the domain of x, is partitioned into c overlapping clusters, represented by cluster centers v_i, i=1,…,c.
• To each of these regions, a local fuzzy model f_i: v_i → ℜ is assigned. Given some input data x ∈ X, the system identifies one fuzzy output from each fuzzy model and then weighs these outputs by the degree of belongingness (membership values) of the given input vector in each cluster.
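A minimal sketch of this simplified view, assuming FCM-style membership computation for a new input and callable local models; the actual fuzzy functions inference of this chapter augments the inputs with membership transformations, which is omitted here:

```python
import numpy as np

def fuzzy_functions_predict(x, V, local_models, m=2.0):
    """Type-1 fuzzy functions inference in simplified form:
    FCM-style memberships of x in each cluster weight the outputs
    of the local models f_i (here: callables, one per cluster)."""
    d = np.linalg.norm(x - V, axis=1) + 1e-12          # distances to centers
    inv = (1.0 / d) ** (2.0 / (m - 1.0))
    u = inv / inv.sum()                                # membership of x in each cluster
    y_local = np.array([f(x) for f in local_models])   # one fuzzy output per model
    return (u * y_local).sum()                         # membership-weighted combination
```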
Already, at this point, we ought to outline two requirements that need to be fulfilled in order to obtain a fuzzy system that requires a smaller number of fuzzy operators, and yet yield good prediction performance: (i)
Concerning the partition: The partition must identify the local models accurately so that during reasoning, the local functions could be as close to the true local relationships as possible. As a first step, each variable may be partitioned separately and the partition of the multivariate space can be obtained as the cross-product of the uni-variable partitions. However, in real life datasets, separating variables is usually impossible since variables are very much related and/or interact. Thus, it is more useful to consider them in combination. A simple example is using variables like “location” instead of using {“longitude” and “latitude”}, and similarly “speed” instead of {“distance” and “time to travel”}. Hence, in “Fuzzy Function” systems, we assume that each variable interact with each other and should be analyzed together by representing the antecedent part with the interactive membership values. This is the interactivity assumption of antecedent fuzzy sets, to be discussed in the following section.
(ii) Concerning the complexity of the local models: Given that the input variables are within a limited range in each cluster, the local models identify the behavior of the output variable. In order to improve prediction performance, the local models should be close enough to the actual local relationships and should also have good generalization capability. Additionally,
the local functions should not be so complex that they cause over-fitting. One should choose optimum functions to identify the local models. In many cases, a small set of simple models should be sufficient. In our models, a fuzzy clustering algorithm is executed to form the "Fuzzy Functions" system models and to identify hidden local structures. The "Fuzzy Functions" systems are unique because the membership values obtained from fuzzy clustering and their transformations are used to explain the local fuzzy relationships. This is a new and novel use of membership values in fuzzy system modeling studies. In this respect, clustering algorithms should find improved membership values to enhance the prediction performance of the local "Fuzzy Functions". Hence, the performance of "Fuzzy Function" system models relies on the fact that, in principle, the input-output behavior can be explained by each local "Fuzzy Function" structure, which poses a third requirement:

(iii) Concerning the behavior of membership values: Given fuzzy partitions V_i (either univariate or multivariate), the fuzzy membership value calculation equations in "Fuzzy Functions" system models capture the membership of individual objects in fuzzy clusters, which is used together with the original independent variables to identify the behavior of the output variable in each cluster. Therefore, improved membership values should be extracted with the improved fuzzy clustering algorithms in (3.26) and (3.32). This is achieved by using the intelligent fuzzy clustering methods discussed in Chapter 3 to improve the predictability of the output behavior of each "Fuzzy Functions" model.

A common property of conventional Fuzzy Rule Base (FRB) approaches is the non-interactivity of antecedent fuzzy sets. In earlier FRB systems, a separate membership function for each explanatory variable in each rule is identified by domain experts, usually using triangular membership functions. However, each input variable affects the output variable in relation to the other input variables. One should analyze the input variables together when trying to estimate a function to explain the behavior between inputs and outputs. One way to approach this issue is to assume that the input variables interact with one another, so their antecedent fuzzy sets should be analyzed as a whole. Several structure identification methods have been proposed that address this concept, e.g., [Delgado et al., 1997; Babuska and Verbruggen, 1997]. In these approaches, it is assumed that there is interactivity between the antecedent fuzzy sets. Hence, using clustering methods, a single membership function is identified for the entire antecedent part of each fuzzy rule instead of separate membership functions for each variable. Input membership functions are defined by projecting the membership functions obtained from different domains, e.g., input-output, input, or output domains, onto the input and output domains separately. In this work, the "Fuzzy Functions" approach also assumes interactivity between the input variables.

The new "Fuzzy Functions" system models and traditional fuzzy system models of FRB structures and their extensions [Bodur et al., 2006; Jang, 1993; Kasabov and Song, 2002; Kilic et al., 2001; Kim et al., 1998] share similar system design steps, as shown in Figure 4.1, but they differ in structure identification
[Figure 4.1: a training engine performs system identification (structure identification from the input variables and membership values, parameter identification, and rule generation with fuzzy functions) and feeds an inference engine for reasoning with the fuzzy functions.]
Fig. 4.1 Framework of fuzzy system models with “Fuzzy Functions” approach
technologies, namely in finding the fuzzy models (rules) for each identified pattern [Turksen and Celikyilmaz, 2006], and in inference methods. The new "Fuzzy Functions" approach first clusters the given data into several overlapping fuzzy clusters, each of which is used to define a separate decision rule. Initially, Fuzzy c-means (FCM) clustering was the algorithm utilized in these methods to find the fuzzy partitions. The novelty of the "Fuzzy Functions" approaches is that the similarity of objects is enhanced with additional fuzzy identifiers, viz., membership values and their transformations, by utilizing them as additional predictors, along with the original input variables, to estimate the local relations of the input-output data. Thus, the membership values and a list of their possible (user-defined) transformations are augmented to the original dataset as new predictors to identify the structure of the different datasets of each cluster. Using the dataset of each cluster, local functions are identified to explain the local input-output relationships. These functions are named "Fuzzy Functions", as originally proposed by [Turksen, 2008]. In system modeling approaches where similarities are based on distances between vector objects, membership values play a crucial role in shaping these "Fuzzy Functions" of each cluster [Celikyilmaz and Turksen, 2007a]. Earlier research [Celikyilmaz, 2005; Turksen and Celikyilmaz, 2006] has pointed out the importance of these membership values in fuzzy models. Results from these studies indicate that Fuzzy Functions strategies can minimize the error between the system output and the model output better than traditional FRB approaches. These systems implemented the Fuzzy C-Means (FCM) clustering algorithm [Bezdek, 1981a] for structure identification. In the experiments section, the results of the investigations of performance improvements are presented. In this regard, one might argue that the FCM clustering algorithm may not be the optimum clustering method, because it is not designed to find optimal membership values that can, at the same time, explain the behavior of the input and output variables in local models. In response, in Chapter 3 we introduced a new fuzzy clustering algorithm, namely Improved Fuzzy Clustering (IFC), which achieves two objectives:
• to find a good representation of the partition matrix, which captures the multiple model structures of the given system by identifying hidden patterns,
• to identify, from these "Fuzzy Functions", membership values that improve the representation of the local input-output relationships of each local model.
The IFC method ensures that the membership values can help to predict the relationship between the independent and dependent variables in local structures. As a result of this improvement, the new IFC identifies better membership values for regression and classification (pattern analysis) models, as in (3.30) and (3.38). Recall that the improved fuzzy clustering models can be applied to regression system domains (IFC) or classification system domains (IFC-C). We hypothesize that membership values obtained from the improved clustering algorithms, and possibly their transformations, can increase the prediction power of the "Fuzzy Functions" of each cluster when used together with the original input variables to explain the behavior of the input and output variables in local models. In this sense, the resulting Fuzzy Functions are referred to as "Improved Fuzzy Functions", and this approach is denoted Type-1 Improved Fuzzy Functions, in short T1IFF. Type-1 Improved Fuzzy Functions for classification problem domains is denoted T1IFF-C. The novel T1IFF approach introduces structure identification (training) and inference (reasoning) methods different from those of the Type-1 Fuzzy Functions using FCM clustering, which will be denoted T1FF. The membership value calculation equation of the new IFC or IFC-C requires additional prior information, so the training method of T1IFF differs from that of T1FF. Consequently, the new inference algorithms of T1IFF and T1IFF-C implement a semi-parametric case-based inference method to estimate the membership values of new data vectors, different from the T1FF approach.

Previous fuzzy inference system models, in other words fuzzy system models based on fuzzy rule bases (FRB), require various initial parameters to be set prior to execution. Recent publications [Ishibuchi et al., 1999; Tettamanzi and Tomassini, 2001; Pedrycz and Reformat, 2003; Roland, 2003] have shown that evolutionary algorithms used to optimize the FRB parameters, such as the shape and number of fuzzy rules and the types of the rule base operators mentioned above, have a major impact on the performance of fuzzy systems based on fuzzy rule bases. Similarly, an important characteristic of "Fuzzy Functions" system models is that they require a few key parameters that may affect the performance of a system model. A procedure that learns the Fuzzy Functions system model automatically from data must respect the properties of such parameters. In this work, an evolutionary algorithm is applied to encode these parameters as a hybrid genetic code. Then "Fuzzy Function" models are evolved to capture optimum system parameters based on a stochastic search³ method. Hence, a new Evolutionary Type-1 fuzzy structure identification and inference method is proposed by combining evolutionary algorithms with improved fuzzy functions. The stochastic optimization model within the "Fuzzy Functions" approach reduces the number of
³ Evolutionary algorithms are stochastic search methods because certain steps are based on random choices: the initial population is randomly assigned, the individuals that undergo cross-over or mutation operations are randomly selected, and some cross-over or mutation operations are based on a probabilistic approach. Because the evolutionary process depends on this stochastic process and not solely on the data, two consecutive runs of the same evolutionary method will produce different results. A brief definition of genetic algorithms, a type of evolutionary computation, is presented in Appendix C.3.
iterations needed to find the optimum model, compared to the "Type-1 Improved Fuzzy Functions" approach based on an exhaustive search method to optimize these parameters. The aforementioned Type-1 Fuzzy Functions approach and its variations are summarized in Figure 4.2.
[Figure 4.2: the left inverted triangle shows Type-1 Fuzzy Functions built on FCM, evolving into Evolutionary Type-1 Fuzzy Functions; the right inverted triangle shows Type-1 Improved Fuzzy Functions built on IFC, obtained through the implementation of improved membership values, evolving into Evolutionary Type-1 Improved Fuzzy Functions.]
Fig. 4.2 Evolution of Type-1 Fuzzy Functions Strategies
Firstly, we introduce the Type-1 Fuzzy Function system models based on the FCM clustering algorithm, T1FF (middle part of the inverted triangle on the left side of Figure 4.2). Later, improved Fuzzy Functions system modeling using the novel Improved Fuzzy Clustering (IFC) method, T1IFF, is introduced (middle part of the inverted triangle on the right side of Figure 4.2). To optimize the system parameters of both methods, hybrid fuzzy function methods are applied, entitled Evolutionary Type-1 Fuzzy Functions (ET1FF) and Evolutionary Type-1 Improved Fuzzy Functions (ET1IFF), using an evolutionary algorithm based on a stochastic search method to solve fuzzy data mining problems. In this chapter these four methods are briefly discussed. In the following, we start with the definition of the design principles of the proposed Fuzzy Functions algorithm using FCM clustering for regression (T1FF) and classification models (T1FF-C), separately. Then, the architecture of the training and inference steps of the new T1IFF using the IFC algorithm for regression, and of T1IFF-C using IFC-C for classification problem domains, is discussed. In the last part, a new approach to optimize the parameters of the T1FF and T1IFF approaches using evolutionary algorithms is presented. These methods are denoted ET1FF and ET1IFF for regression domains and ET1FF-C and ET1IFF-C for classification models.
4.3 Proposed Type-1 Fuzzy Functions Approach Using FCM – T1FF

4.3.1 Structure Identification of FF for Regression Models (T1FF)

Fuzzy function systems have two separate components: training and inference algorithms. During the training algorithm, the system model is identified using a training
dataset, randomly selected from the entire dataset, and the parameters are optimized using another randomly selected sample, called the validation dataset. During the inference algorithm, the testing dataset, which is also randomly selected from the entire dataset, is used to measure model performance. A datum can be present in only one of the three partitions. This process is sketched in Figure 4.3. As observed in several articles listed in the references, a common way to optimize the parameters of fuzzy systems is to separate the data into several randomly selected training and testing subsets. The training dataset is used to learn the system structure and optimize the parameters, and the testing dataset is used to measure modeling performance. This process is usually repeated several times with different random training and testing subsets. The overall performance is calculated by averaging the performances of the models applied to the testing samples in these repeated experiments. In this work, a sub-sampling cross validation method, slightly different from standard cross validation methods, is implemented to optimize the system model parameters using a dataset separate from the training set, as shown in Figure 4.3. We use an additional subset, called the validation dataset, which is different from the training and testing datasets. The validation dataset is used to optimize the model parameters obtained from the training dataset. The testing dataset is used neither for training nor for validation: it is only used to measure the performance of the optimum model. This way, we can ensure an unbiased performance estimate for our models. After the optimum fuzzy function model parameters are obtained using the training and validation datasets, as discussed next, the overall performance of the models is measured on a separate testing dataset. In order to measure the robustness of the models, we repeat the above learning and optimization processes several times, using different subsets randomly selected from the entire dataset.
[Figure 4.3: the training dataset is clustered (structure identification with FCM for given m and c), yielding one fuzzy function per cluster; inference on the validation dataset is repeated until maximum performance yields the optimum model parameters; the testing data is then scored with the optimum model using semi-non-parametric inference.]

Fig. 4.3 Modeling with Fuzzy Functions Systems
Multi-input, single-output (MISO) problems are the main interest of this work. We begin with the structure identification of fuzzy function strategies for regression type problems. Later, the same strategy is extended to classification type problems. Let Z(x,y)={(x_1,y_1), (x_2,y_2),…, (x_n,y_n)} represent the input-output dataset, where z(x_k,y_k)∈ℜ^(nv+1) denotes any data point (vector) from the training set. Every data point is composed of an nv-dimensional input vector, x_k=(x_{1,k},…,x_{nv,k})∈ℜ^nv, k=1,…,n, a total of n vectors, and an output, y_k∈ℜ. Z is the (n×(nv+1)) input-output matrix, n is the total number of data vectors, i, i=1,…,c, is the cluster identifier, c is the total number of clusters, and m is the degree of fuzziness (i.e., overlap) parameter of the Fuzzy C-Means (FCM) clustering method. Let μ_ik∈[0,1] represent the membership value of the k-th datum in cluster i. The list of parameters of the training algorithm is then given by:
• the number of clusters of the system model, c, which takes discrete values such as c∈{2,3,…,n^(1/℘)}, 0<℘,
• the degree of fuzziness, m, of the clustering algorithm,
• the α-cut threshold used to filter each cluster's dataset,
• the structure (type) of the fuzzy functions.
The programming steps of T1FF systems for regression models of a MISO domain are displayed in ALGORITHM 4.1.

ALGORITHM 4.1 Training Algorithm for the Type-1 Fuzzy Functions Approach (T1FF) using the standard FCM clustering algorithm

Step 1: Choose the FCM clustering parameters, m≥1.1 (degree of fuzziness), c>1 (the number of clusters), and ε (a termination threshold).

Step 2: Execute FCM using Z(x,y) to find the cluster centers, v_i(x,y), and the interactive (I/O) membership values for the given m and c by

$$\mu_{ik}(x,y)=\left(\sum_{j=1}^{c}\left(\frac{d_{ik}(x,y)}{d_{jk}(x,y)}\right)^{2/(m-1)}\right)^{-1},\quad 1\le i\le c,\ 1\le k\le n,$$

where d_ik(x,y)=‖(x_k,y_k)−v_i(x,y)‖.

Step 3: Find the membership values of the input space using

$$\mu_{ik}(x)=\left(\sum_{j=1}^{c}\left(\frac{d_{ik}(x)}{d_{jk}(x)}\right)^{2/(m-1)}\right)^{-1},\quad 1\le i\le c,\ 1\le k\le n,$$

where d_ik(x)=‖x_k−v_i(x)‖.
Step 4: For each cluster i,
4.1. The membership values of each input data sample, μ_ik, and their user-defined transformations in the ℜ^(nm) space are augmented to the original input space to map the original input matrix onto a new feature space ℜ^(nv+nm), i.e., x→Φ_i(x,μ_i), for each cluster i.
4.2. Estimate the parameters, Ŵ_i={Ŵ_0,…,Ŵ_(nv+nm)}, of the fuzzy function f_i(Φ_i,Ŵ_i) of each cluster.
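To make Steps 2 and 3 concrete, the following is a minimal NumPy sketch of the FCM membership computation; the function name, the array layout, and the assumption that the cluster centers come from a completed FCM run are our own illustrative choices, not part of the original algorithm.

```python
# A sketch of the membership update in Steps 2-3 of ALGORITHM 4.1.
import numpy as np

def fcm_memberships(data, centers, m):
    """FCM membership of each row of `data` in each cluster.

    data:    (n, d) array (input-output rows for Step 2, input-only for Step 3)
    centers: (c, d) array of cluster centers from a completed FCM run
    m:       degree of fuzziness, m > 1
    """
    # d_ik: Euclidean distance of datum k to center i, shape (c, n)
    dist = np.linalg.norm(data[None, :, :] - centers[:, None, :], axis=2)
    dist = np.fmax(dist, np.finfo(float).eps)      # guard against zero distance
    # mu_ik = ( sum_j (d_ik / d_jk)^(2/(m-1)) )^(-1)
    ratio = (dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                 # shape (c, n)
```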
The novel training algorithm in ALGORITHM 4.1 applies the standard FCM clustering method [Bezdek, 1981a] to the input-output data, z(x_k,y_k), k=1,…,n, to generate the membership values μ(x,y) and the cluster centers, v_i(x,y), i=1,…,c. In Step 3, we obtain the membership values corresponding to the input space, μ_i(x), and the cluster centers of the given input space, denoted v_i(x). It should be noted that the membership values are not projections of the clusters onto each input variable, but rather are induced on the whole input space by the fuzzy clusters [Babuska and Verbruggen, 1997]. Due to the interactivity assumption between the input variables, we do not map the membership values onto every input variable; there is one membership value assigned to each input vector to represent the whole input space. In Step 4, a different dataset is formed for each cluster i by using the membership values μ_i(x) and/or their transformations as additional dimensions. This is the same as mapping the original input space, x∈ℜ^(n×nv), of nv input variables, x={x_1,…,x_nv}, onto a higher dimensional feature space ℜ^(n×(nv+nm)), i.e., x→Φ_i(x,μ_i), for each cluster i, i=1,…,c. Each data vector is represented by Φ_ik∈ℜ^(nv+nm), composed of the original input variables of nv dimensions and the membership values and their (nm) possible transformations. The feature space dimension (nv+nm) is determined by the user, and the optimum dimensions are sought by exhaustive search; nm is the number of augmented membership value columns appended to the original input space as new dimensions. Optimum regression function parameters are sought in this new space. In the following sections, with the implementation of evolutionary algorithms, the feature space dimension will be determined automatically based on the system model performance. A special case of this approach is shown in Figure 4.4, using a Toy dataset of n data points with a single input and a single output, where only the membership values are used as an additional dimension, i.e., ℜ^nv→ℜ^(nv+1), nm=1, to form a new dataset in each cluster as follows:
$$\Phi_i(x,\mu_i)\in\Re^{nv+1}=\begin{bmatrix}\mu_{i,1} & x_{1\times 1} & \cdots & x_{nv\times 1}\\ \vdots & \vdots & \ddots & \vdots\\ \mu_{i,p} & x_{1\times p} & \cdots & x_{nv\times p}\end{bmatrix},\quad \mu_{ik}>\alpha\text{-cut},\ 0<p\le n,\ i=1,\dots,c,\ k=1,\dots,p \tag{4.1}$$
For this example, a nonlinear regression function (i.e., support vector regression) is fit for each cluster in (nv+1+1) space, i.e., with one additional dimension for the intercept. A prominent feature of the fuzzy functions approaches is that, if the relations between the input variables and the output cannot be defined in the original space, we can explain their relationship in the ℜ^(nv+nm) space using the additional information induced by the membership values and their transformations. Note from (4.1) that, for each cluster, a different subset is obtained by the following constraint:
μ_ik > α-cut, i=1,…,c, k=1,…,p    (4.2)
Using an alpha-cut value, α-cut>0, we eliminate vectors from each cluster that do not affect the decision surfaces, viz., those far away from the cluster centers. The α-cut in general eliminates anomalies generated by the clustering algorithms. In the experiments we usually use α-cut=0 when the number of data points in a cluster falls below n/(#clusters). The training algorithm thus focuses on vectors that can support the decision surfaces; in other words, the data points that can explain the local input-output relationships with less error are retained to find the fuzzy functions. Hence, for each local model i, the functions are shaped using different datasets, Φ_i(x,μ_i), whose dimensions are determined by the user. Again, the optimum values of the parameters that define the dimensions of each cluster's dataset, viz., nm and α-cut, can be determined automatically by the proposed evolutionary fuzzy functions, to be discussed in the following sections of this chapter. A sketch of this per-cluster dataset construction and function fitting follows.
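Continuing the sketch above, Step 4 with the α-cut filter of (4.1)-(4.2) might look as follows when LSE is used for the local functions; fit_fuzzy_functions and the single membership transformation (nm=1) are illustrative assumptions.

```python
# A sketch of Step 4 / equations (4.1)-(4.2): per-cluster datasets and LSE
# fuzzy functions. `fcm_memberships` is the helper sketched above.
import numpy as np

def fit_fuzzy_functions(x, y, centers_x, m, alpha_cut=0.0):
    """Fit one linear fuzzy function per cluster on alpha-cut filtered data."""
    mu = fcm_memberships(x, centers_x, m)          # (c, n) input-space memberships
    coeffs = []
    for i in range(len(centers_x)):
        keep = mu[i] > alpha_cut                   # eq. (4.2): drop far-away vectors
        # eq. (4.1): augment the inputs with the cluster's membership values
        # (nm = 1 here; transformations such as mu**2 or exp(mu) could be added)
        Phi = np.column_stack([np.ones(keep.sum()),   # intercept column
                               mu[i, keep],           # membership dimension
                               x[keep]])
        W, *_ = np.linalg.lstsq(Phi, y[keep], rcond=None)
        coeffs.append(W)
    return mu, coeffs
```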
Fig. 4.4 Mapping the Input-Output space onto individual clusters using a one dimensional Toy Example. Support Vector Regression is used to find the decision surfaces for two clusters in (1+1) space, i.e., only membership values are used as additional dimensions.
In Step 4.2 of ALGORITHM 4.1, one regression function f(Φ_i,Ŵ_i) is identified for each cluster i, i=1,…,c. These functions are called the "Fuzzy Functions" of type-1 [Turksen, 2008; Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007a]. One can map the original input matrix onto a higher dimension by using various forms of the membership values as additional dimensions, depending on the non-linearity
of the dataset. Where appropriate, we usually use mathematical transformations of the membership values such as μ_ik², μ_ik^m, exp(μ_ik), ln((1−μ_ik)/μ_ik), etc., where m represents the degree of fuzziness. Our extensive research indicates that the exponential and various logarithmic transformations of the membership values can improve the performance of the system models better than several others. However, one should try as many forms of the fuzzy membership values as possible to improve performance. In addition, one can use any regression function type. In this work, the simple least squares estimation (LSE) and the more sophisticated support vector regression (SVR) [Smola, 1996; Gunn, 1998] methods are implemented using different kernel functions to estimate the fuzzy function parameters of each cluster, f(Φ_i,Ŵ_i), where Ŵ_i={Ŵ_{i,0}, Ŵ_{i,1},…,Ŵ_{i,(nv+nm)}} are the parameters of these functions for each cluster i. When LSE is used to find the parameters of the fuzzy functions, Ŵ_i represents the linear regression parameters as follows:
$$f(\Phi_i,\hat{W}_i)=\hat{W}_{0,i}+\hat{W}_{1,i}\,\mu_i+\cdots+\hat{W}_{nm,i}\,\mu_i^{p>0}+\hat{W}_{nm+1,i}\,x_1+\cdots+\hat{W}_{nm+nv,i}\,x_{nv} \tag{4.3}$$
Each Φ_i(x,μ_i)∈ℜ^(nv+nm) represents the input space, which is composed of a column vector of 1's for the intercept, the membership values of the i-th cluster and their possible transformations as nm separate columns, and the input matrix x (n×nv) of the nv variables with n data vectors. On the other hand, if support vector regression (SVR) is used to approximate the
regression functions of the local models instead of the LSE approach, Ŵ_i represents a more complex structure, including the Lagrange multipliers of the support vectors obtained from each fuzzy partition and the selected support training vectors of each cluster. Hence, the SVR optimization model for each fuzzy function of each cluster is represented by:
$$\max L_i=-\tfrac{1}{2}\sum_{k,l=1}^{n}(\alpha_{ik}^{*}-\alpha_{ik})(\alpha_{il}^{*}-\alpha_{il})\,K\!\left(\Phi(x_k,\mu_{ik}),\Phi(x_l,\mu_{il})\right)-\varepsilon\sum_{k=1}^{n}(\alpha_{ik}^{*}+\alpha_{ik})+\sum_{k=1}^{n}(\alpha_{ik}^{*}-\alpha_{ik})\,y_k$$

$$\text{subject to}\quad \sum_{k=1}^{n}(\alpha_{ik}^{*}-\alpha_{ik})=0,\quad \alpha_{ik},\alpha_{ik}^{*}\in[0,C_{reg}],\quad k,l=1,\dots,n,\ C_{reg}>0,\ i=1,\dots,c \tag{4.4}$$
In (4.4), the SVR optimization algorithm [Smola, 1996; Gunn, 1998] is applied separately to each cluster i to find the support vectors and their corresponding Lagrange multipliers, (α_ik, α_ik*), k=1,…,n. Some data vectors are eliminated from the training dataset of each cluster based on the criterion in (4.2), viz., only vectors with μ_ik>α-cut are retained, before the SVR model execution. In that case the total number of vectors in each cluster is less than the total number of training vectors. When SVR is used to estimate the function parameters, each fuzzy function of the system model is represented by:
$$f_i(\Phi_i)=\sum_{k=1}^{n}(\alpha_{ik}-\alpha_{ik}^{*})\,K\!\left(\Phi(x_k,\mu_{ik}),\Phi(x,\mu_i)\right)+b_i \tag{4.5}$$
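As an illustration of how (4.4)-(4.5) can be realized without a hand-written dual solver, the sketch below fits one SVR per cluster with scikit-learn; the function name and default parameter values are illustrative assumptions, not the authors' implementation.

```python
# A sketch of per-cluster SVR fuzzy functions (equations (4.4)-(4.5)).
import numpy as np
from sklearn.svm import SVR

def fit_svr_fuzzy_functions(x, y, mu, alpha_cut=0.0, C_reg=10.0, eps=0.1):
    models = []
    for i in range(mu.shape[0]):
        keep = mu[i] > alpha_cut                   # same alpha-cut filter as (4.2)
        Phi = np.column_stack([mu[i, keep], x[keep]])   # membership-augmented inputs
        # The RBF kernel plays the role of K(.,.) in (4.4); C_reg and eps are the
        # regularization constant and the epsilon-insensitive margin.
        models.append(SVR(kernel="rbf", C=C_reg, epsilon=eps).fit(Phi, y[keep]))
    return models
```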
The number of initial parameters that one should determine prior to the execution of the T1FF method changes based on the function approximation type used. If one uses LSE to approximate the fuzzy function parameters, then there are only three initial parameters: m (degree of fuzziness) and c (the number of clusters) from the FCM clustering, and the structure of the fuzzy functions. If SVR is used to model the fuzzy functions, then on top of those three initial parameters, four additional ones, i.e., the C-regularization constant (C_reg), the ε-margin for epsilon-insensitive regression, the kernel type, and its parameters for non-linear SVR, must also be determined by the expert. Therefore, optimizing the T1FF approach with SVR takes longer than with LSE. One can use any other regression method, such as kernel regression or weighted least squares regression, to estimate the fuzzy function parameters. This work makes use of only two methods, a simple and a more complicated learning method, to demonstrate the implementation of the novel T1FF methodologies. Depending on the structure of the dataset at hand, one should choose an appropriate function estimation method. Recall that the structure identification of T1FF system models requires fewer operators than FRB systems. One reason is that the list of operations and operators, from fuzzification, aggregation of input fuzzy sets, implication, aggregation of output fuzzy sets, to the final defuzzification step, is replaced with easily interpretable steps, summarized as follows:
• clustering the given dataset and finding the type-1 membership values for each cluster,
• forming sub-datasets for each cluster by using the data points with membership values μ_ik>α-cut, and adding additional predictors built from the membership values and their transformations to the input vectors,
• estimating one function, linear or non-linear, per cluster.
The challenge with T1FF systems, as with FRB systems, is that one must apply a search method, such as exhaustive search, to optimize each parameter. Even though an optimum model is certain to be found when an exhaustive search is applied with high precision, the search takes time, which may not be desirable for many system domains. In this work, we apply an evolutionary algorithm to optimize the parameters of T1FF systems. The parameters of the T1FF system that are optimized by the evolutionary method, ET1FF, are the degree of fuzziness and the number of clusters, which shape the membership values, and the α-cut and the structure of the fuzzy functions and their parameters, which shape the local system model structures. The aim of applying evolutionary algorithms to optimize the parameters of fuzzy function strategies is to reduce the number of iterations it takes to optimize them.
4.3.2 Structure Identification of the Fuzzy Functions for Classification Models (T1FF-C)

The structure identification of T1FF systems for classification problems, T1FF-C, is similar to that of T1FF systems for regression problems. The only difference is that the output variable in classification problems has a discrete domain, y={0,1,2,…}, whereas in regression/prediction domains the output variable has a continuous domain, y∈ℜ. Thus the fuzzy function types used to approximate the decision surfaces are different in T1FF-C than in T1FF systems. In this work we only deal with binary classification problems; hence, the output variable takes on only dichotomous values, e.g., y={−1,+1}. During the reasoning (inference) method, we obtain the posterior probabilities of the data vectors as a result of the classification models and make decisions with these model posterior probabilities based on an optimum threshold value. The optimum threshold is captured during sub-sampling cross validation and is used to measure the recognition performance on the testing data vectors. Among many different classification functions, we use a linear classifier and a sophisticated non-linear classifier to approximate the fuzzy classifier functions and obtain the posterior probabilities of each input vector in each local structure. Hence, in this section we briefly review these well-known classifiers and the calculation of the posterior probabilities of the dataset instances.

Structure Identification of T1FF-C Using Posterior Probabilities from Logistic Regression (LR)

Given a training dataset of n samples, Z(x,y)={(x_1,y_1),…,(x_n,y_n)}, the binary output is represented by y_k∈{0,1}. Firstly, the FCM clustering algorithm is applied to the training dataset to obtain the membership values and cluster centers, as in Steps 1-3 of ALGORITHM 4.1. Then, the training data is mapped onto a feature space, x→Φ_i(x,μ_i), for each cluster i, i=1,…,c. Each new input matrix Φ_i(x,μ_i) of a cluster i, such as in (4.1), comprises the original input variables and the membership values of the corresponding cluster i and their transformations. We also eliminate data vectors from each cluster based on the criterion μ_ik>α-cut≥0. At this point, instead of building a linear regression model for each cluster's dataset Φ_i(x,μ_i), a logistic regression (LR) function is applied. LR estimates a posterior probability, P̂_i(y_k=1|Φ_ik), for each vector x_k, indicating the probability of datum k belonging to the class of positives, y_k=1. The posterior probability of any x_k in each cluster i is measured by:

$$\hat{P}_i(y_k=1\,|\,\Phi_i(x_k,\mu_i))=1\,/\,\left(1+\exp\!\left(-\left(\hat{W}_{0,i}+\hat{W}_i^{T}\Phi_i\right)\right)\right) \tag{4.6}$$

Ŵ_i={Ŵ_{0,i},…,Ŵ_{(nv+nm),i}} is the list of coefficients of the logistic regression function of cluster i, where Ŵ_{0,i} refers to the intercept and Ŵ_{j,i}, j=1,…,nv+nm, are the coefficients of the input variables, which are composed of the original input variables and the membership values of cluster i and their transformations. The complementary probability is P̂_i(y_k=0|Φ_ik)=1−P̂_i(y_k=1|Φ_ik). The parameters Ŵ_{j,i} and Ŵ_{0,i} are estimated by maximum likelihood.
Structure Identification of T1FF-C Using Posterior Probabilities from Support Vector Classification (SVC)
If support vector classification (SVC) is used to estimate the fuzzy classifier function parameters, the output values of the data vectors are calculated by

$$f_i(\Phi_i)=\sum_{k}\beta_{ik}\,y_k\,K(\Phi_{ik},\Phi_i)+b_i,\quad i=1,\dots,c,\ k=1,\dots,n \tag{4.7}$$
In (4.7), K(·) is the kernel function used to map the original data vectors onto a higher dimensional space, either linearly or non-linearly. In this work, we analyzed two different kernel functions: the linear kernel, K(x_k,x_l)=x_k^T x_l, and a probabilistic kernel, i.e., the radial basis function (RBF), K(x_k,x_j)=exp{−δ‖x_k−x_j‖²} (δ>0). The β_ik in (4.7) are the Lagrange multipliers, one for each training data vector, which are introduced to solve the following SVC optimization problem for each local model i (cluster):
$$\max Q_i=\sum_{k=1}^{n}\beta_{ik}-\tfrac{1}{2}\sum_{k,l=1}^{n}\beta_{ik}\beta_{il}\,y_k y_l\,K\!\left(\Phi_i(x_k,\mu_{ik}),\Phi_i(x_l,\mu_{il})\right)$$

$$\text{s.t.}\quad \sum_{k=1}^{n}\beta_{ik}\,y_k=0,\quad 0\le\beta_{ik}\le C_{reg},\quad k,l=1,\dots,n,\ i=1,\dots,c \tag{4.8}$$
where C_reg is the regularization constant that balances the complexity of the machine and the number of separable data vectors. Training vectors with non-zero coefficients (Lagrange multipliers), β_ik>0, are called the "support vectors" of cluster i. The fewer the support vectors, the better the generalization capacity of such models. A brief summary of support vector machines for classification (SVC) is given in Appendix C.2. During the learning phase of the proposed T1FF-C method, one fuzzy classifier function is identified for each cluster obtained from the FCM clustering algorithm. The output value obtained from each SVC fuzzy function is a scalar value that one needs to transform into a posterior probability. We measure the posterior probabilities using the improved Platt's probability method [Platt, 2000; Lin et al., 2003], which is based on the approximation of the model output labels with a sigmoid function as follows:

$$\hat{P}_{ik}(y_k=1\,|\,f_i(\Phi_i))=\left(1+\exp\!\left(a_1 f_i(\Phi_i)+a_2\right)\right)^{-1} \tag{4.9}$$
where a1 and a2 are found by minimizing the log-likelihood of training data:
$$\min_{a_1,a_2}\;-\sum_{k=1}^{n}\left(\frac{y_k+1}{2}\log(\hat{P}_{ik})+\left(1-\frac{y_k+1}{2}\right)\log(1-\hat{P}_{ik})\right) \tag{4.10}$$
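A minimal sketch of the sigmoid fit in (4.9)-(4.10) follows, using SciPy's general-purpose optimizer in place of the specific routine of [Platt, 2000; Lin et al., 2003]; the function name and starting point are illustrative assumptions.

```python
# A sketch of Platt's sigmoid fit, equations (4.9)-(4.10): map the decision
# values f of one fuzzy classifier function to posterior probabilities.
import numpy as np
from scipy.optimize import minimize

def platt_fit(f, y):
    """f: decision values of one cluster's SVC fuzzy function; y: labels in {-1,+1}."""
    t = (y + 1.0) / 2.0                            # targets (y_k + 1)/2 as in (4.10)

    def neg_log_lik(a):
        p = 1.0 / (1.0 + np.exp(a[0] * f + a[1]))  # eq. (4.9)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)         # numerical safety
        return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

    res = minimize(neg_log_lik, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
    return res.x                                   # fitted (a1, a2)
```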
The output values obtained from the SVC fuzzy functions are thus not represented as class labels, e.g., y_k={−1 or +1}, but rather as estimated posterior probabilities. The optimum threshold value of a specific T1FF-C model will be
determined based on the estimated posterior probabilities obtained from the validation dataset during structure identification. The optimum threshold values obtained during the structure identification are used to score testing datasets during performance evaluation of each T1FF-C model. The algorithm to obtain the optimum threshold value will be introduced in the next section in Algorithm 4.3.
4.3.3 Inference Mechanism of T1FF for Regression Models

In supervised learning strategies, a common method for inference is to implement a case-based (instance-based) reasoning methodology. In case-based reasoning, knowledge is represented in terms of specific cases or experiences, and the reasoning relies on flexible matching methods to retrieve these cases and apply them to new situations. In a very simple case-based reasoner, some descriptive nearest neighbors, according to some distance measure, are captured during the training algorithm to be used later for reasoning. A recent addition to case-based reasoning algorithms is Support Vector Machines (SVM)⁴ [Vapnik, 1995]. In SVM, only the cases that define the interclass boundaries in a newly transformed feature space are retained after learning; these become the support vectors upon which inference occurs. The proposed inference method uses selected training cases to infer the output values of new cases.

In standard exhaustive search methods, the parameters of an optimum model are sought by iteratively executing the same methodology for different values of the parameters. In this work, structure identification to learn the system models is applied on the training dataset. The inference method is used to calculate the performance of the models captured during structure identification on the provided datasets, i.e., the validation or testing datasets. The validation dataset is used to optimize the model parameters during sub-sampling cross validation, which is a part of the structure identification method. The optimum model is obtained by exhaustive search; its selection is based on the best performance obtained from the inference (reasoning) methodology. Then, using the parameters of the optimum model, we test the generalization capacity of the model with a testing dataset.

Let the validation dataset be represented as X^v={x_1^v,x_2^v,…,x_ndv^v}, where every k-th data vector is composed of an nv-dimensional input vector, x_k^v=(x_{1,k}^v,…,x_{nv,k}^v)∈ℜ^nv, and an output which we do not know beforehand, y_k^v∈ℜ. X^v is the (ndv×nv) input matrix, ndv is the total number of validation data vectors, i, i=1,…,c, is the cluster identifier, c is the total number of clusters, and m is the degree of fuzziness parameter of the Fuzzy C-Means (FCM) clustering method. Similarly, the testing dataset is represented by X^test={x_1^test,x_2^test,…,x_nte^test}, where every testing data vector is composed of an nv-dimensional vector, x_k^test=(x_{1,k}^test,…,x_{nv,k}^test)∈ℜ^nv, and an output variable, y_k^test∈ℜ. X^test is the (nte×nv) input matrix, and nte is the total number of testing data vectors. In the Fuzzy Functions reasoning mechanism, one wants to obtain a crisp output value from the approximated model, and the estimated output values, along with the actual output values of the corresponding vectors, are used to
⁴ A background on Support Vector Machines for regression and classification models is presented in Appendix C.2.
calculate the performance of the corresponding model, as shown in ALGORITHM 4.2. Here, we use the validation dataset to explain the new inference mechanism.

ALGORITHM 4.2 Inference Algorithm of the Type-1 Fuzzy Functions Approach (T1FF) using the standard FCM clustering algorithm

Step 1: Find the input membership values of each validation sample, x_k^v, k=1,…,ndv, using

$$\mu_{ik}^{v}=\left(\sum_{j=1}^{c}\left(\frac{d_{ik}^{v}}{d_{jk}^{v}}\right)^{2/(m-1)}\right)^{-1},\quad i=1,\dots,c,\ k=1,\dots,ndv,$$

where d_ik^v=‖x_k^v−v_i(x)‖.

Step 2: The membership values of the validation data, μ_ik^v, and their user-defined transformations optimized during the training algorithm are used to map the original validation data onto each cluster, x^v→Φ_i(x^v,μ_i^v), in the ℜ^(nv+nm) space.

Step 3: Infer the output values of the new data vectors using the fuzzy function parameters, ŷ_ik=f(Φ_ik,Ŵ_i), using equation (4.3) or (4.5) depending on the type of the model.

Step 4: Find a single output value for each validation data sample by weighting the inferred fuzzy output values from each cluster with their corresponding membership values:

$$\hat{y}_k=\frac{\sum_{i=1}^{c}\hat{y}_{ik}\,\mu_{ik}}{\sum_{i=1}^{c}\mu_{ik}},\quad i=1,\dots,c,\ k=1,\dots,ndv$$
The first step of the inference algorithm of the T1FF system, as shown in ALGORITHM 4.2, is fuzzification. Here, the membership value of each validation data sample, x^v, is calculated using the cluster centers, v(x)={v_1(x),…,v_c(x)}∈ℜ^(c×nv), obtained from the standard FCM clustering algorithm for the given m and c values. This is interpreted as finding the membership values of the testing cases in each cluster identified during the learning stage. These membership values are used as additional dimensions of the original validation input matrix. Hence, the original validation matrix is mapped onto the user-defined feature space determined in the training algorithm, i.e., x^v→Φ_i(x^v,μ_i)∈ℜ^(nv+nm), to form the dataset of each cluster. The fuzzy functions of each cluster i, f(Φ_i,Ŵ_i), obtained from the training algorithm, are used to infer the output values of the validation samples. Hence, for each cluster, one fuzzy model output, ŷ_i=f(Φ_i), is obtained for each data point in the feature space. In order to obtain a crisp output, each fuzzy output is weighted with the membership values using the fuzzy weighted average formula as follows:
$$\hat{y}_k=\frac{\sum_{i=1}^{c}\hat{y}_{ik}\,\mu_{ik}^{v}}{\sum_{i=1}^{c}\mu_{ik}^{v}} \tag{4.11}$$
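The inference mechanism of ALGORITHM 4.2 and equation (4.11) can be sketched as follows, reusing the hypothetical helpers introduced earlier (fcm_memberships and the per-cluster LSE coefficients); all names remain illustrative.

```python
# A sketch of ALGORITHM 4.2 / equation (4.11): crisp outputs for new vectors.
import numpy as np

def t1ff_predict(x_new, centers_x, coeffs, m):
    mu = fcm_memberships(x_new, centers_x, m)      # Step 1: fuzzify the new vectors
    outputs = []
    for i, W in enumerate(coeffs):                 # Steps 2-3: per-cluster outputs
        Phi = np.column_stack([np.ones(len(x_new)), mu[i], x_new])
        outputs.append(Phi @ W)
    Y = np.vstack(outputs)                         # shape (c, n_new)
    return (Y * mu).sum(axis=0) / mu.sum(axis=0)   # Step 4: eq. (4.11)
```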
4.3.4 Inference Mechanism of T1FF for Classification Models

The proposed inference mechanism of the T1FF-C approach for classification problems is similar to the inference mechanism of the T1FF systems for regression problems. The only difference is in Step 4 of ALGORITHM 4.2, where decision
functions are used to obtain the estimated output values. Here, the posterior probabilities of each data sample k, x_k, in each cluster i, P̂_ik(y_k=1|f_i(Φ_i)), are obtained from the fuzzy functions using (4.6) or {(4.7) and (4.9)}, depending on the model type used in the training algorithm. An estimated crisp posterior probability is obtained for each verification vector by the fuzzy weighted average method as follows:
$$\hat{P}_k=\frac{\sum_{i=1}^{c}\hat{P}_{ik}\,\mu_{ik}}{\sum_{i=1}^{c}\mu_{ik}} \tag{4.12}$$
The objective of building classification models is to predict a class label, ŷ_k=(−1) or ŷ_k=(+1), for each data object, representing one of the two dichotomous outputs. Based on the estimated class labels, we can measure the performance of a model in terms of accuracy. However, the output values we obtain from the T1FF-C inference method are posterior probabilities, as in (4.12). Some performance measures, such as accuracy (to be discussed in the chapter on experiments), require estimated class labels instead of probabilities. In order to obtain a crisp output label such as ŷ_k=(−1) or ŷ_k=(+1), one needs to apply a discretization method, which requires determining a threshold value, so that data vectors with estimated probabilities above the threshold are assigned to the class ŷ_k=(+1), and otherwise to the class ŷ_k=(−1). The threshold is especially required to measure the performance on the testing dataset using the area under the receiver operating curve analysis in the experiments chapter, Chapter 6. After the estimated posterior probabilities are obtained from the structure identification of the T1FF-C method, the model class labels are determined based on the optimum threshold value (OTV). The OTV is captured using a threshold determination method during the structure identification stage, as shown in ALGORITHM 4.3. The threshold value obtained during training is then used to measure the accuracy on the testing data. The recognition performance (RP), i.e., the accuracy measure in ALGORITHM 4.3, is a probabilistic measure that captures the correctly classified samples as a ratio of the total samples in the dataset. The threshold at the point where the RP is optimum is selected as the optimum threshold value (OTV). The optimum threshold value is an additional parameter optimized during the structure identification of the proposed T1FF-C method, and it is retained after structure identification to discretize the testing posterior probabilities and measure the model's performance where necessary. In conclusion of this section, we hypothesize that the optimum membership values estimated from the standard FCM clustering method [Bezdek, 1981a] and their possible transformations will be among the optimum input variables of the T1FF or T1FF-C strategies. Our research on empirical datasets [Turksen, 2008; Celikyilmaz, 2005; Turksen and Celikyilmaz, 2006; Celikyilmaz and Turksen, 2007a] indicated that the T1FF model performances using these membership values can be improved if one can find better membership functions. Therefore, as a first step, a new approach is proposed to enhance the accuracy and generalization capability of the T1FF models using the proposed Improved Fuzzy Clustering (IFC) approach. Later, we present a hybrid fuzzy functions method that can further reduce some of the challenges of the T1FF methods, using other optimization methods that can search for global optimum solutions.
ALGORITHM 4.3 Threshold determination for classification problem domain
Step 1: Create a two-column dataset of the actual class labels of the training data samples, y_k∈{0,1}, and the posterior probabilities predicted by the T1FF-C model, P̂_k(y_k=1)∈[0,1], using the training dataset. An example is shown below:

    actual y_k    predicted probability P̂_k
    1             0.95
    0             0.28
    0             0.36
    1             0.58
    0             0.57
    1             0.87
Step 2: Sort the above dataset in descending order of the second column, the predicted probabilities P̂_k. Assign the probability of the first data vector as the optimum threshold value (OTV), set ŷ_{k=1}^{t=1}=1 and ŷ_{k>1}^{t=1}=0, and measure the recognition performance (RP_{t=1}), e.g., the total number of correctly classified points, using threshold_{t=1}.

Step 3: Iterate over t=2,…,n−1:
- Assign threshold_t = P̂_{k=t}.
- For k=1,…,n: if P̂_k ≥ threshold_t, then assign the label ŷ_k^t=1; otherwise ŷ_k^t=0.
- Measure the recognition performance (RP_t) using threshold_t.
- If RP_t > RP_{t−1}, set OTV = threshold_t; otherwise continue.

    actual y_k    predicted probability P̂_k    ŷ_k^{t=1}    ŷ_k^{t=2}    ŷ_k^{t=3}
    1             0.95                          1            1            1
    1             0.87                          0            1            1
    1             0.58                          0            0            1
    0             0.57                          0            0            0
    0             0.36                          0            0            0
    0             0.28                          0            0            0
    accuracy ⇒                                  67%          84%          100%
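The threshold sweep of ALGORITHM 4.3 amounts to trying each sorted posterior probability as a candidate threshold and keeping the best one; a sketch follows, with the ≥ comparison chosen to match the worked table above, and the function name an illustrative assumption.

```python
# A sketch of ALGORITHM 4.3: optimum threshold value (OTV) selection.
import numpy as np

def optimum_threshold(y_true, p_hat):
    """y_true: labels in {0,1}; p_hat: estimated posterior probabilities."""
    order = np.argsort(-p_hat)                     # Step 2: sort descending
    y, p = y_true[order], p_hat[order]
    best_rp, otv = -1.0, p[0]
    for t in p:                                    # Steps 2-3: candidate thresholds
        rp = np.mean((p >= t).astype(int) == y)    # recognition performance (accuracy)
        if rp > best_rp:
            best_rp, otv = rp, t
    return otv, best_rp
```

On the six-point example above, the sweep selects 0.58 as the OTV, at which all six points are classified correctly.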
In Chapter 3, the novel Improved Fuzzy Clustering (IFC) algorithm is presented and its advantages for the Fuzzy Functions approaches are investigated using artificial datasets. The following section introduces the new training and inference algorithms of the Fuzzy Function systems implementing the IFC algorithm, namely the Type-1 Improved Fuzzy Functions (T1IFF) system. The hypothesis is that membership values obtained from the IFC can identify the behavior of the input-output variables in local models better than membership values obtained from the standard FCM clustering algorithm. The extension of the T1IFF approach to classification problems, denoted T1IFF-C, is also presented in the next section.
4.4 Proposed Type-1 Improved Fuzzy Functions with IFC – T1IFF

This section presents a new fuzzy system architecture with improved fuzzy functions, T1IFF, for regression and, as T1IFF-C, for classification type problems. The novel T1IFF applies the proposed IFC algorithm during the learning stage to approximate improved membership values. These membership values and their transformations are then used as additional predictors of the fuzzy regression functions of the system model, along with the candidate input variables. The reasoning mechanism is based on a new semi-parametric case-based reasoning approach. The structures of T1IFF and the previous T1FF systems are similar to each other; however, the training and inference algorithms change due to the new IFC method. The general framework of the proposed T1IFF is depicted in Figure 4.5 and Figure 4.6. The next sub-sections present the learning and reasoning mechanisms of the T1IFF and T1IFF-C strategies.
Fig. 4.5 Type-1 Improved Fuzzy Functions Framework based on sub-sampling cross validation method
4.4.1 Structure Identification of T1IFF for Regression Models

The proposed Improved Fuzzy Clustering (IFC) optimization method searches for optimum membership values, which are later used as additional predictors to estimate the parameters of the local fuzzy functions of a given system model. For regression problems, these fuzzy functions can be built using any type of function estimation method, e.g., LSE, SVR, etc. The structures of the functions to be approximated depend on the structure of the membership values; one should choose appropriate transformations of the membership values to identify improved fuzzy functions. The first step of T1IFF is to capture the hidden patterns in the given dataset and to obtain improved membership values using the IFC method. For the reader's convenience, we re-list the output values obtained from the IFC algorithm for given m and c values as follows:
• ŵ_i: the optimum parameters of the interim fuzzy functions h_i(τ_i|ŵ_i) of each cluster, i=1,…,c, captured from the last iteration step of IFC,
• the structure of the interim input matrix, τ_i, viz., the list of the different types of optimum improved membership value transformations used to approximate each interim fuzzy function, h_i(τ_i|ŵ_i), during IFC:
$$\tau_i=\begin{bmatrix}\mu_{i,1}^{imp} & (\mu_{i,1}^{imp})^{p\neq 0} & \cdots & e^{(\mu_{i,1}^{imp})^{p\neq 0}}\\ \mu_{i,2}^{imp} & (\mu_{i,2}^{imp})^{p\neq 0} & \cdots & e^{(\mu_{i,2}^{imp})^{p\neq 0}}\\ \vdots & \vdots & \ddots & \vdots\\ \mu_{i,n}^{imp} & (\mu_{i,n}^{imp})^{p\neq 0} & \cdots & e^{(\mu_{i,n}^{imp})^{p\neq 0}}\end{bmatrix} \tag{4.13}$$
• the optimized membership value matrix, U^imp(x,y), and the cluster centers v^imp(x,y), where (^imp) indicates optimum results from the new Improved Fuzzy Clustering algorithm.

It should be recalled that the interim input matrix τ_i is composed of the improved membership values and their potential transformations, as shown in (4.13). τ_i is obtained from the IFC clustering algorithm, whose objective function contains both the classical FCM clustering term and the added "interim fuzzy function" error estimation term for h_i(τ_i,ŵ_i). It should be noted that the interim fuzzy functions contain only the membership values and their transformations in structures such as (4.13); the original input variables are not part of the interim fuzzy functions h_i(τ_i,ŵ_i), where ŵ_i are the coefficients of the function. The τ_i matrix considers all possible transformations of the membership values obtained from the IFC that may be considered in identifying the optimum fuzzy function parameters, which we usually call the "system Fuzzy Functions", "primary Fuzzy Functions", or simply "Fuzzy Functions", to differentiate them from the "interim Fuzzy Functions". The system "Fuzzy Functions" are identified by using the improved membership values obtained from IFC and their transformations, τ_i, as additional inputs along with the original input variables, x_k. We denote the new matrix Φ_i(x,τ_i) and the local system "Fuzzy Functions" that use this matrix f_i(Φ_i,Ŵ_i), whose parameters are denoted with capital Ŵ_i. We also retain the cluster center vectors of the optimum model for later use in inference.

Proposed Learning Schema of Type-1 Improved Fuzzy Functions Approach
The structure identification steps of the proposed approach, as shown in Figure 4.6, are briefly explained in ALGORITHM 4.4.
Fig. 4.6 Structure of the T1IFF approach. (Top) Structure Identification with Improved Fuzzy Functions, (Bottom) Improved fuzzy function i.
Clustering Using Proposed IFC
The steps for applying the IFC method in the T1IFF system are shown in Figure 4.7. A user-defined fuzzy or non-fuzzy clustering algorithm, usually the Fuzzy C-Means (FCM) clustering algorithm [Bezdek, 1981a], is applied to capture the initial cluster prototypes and the partition matrix of the membership values. Then, the input-output dataset is re-clustered using Improved Fuzzy Clustering (IFC) to capture the interactive improved membership values, μ_i^imp(x,y), i=1,…,c, and the cluster centers, v_i(x_1,…,x_nv,y), for c clusters and user-defined fuzzy function structures, τ_i, such as in (3.29). Next, the membership values corresponding to the input space, μ_i^imp(x), are calculated using the input matrix, x, and the cluster centers of the input vectors, v_i^imp(x_1,…,x_nv).
Fig. 4.7 Implementation of Improved Fuzzy Clustering in T1IFF
ALGORITHM 4.4 Proposed Training Algorithm of the Type-1 Improved Fuzzy Functions Approach using the Improved Fuzzy Clustering (IFC) algorithm

Step 1: Compute an initial fuzzy partition, U^init, of the input-output space Z(x,y) using a clustering method of the user's choice; typically FCM is used.

Step 2: Execute the Improved Fuzzy Clustering (IFC) algorithm of ALGORITHM 3.4 as follows: use the membership values U^init as initial parameters to find the improved partition matrix U^imp(x,y) of the training samples, the cluster centers v(x,y), and the optimum parameters, ŵ_i, of the regression functions h_i(τ_i|ŵ_i) for the given m and c values, while minimizing the IFC objective function (equation (3.26) in Chapter 3):

$$J_m^{IFC}=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}d_{ik}^{2}+\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}\left(y_k-h_i(\tau_{ik},\hat{w}_i)\right)^{2}$$

Step 3: Find the improved membership values of the input space using the membership function (equation (3.30) from Chapter 3):

$$\mu_{ik}^{imp}(x,\tau_i)=\left(\sum_{j=1}^{c}\left(\frac{d_{ik}^{2}+\left(y_k-h_i(\tau_{ik},\hat{w}_i)\right)^{2}}{d_{jk}^{2}+\left(y_k-h_j(\tau_{jk},\hat{w}_j)\right)^{2}}\right)^{1/(m-1)}\right)^{-1},\quad 1\le i,j\le c,\ 1\le k\le n,$$

where d_ik=‖x_k−v_i^imp(x)‖ and h(τ_i,ŵ_i), using the improved membership values μ_i^imp, is obtained from the results of the IFC method in Step 2.

Step 4: Repeat for each cluster i, for τ_i(μ_i^imp)∈U^imp:
4.1. Using the improved membership values of each training input data sample, μ_ik^imp, and their user-defined transformations (a total of nm new variables in the ℜ^(nm) space), τ_i, map the original input space onto each cluster's feature space, ℜ^(nv+nm), i.e., x→Φ_i(x,τ_i).
4.2. Estimate the regression parameters Ŵ_i of the fuzzy functions f_i(Φ_i,Ŵ_i) for each cluster i.
4.3. Infer the output values of the data vectors using the fuzzy functions: ŷ_i=f_i(Φ_i(x,τ_i),Ŵ_i).

Step 5: Find a single output value by weighting the inferred output values from each cluster with their corresponding membership values:

$$\hat{y}=\frac{\sum_{i=1}^{c}\mu_i^{imp}\,\hat{y}_i}{\sum_{i=1}^{c}\mu_i^{imp}}$$
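Step 3 of ALGORITHM 4.4, equation (3.30), might be computed as in the following sketch; h_vals is assumed to hold the interim fuzzy function outputs h_i(τ_ik, ŵ_i) from Step 2, and all names are illustrative.

```python
# A sketch of the improved membership update, equation (3.30): distances and
# interim-function errors jointly determine the memberships.
import numpy as np

def ifc_memberships(x, y, centers_x, h_vals, m):
    """x: (n, nv) inputs; y: (n,) outputs; centers_x: (c, nv); h_vals: (c, n)."""
    dist2 = np.linalg.norm(x[None, :, :] - centers_x[:, None, :], axis=2) ** 2
    err2 = (y[None, :] - h_vals) ** 2              # squared interim-function error
    num = np.fmax(dist2 + err2, np.finfo(float).eps)   # d_ik^2 + (y_k - h_i)^2
    ratio = (num[:, None, :] / num[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                 # mu_ik^imp, shape (c, n)
```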
At this stage, we apply the cluster validity index, cviIFF, to measure the optimum number of clusters. In Chapter 3, the implementation of the cviIFF in regression models is explained with experiments.

Approximation of Improved Fuzzy Functions
The interim matrix, τ_i(μ_i^imp)∈U^imp, composed of the improved membership values and their transformations, is used in this step to provide additional predictors for the fuzzy regression functions. The novelty of the "Fuzzy Functions" approach is that a different dataset is structured for each cluster i by using the improved membership values (μ_i^imp), with their possible other forms, as additional predictors alongside the original input matrix; see Figure 4.6. This is similar to mapping the original input space, ℜ^nv, onto a higher dimensional feature space, ℜ^(nv+nm), viz., x→Φ_i(x,τ_i), for each cluster i, i=1,…,c. Hence, each datum is represented in the (nv+nm) feature space. It should be noted that the improved fuzzy functions are represented the same way as the fuzzy functions in equation (4.1) and Figure 4.4, but this time the improved membership values obtained from the IFC method are used. For instance, the sample dataset of multiple inputs and a single output that was used to demonstrate the fuzzy functions in the previous section can also be used to demonstrate the improved fuzzy function structure. A new dataset is formed for each cluster by mapping the original input matrix onto a higher dimension using the improved membership values as additional predictors, i.e., ℜ^nv→ℜ^(nv+1), as follows:
$$\Phi_i(x,\tau_i)\in\Re^{nv+1}=\begin{bmatrix}\mu_{i,1}^{imp} & x_{1\times 1} & \cdots & x_{nv\times 1}\\ \vdots & \vdots & \ddots & \vdots\\ \mu_{i,n_{\alpha}^{i}}^{imp} & x_{1\times n_{\alpha}^{i}} & \cdots & x_{nv\times n_{\alpha}^{i}}\end{bmatrix},\quad \mu_{ik}^{imp}>\alpha\text{-cut},\ 0<n_{\alpha}^{i}\le n \tag{4.14}$$
In (4.14), i=1,…,c indexes the clusters, nm=1 denotes that there is only one additional column in this special example, and nv indicates the original input dimension. Here, one fuzzy regression function per cluster is identified in (nv+1+1) space (with an additional dimension for the intercept). It should be noted that these functions are nothing but crisp multi-variable, single-output functions. Using μ_ik^imp>α-cut>0, we eliminate clustering anomalies that may be created by outliers, and only use the remaining vectors to identify the local fuzzy functions. A prominent feature of the fuzzy function approximation is that, if the relations between the input and output variables cannot be defined in the original space, we can use the proposed fuzzy functions approach to explain their relationship in the ℜ^(nv+nm+1) space using the additional information induced by the membership values and their transformations. Consequently, one fuzzy function, f(Φ_i), is identified for each cluster i, as shown in Figure 4.6, to obtain fuzzy output values for each vector. For instance, a fuzzy function of a cluster can be represented by a set of linear planes in ℜ², such as in (4.14), by

$$\hat{y}=f(\Phi_i)=\hat{W}_{0i}+\hat{W}_{1i}\,\mu_i^{imp}+\hat{W}_{2i}\,x=\Phi_i^{T}\hat{W}_i \tag{4.15}$$
Ŵ_i=[Ŵ_0i Ŵ_1i Ŵ_2i] are the regression coefficients of the fuzzy functions. One can use various forms of the improved membership values to identify the local fuzzy decision surfaces. We use mathematical transformations of the membership values such as (μ_i^imp)^p, exp((μ_i^imp)^p), ln((1−μ_i^imp)/μ_i^imp), 1/(1+exp(μ_i^imp)), p∈ℜ, etc. The optimum membership function transformations depend merely on the structure of the dataset, similar to simple regression problems, where the relationship of the input variables with the output depends on linearity (polynomials should be used if the relations cannot be explained with linear functions). Hence, we identify the optimum structure based on the given dataset. Mainly, search methods such as exhaustive search or genetic algorithms are used to optimize these parameters. We refer to these parameters of fuzzy function systems as fuzzy function structures. In
addition, the membership value transformations also depend on the clustering structure, the clustering parameters, etc. It is therefore difficult to construct a general rule about the suitability of a type of fuzzy function for a specific type of problem set. If SVR is used to identify the fuzzy functions, the general function to find the fuzzy outputs of each cluster is given by:
ŷ_i = f_i(Φ_i, α_i, α_i^*) = ∑_{k=1}^{n} (α_ik − α_ik^*) K(Φ_ik, Φ_i) + b_i    (4.16)
In (4.16), (α_i, α_i^*) are the non-zero Lagrange multipliers, which denote the multipliers of the support vectors in each cluster i, and K(⋅) is the kernel function that generates a new feature space for each cluster using Φ_i ∈ ℜ^(nv+nm), i.e., Φ_i→K(Φ_i). A linear or a non-linear kernel can be used depending on the non-linearity between the input variables and the output variable. Just as in the LSE structure, in SVR models Ŵ_i represents a more complex structure indicating the model parameters of each cluster, as follows:
Struct Ŵ_i {
  [α_i, α_i^*]: Lagrange multipliers of the support vectors of cluster i,
  [Φ_i^s]: training vectors of cluster i that are support vectors (s),
  [K]: type of kernel mapping (K), e.g., Gaussian, linear, etc.
}
When SVR is used to identify the fuzzy functions, the system fuzzy function parameters are represented in the four-tuple form Ŵ_i = ⟨α_i, α_i^*, Φ_i^s, K⟩. Finally, as the last step of the learning mechanism, one can infer a single output value for each training data sample through the fuzzy weighted average function as follows:
ŷ = ∑_{i=1}^{c} μ_i^imp ŷ_i / ∑_{i=1}^{c} μ_i^imp    (4.17)
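To make the flow from (4.14) to (4.17) concrete, the following is a minimal sketch in Python (with NumPy) of the local-model step, assuming the improved membership values have already been obtained from IFC. The function names, the choice of ordinary least squares for (4.15), and the omission of membership transformations are illustrative simplifications of this sketch, not part of the method's prescription.

```python
# A minimal sketch of the T1IFF local-model step (eqs. 4.14, 4.15, 4.17),
# assuming the improved membership values MU (n x c) come from IFC.
import numpy as np

def build_phi(X, mu_i, alpha_cut=0.0):
    """Augment the input matrix with improved memberships (eq. 4.14).

    Rows whose membership falls below the alpha-cut are dropped to
    suppress clustering anomalies caused by outliers."""
    keep = mu_i > alpha_cut
    return np.column_stack([mu_i[keep], X[keep]]), keep

def fit_cluster_function(Phi_i, y_i):
    """Least-squares estimate of W_i in y ~ Phi_i W_i (eq. 4.15)."""
    A = np.column_stack([np.ones(len(Phi_i)), Phi_i])  # intercept column
    W_i, *_ = np.linalg.lstsq(A, y_i, rcond=None)
    return W_i

def t1iff_predict(X, MU, W):
    """Membership-weighted average of per-cluster outputs (eq. 4.17)."""
    c = MU.shape[1]
    y_hat = np.column_stack(
        [np.column_stack([np.ones(len(X)), MU[:, i], X]) @ W[i] for i in range(c)]
    )
    return (MU * y_hat).sum(axis=1) / MU.sum(axis=1)
```

Here each cluster contributes one crisp local estimate; note that the α-cut only filters the training rows used for fitting, while at inference time every vector is scored by all clusters.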
In conclusion, after the structure identification of T1IFF is completed, the outputs we obtain from each model will be:

• the parameters of the interim fuzzy functions h(τ_i) of each cluster, ŵ_i, i = 1…c, which are used during IFC, and the interim input matrix, τ_i, viz., the list of different types of improved membership value transformations that are used to approximate each h(τ_i),
• the degree of fuzziness, m, and the number of clusters, c, parameters of IFC,
• the parameters of the system fuzzy functions f(Φ_i) of each cluster, Ŵ_i = ⟨α_i, α_i^*, Φ_i, K⟩, i = 1…c, and the optimum α-cut value,
• if SVR is used to find the fuzzy functions, the C regularization constant, the ε margin for epsilon-insensitive regression, and the kernel type and parameters, in addition to the initial parameters, for a non-linear SVR.

The optimum model is captured with an exhaustive search method during structure identification using the training and validation datasets, i.e., training a particular model with a set of initial parameters on the training dataset and validating it on a validation dataset for every set of parameters. The overall
performance of the model is measured by applying the optimum model on the testing dataset. Hence, the inference mechanism, to be discussed next, is used both for validating the model (optimization of the model parameters) and testing the generalization capacity of the optimum model.
4.4.2 Structure Identification of T1IFF-C for Classification Models

Structure identification (training) of the proposed T1IFF-C systems for classification type problems is very similar to the structure identification of T1IFF systems, as shown in Figure 4.8. Instead of implementing IFC clustering, IFC-C clustering [Celikyilmaz and Turksen, 2007i] is implemented for classification domains. It should be noted that the matrix τ_i is obtained from the clustering algorithm IFC-C. It contains only the membership values and their transformations, a total of nm different input variables such as in (4.13), to approximate the interim fuzzy classifier functions, h_i(τ_i | ŵ_i), with a classification function method of the user's choice, e.g., logistic regression (LR) or support vector classification (SVC). A sample interim classifier fuzzy function using a linear LR method is represented by:

h(τ_i) = 1 / (1 + exp(−(ŵ_0,i + ∑_j ŵ_j,i τ_j,i))),   i = 1,…,c, j = 1,…,nm    (4.18)
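As a small illustration of (4.18), a hedged sketch follows; h_interim is an invented name, and the coefficients ŵ_i are assumed to have been fitted by LR beforehand.

```python
# A minimal sketch of an interim fuzzy classifier function (eq. 4.18).
# tau_i: (n x nm) matrix of membership-value transformations of cluster i;
# w_i:   coefficient vector [w0, w1, ..., w_nm], e.g., fitted by LR.
import numpy as np

def h_interim(tau_i, w_i):
    """Linear-logistic interim classifier h(tau_i) for one cluster."""
    z = w_i[0] + tau_i @ w_i[1:]
    return 1.0 / (1.0 + np.exp(-z))
```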
After the IFC-C is executed and the improved membership values are obtained, in order to find the local decision surfaces of each cluster i, a fuzzy classifier function f(Φ_i) is identified in the feature space ℜ^(nm+nv). Here, nm is the number of additional membership value transformations augmented to the original input matrix and nv is the number of original input variables, i.e., x→Φ_i(x, τ_i). Then posterior probabilities, P̂_i(ŷ_i=1 | f_i(Φ_i)), are calculated from f(Φ_i). For instance, if support vector classification (SVC) is used to estimate the fuzzy classification functions, f(Φ_i), the output values of the data vectors are calculated by:
Fig. 4.8 Topology of Type-1 Improved Fuzzy Functions Systems for Classification Models
f_i(Φ_i) = ∑_{k=1}^{n} β_ik^* y_k K(Φ_ik, Φ_i) + b_i    (4.19)
In (4.19), the β_i^* are the non-zero Lagrange multipliers, which denote the multipliers of the support vectors of cluster i, and K(⋅) is the kernel function which uses Φ_i ∈ ℜ^(nv+nm) to generate a new feature space, i.e., Φ_i→K(Φ_i), where K can be a linear or a non-linear function. The parameters of fuzzy functions such as in (4.19) are represented by Ŵ_i, which is a rather complex structure, as in the T1IFF method. The parameters of the models for each cluster i, Ŵ_i, will include:
Struct Ŵ_i {
  [β_i^*]: Lagrange multipliers of the support vectors of cluster i,
  [Φ_i^s]: training vectors of cluster i that are support vectors (s),
  [K]: type of kernel mapping (K), e.g., Gaussian, linear, etc.
}
We denote Struct Ŵ_i with the three-tuple notation Ŵ_i = ⟨β_i^*, Φ_i^s, K⟩. Fuzzy posterior probabilities for the i-th cluster are calculated by:

P̂_i(y = 1 | f_i(Φ_i)) = (1 + exp(−(a_1 f_i(Φ_i) + a_2)))^(−1)    (4.20)
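The sigmoid mapping in (4.20) can be sketched as follows. Note that the text does not prescribe how a_1 and a_2 are obtained, so treating them as constants fitted elsewhere (e.g., by a Platt-style calibration on held-out decision values) is an assumption of this sketch.

```python
# A hedged sketch of eq. (4.20): mapping SVC decision values f_i(Phi_i)
# to fuzzy posterior probabilities with given calibration constants.
import numpy as np

def posterior_from_decision(f_vals, a1, a2):
    """P_i(y=1 | f_i(Phi_i)) = 1 / (1 + exp(-(a1*f + a2)))."""
    return 1.0 / (1.0 + np.exp(-(a1 * f_vals + a2)))
```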
Then each fuzzy posterior probability of each cluster, obtained from the classification function of the user's choice, e.g., LR, SVC, etc., is weighted with its membership value:

g_i = μ_i^imp P̂_i,   i = 1,…,c    (4.21)
Overall crisp posterior probabilities are calculated by the membership-weighted method as follows:

P̂ = ∑_{i=1}^{c} g_i / ∑_{i=1}^{c} μ_i^imp = ∑_{i=1}^{c} μ_i^imp P̂_i / ∑_{i=1}^{c} μ_i^imp    (4.22)
Actual class labels are calculated based on the optimum classification threshold value (OTV) obtained during the structure identification of T1IFF-C. The OTV is calculated from the application of ALGORITHM 4.3.
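A compact sketch of the output processing in (4.21)-(4.22) and the OTV thresholding is given below, assuming per-cluster posteriors P and improved memberships MU as (n x c) arrays; otv stands in for the optimum threshold value found by ALGORITHM 4.3.

```python
# A minimal sketch of eqs. (4.21)-(4.22) and the OTV thresholding step.
import numpy as np

def aggregate_posteriors(MU, P):
    """Membership-weighted crisp posterior (eq. 4.22)."""
    g = MU * P                              # g_i = mu_i^imp * P_i (eq. 4.21)
    return g.sum(axis=1) / MU.sum(axis=1)

def classify(MU, P, otv=0.5):
    """Assign crisp class labels using the optimum threshold value."""
    return (aggregate_posteriors(MU, P) >= otv).astype(int)
```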
4.4.3 Inference Mechanism of T1IFF for Regression Problems

The inference (reasoning) mechanism is used for validating a model (optimization of the model parameters) or testing the generalization capacity of an optimum model. This work applies a sub-sampling cross validation method to optimize model parameters based on the performance of the models applied on the validation dataset. Hence, the learning mechanism uses the training dataset and the inference mechanism uses the validation dataset to search for the optimum parameters of a given system. Let (⋅)*
indicate the optimum parameters obtained from the exhaustive search method. After the optimum T1IFF system parameters,

• the optimum number of clusters, c*,
• the optimum degree of fuzziness, m*,
• the optimum parameters of the fuzzy classifier functions of IFC, ŵ_i^*,
• the optimum system fuzzy function parameters, in four-tuple form, Ŵ_i^* = ⟨α_i, α_i^*, Φ_i, K⟩, i = 1…c*, and
• some additional parameters specific to the selected function estimation type, e.g., kernel parameters (if SVR is used),

are obtained from the structure identification of the T1IFF system, we measure the performance of each T1IFF model with the reasoning (inference) method. In the following, the new reasoning (inference) mechanism is described using the testing dataset. In Chapter 3, we introduced a new membership value calculation equation for IFC, i.e.,

μ_ik^imp = [ ∑_{j=1}^{c} ( (d_ik² + (y_k − h_i(τ_ik^*, ŵ_i^*))²) / (d_jk² + (y_k − h_j(τ_jk^*, ŵ_j^*))²) )^(1/(m−1)) ]^(−1),   1 ≤ i, j ≤ c, 1 ≤ k ≤ n    (4.23)
which is used by the new inference method to find the membership values of the testing data samples in each cluster. In order to calculate the membership values of the testing data vectors, whose output values are unknown, we need to review the new membership value calculation equation in (4.23). Note that it has an additional squared error term, SE_ik = (y_k − h(τ_ik^*, ŵ_i^*))². One should measure the value of this error term to calculate the improved membership values of the testing samples. Here, y_k indicates the actual output value of each testing sample, which is an unknown term for the inference method. And τ_i^* is the input matrix of the improved membership values such as in (4.13) and their user-defined transformations, e.g., logistic or power functions; it is an unknown parameter, since we have not yet calculated the improved membership values of the testing samples. The only known parameters that can be used during fuzzification of the testing samples are the estimated optimum function parameters, ŵ_i^*, i = 1…c, of the membership values of the IFC model from the training algorithm. Hence, we need to approximate the unknown value of SE_ik to be able to calculate the new improved membership values of the testing data samples. For this, we propose a case-based inference approach to estimate the error term using training data samples and their membership values. Since the algorithm partially relies on the training samples, it is called a semi-parametric inference method. The following parameters are given at the beginning of the inference method of the T1IFF strategy:

• X^te = {x_k, k = 1…nte} represents the testing dataset,
• c* represents the optimum number of clusters,
• m* is the optimum fuzziness value,
• X = {(x_k, y_k), k = 1…n} represents the training data samples,
• n is the total number of training data samples,
• μ_ik^imp ∈ U*(x), i = 1…c*, represent the improved membership values of the training samples,
• τ_i^* is the optimal interim fuzzy function matrix composed of improved membership values and their transformations, and ŵ_i^* are the parameters of the interim fuzzy functions identified using τ_i^*,
• Ŵ_i^* are the structures of the system local fuzzy regression functions of the T1IFF system model.
ALGORITHM 4.5 shows the inference mechanism of the proposed T1IFF method, which is summarized in Figure 4.9.
Fig. 4.9 Proposed Inference Mechanism of T1IFF
ALGORITHM 4.5 can be summarized as follows. The first step of the new inference mechanism is to find the κ training data samples that are the nearest to each testing data sample r, r = 1…nte, using the Euclidean distance measure, where A is the norm. The improved membership values of the κ-nearest training data samples are used to build the interim matrix, τ_i^* = [τ_i1^* … τ_iκ^*]^T, a total of κ vectors for each cluster, to be used to estimate the membership values of the corresponding testing sample r. Thus, using the regression function parameters, ŵ_i^*, of the interim fuzzy functions calculated during the IFC algorithm, h_i, and the input matrix structure, τ_i^*, i = 1…c, formed with the improved membership values, the output value of each κ-nearest training sample is calculated using h_i(τ_iq^*, ŵ_i^*), q = 1…κ. The next step is to measure the squared error values of these nearest κ data samples for each cluster, i.e., SE_iq = (y_q − h_i(τ_iq^*, ŵ_i^*))², q = 1…κ, i = 1…c, to be used to approximate the mSE_ir of the r-th test data sample in each cluster i. Next, the error values, SE_iq, are weighted with weight constants, η_rq, which represent the normalized distances of the κ training samples to the testing sample r. The average approximate squared error of the r-th testing sample in the i-th cluster is calculated as the weighted squared error, mSE_ir, which is used in the new membership function to calculate the improved membership values of the testing samples. Then type-1
fuzzy outputs are weighted by their membership values to get a crisp output value of type-0.

ALGORITHM 4.5 Inference Mechanism of the Improved Fuzzy Functions (T1IFF) for regression models

Select κ ∈ Z⁺. Iterate for each testing data sample r = 1,…,nte:
Step 1: Find the distance of the r-th testing vector to each training sample x and store them in D_r:
    D_r = { d_rq ∈ ℜ | d_rq = ||x_r^te − x_q||_A² },   q = 1,…,n, te: testing vectors
Step 2: Sort the distances in D_r in ascending order and choose the first κ training vectors as the κ-nearest data samples to the testing sample r.
Step 3: Save the distances of the κ-nearest training samples in D_r = [d_r1 … d_rκ]^T, their corresponding improved membership values in cluster i in U_i^imp = [μ_i1^imp,…,μ_iκ^imp]^T, i = 1,…,c*, and their actual output values in y_r = [y_1,…,y_κ]^T.
Step 4: Calculate the average error term (mSE_ir) of the κ-nearest data samples in each cluster:
    mSE_ir = ∑_{q=1}^{κ} SE_iq ⋅ η_rq,   SE_iq = (y_q − h_i(τ_iq^*, ŵ_i^*))² = (y_q − ŷ_i(τ_iq^*))²,   η_rq = 1 − (d_rq / ∑_{s=1}^{κ} d_rs)
Step 5: Use the membership function in (4.23) to calculate the improved membership values of testing sample r in each cluster, μ_ir^imp, using d_ir = ||x_r − υ_i^imp(x^test)|| and mSE_ir.
Step 6: Map the r-th testing vector onto every cluster i using the structure Φ_i to form the mapped testing vector in each cluster space, Φ_ir, similar to input matrices such as in (4.14), and infer its output values, ŷ_ir, in each cluster using functions such as (4.16).
Step 7: Find a single output value for each data object in the testing dataset using (4.17).
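The following is a hedged sketch of ALGORITHM 4.5 for a single testing vector, assuming a Euclidean norm (A = I) and simplifying the interim matrix τ* to the raw membership values; in the full method τ* also carries the user-selected transformations, and h stands for the fitted interim fuzzy function of a cluster.

```python
# A minimal sketch of the case-based inference of ALGORITHM 4.5
# for one testing vector x_r. Assumptions: Euclidean distance; the
# interim input is simplified to the raw memberships MU of the
# kappa nearest training samples.
import numpy as np

def infer_membership(x_r, X_train, MU, y, h, centers, m, kappa=5):
    d = np.linalg.norm(X_train - x_r, axis=1)     # Step 1: distances
    nn = np.argsort(d)[:kappa]                    # Step 2: kappa nearest
    eta = 1.0 - d[nn] / d[nn].sum()               # normalized weights
    c = MU.shape[1]
    mse = np.empty(c)
    for i in range(c):                            # Steps 3-4: weighted SE
        se = (y[nn] - h(i, MU[nn, i])) ** 2
        mse[i] = (se * eta).sum()
    d2 = np.array([np.sum((x_r - v) ** 2) for v in centers])
    num = d2 + mse                                # Step 5: eq. (4.23)
    ratio = (num[:, None] / num[None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                # mu_ir^imp, i = 1..c
```

Steps 6-7 then proceed exactly as in the training-side scoring: map x_r onto each Φ_i with these memberships and combine the cluster outputs with (4.17).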
4.4.4 Inference with T1IFF-C for Classification Problems

The inference (reasoning) mechanism of the T1IFF-C approach differs from the inference mechanism of the T1IFF in the way the fuzzy function parameters are obtained. Hence, the new membership function for IFC-C is re-displayed as follows:
μ_ik^imp = [ ∑_{j=1}^{c} ( (d_ik²(x_k, υ_i) + (y_k − p̂_ik(y_k=1 | h(τ_ik^*, ŵ_i^*)))²) / (d_jk²(x_k, υ_j) + (y_k − p̂_jk(y_k=1 | h(τ_jk^*, ŵ_j^*)))²) )^(1/(m−1)) ]^(−1),   1 ≤ i ≤ c, 1 ≤ k ≤ n    (4.24)
The membership function in (4.24) is used in the new inference method to find the improved membership values of the testing data samples in each cluster. In order to calculate the membership values of the testing data vectors, whose output
values are unknown, we need to review the new membership value calculation equation in (4.24). The new membership function of IFC-C has an additional term, SE_ik = (y_k − p̂_k(y_k=1 | h(τ_ik^*, ŵ_i^*)))², where the y_k are class labels and p̂_i(⋅) are posterior probabilities. One should measure the value of this error term before calculating the improved membership values of the testing samples. The inference algorithm of T1IFF-C is similar to that of the T1IFF systems for regression models, so ALGORITHM 4.5 is slightly updated for T1IFF-C models. In Step 4 of ALGORITHM 4.6, the IFC-C model output values ŷ_iq = h(τ_iq^* | ŵ_i^*) of each training sample are replaced with the model posterior probabilities, p̂_i(⋅).

ALGORITHM 4.6 Inference Mechanism of Type-1 Improved Fuzzy Functions (T1IFF-C) for classification problems

Select κ ∈ Z⁺. Iterate for each testing data sample r = 1,…,nte:
Step 1: Find the distance of the r-th testing vector to each training sample x and store them in D_r:
    D_r = { d_rq ∈ ℜ | d_rq = ||x_r^te − x_q||_A² },   q = 1,…,n, te: testing vectors
Step 2: Sort the distances in D_r in ascending order and choose the first κ training vectors as the κ-nearest data samples to the testing sample r.
Step 3: Save the distances of the κ-nearest training samples in D_r = [d_r1 … d_rκ]^T, their corresponding improved membership values in cluster i in U_i^* = [μ_i1^imp,…,μ_iκ^imp]^T, i = 1,…,c*, and their actual class labels in y_r = [y_1,…,y_κ]^T.
Step 4: Calculate the average error term (mSE_ir) of the κ-nearest data samples in each cluster:
    mSE_ir = ∑_{q=1}^{κ} SE_iq ⋅ η_rq,   SE_iq = (y_q − p̂_i(τ_iq^*, ŵ_i^*))²,   η_rq = 1 − (d_rq / ∑_{s=1}^{κ} d_rs)
Step 5: Use the membership value calculation equation in (4.24) to calculate the improved membership values of testing sample r in each cluster, μ_ir^imp.
Step 6: Map the r-th testing vector onto every cluster i using the structure Φ_i to form the mapped testing vector in each cluster space, Φ_ir, with input matrices such as in (4.14), and infer its probability values, P̂_ir, for each cluster using functions such as in (4.19) and (4.20).
Step 7: Find the single posterior probability for each data point in the testing dataset using (4.22).
4.5 Proposed Evolutionary Type-1 Improved Fuzzy Function Systems

Genetic Algorithms (GAs) [Holland, 1973; Goldberg, 1989] are adaptive heuristic search algorithms premised on the evolutionary ideas of natural selection and genetics. GAs were introduced as a computational analogy of adaptive systems and are the most commonly implemented evolutionary algorithm methods. They are modeled broadly on the principles of evolution via natural selection,
employing a population of individuals that undergo selection in the presence of variation-inducing operators such as mutation and recombination (crossover). Each individual represents one chromosome, and a genetic algorithm model holds a number of individuals, determined by the domain expert, on which the genetic operations are applied. A fitness function is used to evaluate individuals and their reproductive success. Nearly any problem can benefit from GAs, once one can encode solutions of the given problem into chromosomes and compare the relative performance (fitness) of such solutions. An effective GA representation and a meaningful fitness evaluation are the keys to success in GA applications. The appeal of GAs comes from their simplicity and elegance as robust search algorithms, as well as from their power to discover good solutions rapidly for difficult high-dimensional problems. GAs are useful and efficient when:

• the search space is large,
• domain knowledge is scarce or expert knowledge is difficult to encode to narrow the search space,
• traditional search methods fail.

The advantage of the GA approach is the ease with which it can handle arbitrary kinds of constraints and objectives; all such things can be handled as weighted components of the fitness function, making it easy to adapt the GA scheduler to the particular requirements of a very wide range of possible overall objectives. GAs have been used for problem solving and for modeling practices. Researchers have been investigating automatic learning methods for designing fuzzy inference systems by deriving an appropriate knowledge base automatically, without the need for a human operator. Genetic algorithms [Goldberg, 1989] are stochastic search methods, based on natural genetics, which are valid approaches for fuzzy systems requiring efficient and effective search processes. One of the ways to improve the performance of the proposed T1IFF systems, as in Figure 4.5, is that the optimum system parameters of both the fuzzy clustering and the local fuzzy functions could be determined automatically by an optimization method instead of an exhaustive search algorithm. Among many search algorithms, e.g., hill climbing, simulated annealing, gradient descent, etc., genetic algorithms have been the preferred solutions for fuzzy systems, as mentioned above. Fuzzy systems have a complex decision space with a highly non-linear structure due to the non-linear membership functions. This makes it hard to implement gradient descent algorithms while searching for the global minima; therefore, genetic algorithms, which are rather generalized optimization methods, are preferable for fuzzy system parameter optimization. One of the strengths of genetic algorithms is that different types of parameters can be easily encoded and, with a diverse population, the algorithm can choose its own path, based on fitness evaluation, towards where the optimum parameters are in the global search space. Hence, genetic algorithms utilize mutation operators to generate diversity in the population, which presents a global search space. The algorithm tends to keep the best solutions and generate similar ones until it finds the optimum solution.
In this work, we introduce a new hybrid fuzzy system by encoding different types of parameters, e.g., binary or scalar values, of Type-1 Fuzzy Functions (T1FF) and Type-1 Improved Fuzzy Functions (T1IFF) systems into genomes, to optimize them with GAs. Since genetic algorithms are the most commonly known type of evolutionary computing methods, the new fuzzy functions methods are called "Evolutionary Type-1 Fuzzy Functions", denoted ET1FF for short, and "Evolutionary Type-1 Improved Fuzzy Functions", denoted ET1IFF for short. Structurally speaking, ET1IFF and ET1FF systems are similar to each other, just as the T1FF and T1IFF systems are. The difference between them is that ET1FF systems implement the standard FCM clustering algorithm, whereas ET1IFF methods implement the proposed IFC method. Therefore, their structure identification and inference methods differ slightly. Details of these two methods are presented in the previous section. Since ET1IFF and ET1FF systems optimize the parameters of T1FF and T1IFF systems, we will only present ET1IFF systems in the next section. In the chapter on experiments (Chapter 6), both of these methods will be applied to benchmark datasets to compare their performances from generalization and robustness perspectives. The structure of each chromosome (genome) encodes the proposed ET1IFF system, i.e., the parameters of the IFC algorithm, including the parameters of the membership function, and the structure of the fuzzy functions, i.e., the interim and local fuzzy functions. The proposed evolutionary fuzzy system with improved fuzzy functions (ET1IFF) optimizes system parameters with genetic algorithms, which automatically determine the optimum parameters and at the same time reduce the number of iterations needed to optimize the learning parameters, as opposed to T1IFF strategies, which are based on exhaustive search methods. ET1IFF is introduced to solve regression type problems. An extension of ET1IFF for classification problems, ET1IFF-C, is also briefly presented in the next section.
Fig. 4.10 Evolutionary Improved Fuzzy Functions (ET1IFF) System Modeling Framework –Extension of the Improved Fuzzy Functions (T1IFF) Approach to Genetic Fuzzy Systems. i=1…c.
ET1IFF is an iterative hybrid system, in which the structure is built and the parameters are constructed and tuned by the genetic learning algorithm. The learning algorithm determines the size and the structure of the information structures in a novel way, addressing the two fundamental phases of system identification [Pedrycz and Reformat, 2003]. The proposed fuzzy model, as depicted in Figure 4.10, is comprised of two fundamental phases based on cross validation:

• Phase 1: Determination of the optimum parameters using the genetic learning process: learning from training data and fitness evaluation with validation data,
• Phase 2: Inference with testing data using the optimum model parameters.
4.5.1 Genetic Learning Process: Genetic Tuning of Improved Membership Functions and Improved Fuzzy Functions

This section presents, in turn, the basic mechanisms of the genetic learning process, e.g., coding, initial population creation, fitness function, genetic operators and stopping criterion, as implemented into the fuzzy functions learning strategy.

Chromosome Encoding
The structure of each chromosome (genome) encodes the proposed ET1IFF model, i.e., the parameters of the Improved Fuzzy Clustering (IFC) algorithm and the structures of the fuzzy functions. The parameters of IFC identify the shapes of the improved membership functions. The membership function parameters are encoded as separate genes (tokens) in each chromosome of the proposed ET1IFF systems. Since we implement the new IFC method into ET1IFF systems, the parameters that tune the membership functions represent the parameters of the IFC method. A hierarchical heterogeneous chromosome formulation of genetic algorithms [Wang et al. 2005] is implemented to build the ET1IFF model chromosomes. The genes of the chromosomes are classified into two different types and structures: parameter genes, which are real numbers, and control genes, which are binary codes. The hierarchical arrangement of genes is shown in Figure 4.11. The structure of the parameter genes, as shown in Figure 4.11, varies based on the selected fuzzy regression type. The parameter genes take on real or integer values. The first four parameter genes are common to all ET1IFF systems; they represent the type of fuzzy function approximator, e.g., 0 indicates linear regression, 1 indicates support vector regression, and so on, and the clustering parameters, m∈[1.1,∞), c∈Z⁺⊂[2, n^(1/℘)], ℘>0, where n is the number of training data samples, and α-cut ≥ 0. The rest of the parameter genes are determined by the structure of the selected fuzzy regression type. For instance, if a simple linear regression is used to identify the fuzzy functions, then the parameter genes consist of only the first four IFC parameters, as seen in Figure 4.11 (Top). For other modeling types such as support vector regression (SVR), additional model parameters are augmented to the genome. In the SVR case, two additional parameter genes, Creg∈ℜ>0 and epsilon (ε), are appended to the chromosome structure.
Fig. 4.11 Genome Encoding for ET1IFF systems. (Top) Hierarchical structure genome encoding when linear regression is used (Middle) Encoding when SVM is used (Bottom) Example chromosome structure.
Creg is the regularization parameter that balances the SVR objective function, viz., the weight vector and the error margin, and epsilon (ε) limits the error margin. These parameters determine the shape of the improved membership values as well as the improved fuzzy function structures. Similarly, the list of control genes, which take on only binary values, {0,1}, also varies based on the modeling type chosen by the expert. Only the first part of the control genes differs among ET1IFF systems, based on the type of regression model used. For simple regression models such as LSE, no additional control gene is defined (Figure 4.11, Top). However, if SVR is implemented into the ET1IFF modeling structure, then the control genes start with a kernel type, K(⋅), see Figure 4.11 (Middle). Two separate kernel types are used for the SVM formulation: the linear kernel, K(x_k, x_j) = x_k^T x_j, and the non-linear Gaussian radial basis kernel (RBF), K(x_k, x_j) = exp(−δ||x_k − x_j||), δ > 0. The rest of the control genes in Figure 4.11 are common to all types of ET1IFF systems. They represent the structure of the fuzzy functions, in other words, the list of membership value forms to be used as additional predictors of the input matrix to be mapped onto higher dimensions. Thus, the ET1IFF approach dynamically determines the feature mapping structure of the fuzzy functions. These are identified with different forms of type-1 fuzzy sets. The length of the fuzzy function structures, viz. the number of membership value types used in the regression functions, is determined prior to chromosome formation. These form the control genes. The control genes represent the fuzzy function structures, τ_i, implemented into IFC. These structures will then be used to identify the local fuzzy function structures, i.e., Φ_i(x, τ_i), i = 1,…,c. Activation of the control genes is set accordingly: an integer 1 is assigned to represent ignition, whereas 0 turns it off. Hence, a 0 in a control gene means that the corresponding membership value transformation will not be used in the model if that chromosome is used to set the parameters. In other words, fuzzy function structure tokens, i.e., alleles (gene locations), are assigned 1 if the genetic algorithm decides that the particular dimension will appear in shaping the fuzzy functions, and 0 otherwise. In ET1IFF models, we use the same fuzzy function structures and parameters for each cluster. Therefore, static-length chromosomes represent every cluster structure. The example chromosome structure in Figure 4.11 (Bottom) indicates that, for the given model, only e^μ will be used as an additional dimension to map the original input space onto a feature space. In the next chapter, with the implementation of evolutionary methods using type-2 improved fuzzy functions, different structures of the fuzzy function for every cluster can be identified. A sketch of this chromosome layout in code is given below.
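The following hedged sketch renders the hierarchical chromosome of Figure 4.11 as a plain Python structure. Field names, the dataclass container, and the concrete bounds used in the random initialization are illustrative assumptions; only the gene list and ranges follow the text.

```python
# An illustrative rendering of the ET1IFF chromosome of Figure 4.11.
import random
from dataclasses import dataclass, field

@dataclass
class Chromosome:
    # parameter genes (real / integer valued)
    ftype: int            # 0: linear regression, 1: SVR, ...
    m: float              # degree of fuzziness, m in [1.1, inf)
    c: int                # number of clusters
    alpha: float          # alpha-cut in [0, 1]
    c_reg: float = 0.0    # SVR regularization constant (ftype == 1 only)
    epsilon: float = 0.0  # SVR error margin (ftype == 1 only)
    # control genes (binary)
    kernel: int = 0       # 0: linear, 1: Gaussian RBF (SVR only)
    tau_mask: list = field(default_factory=list)  # membership transformations on/off

def random_chromosome(n_train, n_tau=5):
    """Random initialization within the ranges given in the text."""
    ftype = random.randint(0, 1)
    return Chromosome(
        ftype=ftype,
        m=random.uniform(1.1, 3.0),  # finite upper bound chosen for the sketch
        c=random.randint(2, max(2, int(n_train ** 0.5))),  # assumes p = 2 in n^(1/p)
        alpha=random.random(),
        c_reg=random.uniform(1e-3, 100.0) if ftype else 0.0,
        epsilon=random.uniform(0.0, 1.0) if ftype else 0.0,
        kernel=random.randint(0, 1) if ftype else 0,
        tau_mask=[random.randint(0, 1) for _ in range(n_tau)],
    )
```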
Initial Population Creation and Fitness Function

The initial population is randomly generated. The parameter genes are formed by random numbers that can assume values in type∈{0,1,…}, alpha∈[0,1], m∈[1.1,∞) and c∈Z⁺⊂[2, n^(1/℘)], Creg∈ℜ>0. The epsilon variable should be between zero and the range of the target values. The control genes are randomly assigned 0 or 1 at the start of the genetic learning process. The fitness function can be any function determined by the user. Some of the fitness functions used in this work are discussed in the chapter on experiments, Chapter 6. The sub-sampling (three-way) cross validation method described above is also implemented in the ET1IFF strategies. Even though the learning strategy of the ET1IFF approach is applied on the training dataset, the performance of each individual in the population is measured using the randomly selected validation dataset.

Genetic Operators, Selection and Stopping Criteria
Different genetic operators are utilized for the parameter and the control genes, since they are real and binary numbers, respectively. For the parameter genes, we use arithmetic and simple crossover and non-uniform and uniform mutation operators. For the control genes, simple crossover and shift mutation operators are utilized. Selection is the process by which individuals in the population are chosen to reproduce. We implemented a tournament selection that extracts k individuals from the population with uniform probability (without re-insertion) and makes them play a "tournament", where the probability for an individual to win is generally proportional to its fitness. Elitism is a variation on selection in which a user-determined percentage of the highest-ranked individuals is allowed to "walk over" into the new population. This ensures that there will always be a minimum amount of highly fit genetic material to apply the genetic operators to. In our models, we employed tournament selection based on an elitist strategy; a sketch of this selection step is shown after ALGORITHM 4.7 below. The purpose of ET1IFF is to optimize the learning parameters, i.e., the number of clusters (c), the level of fuzziness (m), the membership value structure and parameters, as well as the structure and parameters that hold the optimum interim and local fuzzy function structures and parameters, i.e., τ, Φ, ŵ, Ŵ, Creg, ε, K(⋅), etc., so that the optimum model representing the actual system has a good performance. The learning process of the proposed ET1IFF, as shown in Figure 4.10, Phase 1, is given in ALGORITHM 4.7. chr_pop, pop = 1,…,max-population-size, indicates each chromosome, i.e., individual, in the population of chromosomes at the t-th iteration. The genetic learning process uses the training and validation datasets to learn the model structure and select the best individuals to form the different generations (the population). The T1IFF training dataset is used to learn a model using the parameters of each chromosome, where the GLP identifies the optimum values of such parameters. At the start of the GLP process, each chromosome is randomly assigned a linear or a non-linear fuzzy function approximator type. One individual may identify a linear model, which uses membership values as additional predictors to shape the membership values and fuzzy functions. Another individual may be assigned a non-linear fuzzy function type to build a T1IFF model. Based on the evolution strategies, the algorithm determines which fuzzy function approximator type yields better performance based on the fitness function. During the genetic learning process, the performance of each individual is measured by applying T1IFF's inference mechanism, as in ALGORITHM 4.5, on the validation dataset by decoding its chromosome. The individuals with the best performance are selected based on the tournament selection method.

ALGORITHM 4.7 Genetic Learning Process of ET1IFF systems

GA initializes chromosomes to form the initial population_t=0 as in Figure 4.11, depending on the system model type, i.e., linear or non-linear functions.
Start iterating, t = 1,…,max-number-iterations:
{
  chr_pop: the pop-th chromosome in the population at the t-th iteration, including the parameters type_pop, alpha_pop, m_pop, c_pop, Creg_pop, ε_pop, kernel-type_pop {K(⋅)}, and the list of membership value transformations to construct {τ_pop, Φ_pop}.
  if chr_pop has not been used in past iterations {
    LEARNING - ALGORITHM 4.4
    - Compute IFC with the parameters from chr_pop using the training data to obtain the improved membership values (μ_ip^imp) and calculate their possible transformations to form the input matrix τ_i,pop ∈ U_pop^imp.
    - Map each original datum onto the individual clusters to form feature vectors for each cluster c_pop, x→Φ_i,pop(x, τ_i,pop), using the control genes to determine which membership value form to choose for chr_pop.
    - Approximate the fuzzy functions, f_i,pop(Φ_i,pop, Ŵ_i,pop), for each cluster i = 1…c_pop, such as (4.15), (4.16), using the chr_pop parameters, such as in Figure 4.13.
    REASONING - ALGORITHM 4.5
    - Find the improved membership values of the validation data and infer their output values using ALGORITHM 4.5 based on chr_pop.
    - Measure the fitness value on the validation data.
  }
  The genetic algorithm generates the next population_t+1 by means of selection, crossover and mutation operations.
  Next iteration (t = t+1)
}
End
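A minimal sketch of the elitist tournament selection used by the GLP follows; the tournament size k and the elite fraction are user-set quantities, and maximization of fitness is assumed.

```python
# A hedged sketch of tournament selection with elitism for the GLP.
import random

def tournament_select(population, fitness, k=3):
    """Pick the fittest of k individuals drawn without re-insertion."""
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitness[i])]

def next_generation(population, fitness, elite_frac=0.1, k=3):
    """Elites 'walk over'; the remaining slots are filled by tournaments."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    n_elite = max(1, int(elite_frac * len(population)))
    new_pop = [population[i] for i in order[:n_elite]]
    while len(new_pop) < len(population):
        new_pop.append(tournament_select(population, fitness, k))
    return new_pop  # crossover and mutation are then applied to the non-elites
```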
The difference between the proposed T1IFF and ET1IFF methods is that in T1IFF systems only one model is built for a set of parameters, whereas in ET1IFF systems we deal with a population of sets of parameters to build different ET1IFF models. These models are structurally different from each other. The best parameters are sought among the potential individuals based on their fitness values on a validation dataset. In ET1IFF systems, each individual in the population represents a different fuzzy decision surface, based on the parameter and control genes of its chromosome structure. The individuals also represent different functions, such as least squares regression, support vector regression, etc., identified by the first token of each chromosome. On the other hand, in T1IFF approaches, the fuzzy function of each cluster has the same structure and type. At the start of T1IFF, the domain expert determines whether the model will use linear or non-linear fuzzy function approximators during the exhaustive search process. Hence, different sets
(a) Fuzzy Functions, f(x, e^μ), SVM with Linear Kernel, Kernel token = 0.
(b) Fuzzy Functions, f(x, e^μ), SVM with Gaussian Kernel, Kernel token = 1.

Fig. 4.12 Decision surfaces of one chromosome of an ET1IFF model using two different fuzzy function structures. (m, c, Creg, ε) = {1.75, 3, 54.5, 0.115}. u_ci represents the membership values of the corresponding cluster.
of models are to be constructed for T1IFF models, whereas ET1IFF handles the fuzzy function approximator type during the optimization process and chooses the optimum type based on the stochastic search method. The crossover and mutation operators are applied on randomly chosen chromosomes of the same type. Figure 4.12 depicts two sample decision surfaces of two separate models from a corresponding gene pool constructed for a non-linear sinusoidal artificial dataset. The top and bottom graphs represent two separate decision surfaces of the T1IFF fuzzy functions of each cluster, identified by two separate chromosomes of the population of chromosomes of an ET1IFF model. The difference between the upper and lower graphs is that all tokens of their corresponding chromosomes are kept the same except the kernel type (token #7 in Figure 4.13), which determines the non-linearity of the fuzzy functions. Hence, the upper three graphs represent the decision surfaces of a model from the gene pool with a three-cluster structure, where a linear support vector regression is used to estimate each hyper-plane, whereas in the lower graphs a radial basis kernel function is implemented to build non-linear hyper-planes. The presented ET1IFF modeling strategy searches for the best local fuzzy decision surfaces by optimizing the parameters that are represented by each chromosome.
Fig. 4.13 Two different chromosomes from the GLP algorithm of the ET1IFF modeling approach applied on the Artificial Dataset. The dark colored token is the only difference between the two chromosomes. '1': Linear Kernel Model, '2': Non-Linear Kernel Model.
We can extend the proposed ET1IFF systems to build models of evolutionary improved fuzzy functions for classification problems, ET1IFF-C. The only difference between the ET1IFF and ET1IFF-C models is that in ET1IFF-C models we approximate decision surfaces that can discern between two dichotomous classes. Posterior probabilities for each data vector in each cluster are estimated instead of continuous output values. Thus, each chromosome indicates the parameters of a T1IFF-C model. In this work, we only deal with binary classification problems. The proposed ET1IFF-C applies IFC-C during structure identification. Therefore, the structure identification (learning) phase, Phase 1, of the ET1IFF-C system is slightly different from the structure identification stage of the ET1IFF approach in ALGORITHM 4.7. The learning process of ET1IFF-C utilizes classification type fuzzy functions, e.g., logistic regression, support vector classification (SVC), etc., and the chromosome structures are slightly different from the ET1IFF chromosome structures, as shown in Figure 4.14. When SVC is selected to approximate the fuzzy classifier functions, the chromosomes will
not include the epsilon parameter, since it is a parameter for support vector regression models. The rest of the parameters will be the same as the ET1IFF parameters.
Fig. 4.14 Genome Encoding for ET1IFF-C systems for Classification Domains. (Top) Hierarchical structure genome encoding for linear regression (Middle) Encoding when SVM is used (Bottom) Example chromosome structure.
The same genetic operators used in the ET1IFF approach are also used in the ET1IFF-C approach. The fitness function of ET1IFF-C is different from the fitness function of the ET1IFF methods, due to the structural differences between the problem domains. For ET1IFF-C fitness evaluation, we implemented measures such as accuracy, area under the curve, etc., to be discussed in the chapter on experiments (Chapter 6). The best models of the gene pool are evaluated based on their fitness function. For each model, one T1IFF-C model is identified and then evaluated using the inference method in ALGORITHM 4.6. As a result of the genetic learning process, in both ET1IFF and ET1IFF-C, the parameters of the individual, i.e., the chromosome, which identifies the model that has the optimum performance measure, are retained to do reasoning for new data vectors. These chromosomes represent the optimum model parameters of the T1IFF model. Similarly, the optimum chromosome of ET1IFF-C represents the parameters of the T1IFF-C model. The next step is to measure the general performance of the optimum model with the new inference method.
4.5.2 Inference Method for ET1IFF and ET1IFF-C

The main objective of the proposed ET1IFF and ET1IFF-C strategies, which are based on cross validation techniques, is to find a crisp output for a given input vector. The crisp output value is either a real number for the ET1IFF models or a dichotomous class label for the ET1IFF-C models. Therefore, as a result of the algorithm, the type of the output is reduced from type-1 to type-0 by using the ET1IFF or ET1IFF-C model parameters. For the regression problem domains, during the first phase of the ET1IFF method, the genetic learning phase, the best
individual is identified based on a performance measure. It should be noted that each individual in the population represents one T1IFF model. During the genetic learning process, the inference engine of the T1IFF system in ALGORITHM 4.5 is used to estimate the model output of the validation data in order to measure the fitness values of each population. After the genetic learning process, the optimum parameters of the best model from the gene pool are used to score the testing dataset, to evaluate the optimum model performance using the same T1IFF inference mechanism. At this point, ET1IFF and T1IFF implement the same inference mechanism, as shown in ALGORITHM 4.5. Herein, the second phase of ET1IFF, as shown in Figure 4.10, is the inference method using testing data. The testing data, which has not been used for learning or validation purposes, is used to evaluate the overall model performance. Similarly, the proposed inference method for classification models in ALGORITHM 4.6 applies to ET1IFF-C model performance evaluation on testing datasets as well.
4.5.3 Reduction of Structure Identification Steps of T1IFF Using the Proposed ET1IFF Method

In type-1 improved fuzzy functions (T1IFF) systems, the initial parameters are optimized with an exhaustive search based on the supervised learning method by iterating over the list of parameters: the degree of fuzziness (m), the number of clusters (c), the types of membership value transformations (τ, Φ) of the fuzzy functions, both of which use the same membership value transformations to identify the interim and local fuzzy functions, the alpha-cut constant, and the support vector regression (SVR) parameters such as the regularization constant (Creg) and the error margin (ε). The values of these parameters must all be set prior to model execution. Assuming each parameter has N different discrete values and two different kernel types are iterated, the number of iterations of an exhaustive search would be 2N^6, as shown in Table 4.1. Roughly speaking, N can correspond to ~10. For instance, we use 10 different values for the number of clusters, e.g., c = {2,3,4,…,11}, during the exhaustive search optimization process.

Table 4.1 Number of parameters of a Type-1 Improved Fuzzy Functions (T1IFF) experiment
Parameter                                            Number of discrete values
m: degree of fuzziness                               N
c: number of clusters                                N
α-cut                                                N
{τ, Φ}: matrix structures to form fuzzy functions    N
Creg: regularization constant of SVR                 N
epsilon: constant for SVR                            N
Kernel type                                          2
Total                                                2N^6
On the other hand, for ET1IFF, let the total number of iterations of the GA be N², and let each chromosome represent a randomly initialized T1IFF model. Roughly speaking, we set the number of iterations to 100, so the correspondent of the N different values of a parameter of T1IFF listed in Table 4.1 will be ~N². The population size will also be set to N², since we use 50-100 different populations in our experiments. At each iteration, two child chromosomes are selected for crossover and their fitness functions are evaluated, and one child is mutated from two selected parents and its fitness is also evaluated. Hence, three T1IFF models are built at each iteration of the GA. Therefore, the number of iterations of ET1IFF would be N² + 2N² = 3N², as shown in Table 4.2.

Table 4.2 The number of parameters of an Evolutionary Type-1 Improved Fuzzy Functions experiment
Parameter                                                           Number of T1IFF iterations
Initial run: population size = N²                                   N²
Secondary run: evaluation of 1 crossover and 1 mutation operation   2N²
Total                                                               3N²
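A quick back-of-the-envelope check of the two tables, with N ~ 10 as in the text:

```python
# Iteration counts of Tables 4.1 and 4.2 for N = 10 discrete values.
N = 10
exhaustive = 2 * N ** 6        # T1IFF exhaustive search: 2N^6
genetic = N ** 2 + 2 * N ** 2  # ET1IFF: N^2 initial + 2N^2 offspring = 3N^2
print(exhaustive, genetic)     # 2000000 vs. 300
```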
It is evident that the number of iterations is dramatically reduced when ET1IFF is used for modeling instead of the T1IFF method. In addition, one of the steps of the T1IFF structure identification method is to analyze the optimum number of clusters with a cluster validity index. Cluster validity functions are not used in the proposed ET1IFF, due to the fact that the number of clusters is optimized during ET1IFF structure identification. Therefore, ET1IFF systems eliminate the need for an additional function to evaluate the number of clusters. With ET1IFF systems, we evaluate the optimum number of clusters based on the system model performance. The optimum model is chosen from among the populations with the best performance measure on the validation dataset at the end of the GA iterations.
4.6 Summary

In knowledge discovery applications, one tries to discover models with potentially higher prediction accuracy. Fuzzy functions can be useful in such cases, if one succeeds in learning them automatically while at the same time preserving their predictability. In this chapter, we outlined some fundamental properties of the new fuzzy functions systems that are introduced in place of fuzzy rule based systems. The new fuzzy functions systems are designed on the basis of fuzzy clustering and regression techniques. We discussed the structure identification and reasoning mechanisms of the fuzzy function methods for two different system domain structures: regression and classification systems. We developed a hybrid fuzzy function system using evolutionary algorithms to optimize the membership functions and fuzzy functions dynamically with less user intervention. In the final part, we briefly presented the potential to reduce the number of iterations needed to find the optimum model with the hybrid fuzzy function systems using genetic algorithms, as opposed to the standard fuzzy functions systems based on the exhaustive search method.
Chapter 5
Modeling Uncertainty with Improved Fuzzy Functions

This chapter introduces a new uncertainty modeling architecture for the new improved fuzzy functions systems¹. The theory is based on a new interval type-2 fuzzy system. The uncertainties are captured by automatic identification of the structure of fuzzy functions, and upper and lower boundaries of key parameters that define fuzzy sets². A new type reduction method is introduced.
“Inferences of Science and Common Sense differ from those of deductive logic and mathematics in a very important respect, namely, when the premises are true and the reasoning correct, the conclusion is only probable.”
—Bertrand Russell
5.1 Motivation

Clearer characterization of mathematical models is required when the uncertainty in distinguishing various facets of fuzzy systems increases. For instance, we use fuzzy sets of type-1 when there are uncertainties in determining the membership value of an element in a set. Similarly, when there are additional uncertainties in determining type-1 fuzzy sets, or the circumstances are so fuzzy that we cannot determine the membership values in a type-1 fuzzy set, we may use type-2 fuzzy sets, which were initially introduced by Prof. Zadeh in 1975³. We can continue on this line of

¹ The first part of this chapter is an extension of [Celikyilmaz and Turksen, 2007f; 2008d].
² The last section of this chapter is an extension of [Celikyilmaz and Turksen, 2007h].
³ Type-1 fuzzy sets are described by membership functions that are certain, whereas type-2 fuzzy sets are described by membership functions that are themselves type-1 fuzzy sets.
thinking and use type-3 and higher fuzzy sets, ending up with type-∞ fuzzy sets, which are not feasible; as a result, the complexity of the fuzzy logic system increases rapidly. Most of the latter uncertain circumstances arise from uncertain measurements in system models. There are situations where it is not possible to determine the exact value of a measurement. For instance, different people would obtain different results in decimals while measuring the weight of one pack of a substance, or the relevancy of a document would differ from one person to another while determining the class of relevant user requests to retrieve documents. In such cases, we use type-1 fuzzy sets instead of crisp sets; however, as Mendel [2001] points out, "…even in type-1 fuzzy sets we specify the membership function exactly, which is 'counter-intuitive'. If we cannot determine the exact value of an uncertain quantity, how can we determine its exact membership value in a type-2 fuzzy set? We need to use a type-∞ fuzzy set to 'completely' represent uncertainty". So, type-1 fuzzy sets can be thought of as a first-order approximation to the uncertainty in real life. With type-2 fuzzy sets, one tries to get at a second-order approximation. One may investigate higher types; but as one goes on to higher types, the complexity of the systems increases rapidly. Thus, in this part, we just deal with type-2 sets. In the previous chapter, Type-1 Improved Fuzzy Function (T1IFF) systems were proposed as an alternative to conventional Fuzzy Rule Base systems, by implementing the improved fuzzy functions approach to reduce the number of fuzzy operators and operations, as well as to improve system model performance. Additionally, T1IFF systems optimized with evolutionary algorithms (ET1IFF) were introduced to optimize the learning parameters by a stochastic search method. Although these systems are hypothesized to improve the performance, they have limited capacity for identifying uncertainties, because type-1 fuzzy sets are used in these systems. Let us explain the uncertainty concept in T1IFF approaches. In an analogical structure to the Type-1 Fuzzy Rule Base (FRB) as sketched in [Mendel, 2001], we identify four components in T1IFF, where we use (1) "Fuzzifier" in place of Improved Fuzzy Clustering (IFC), (2) "Rules" in place of "Fuzzy Functions", (3) "Inference" and (4) "Output Processor" in place of the function that obtains the membership-weighted average output value, in other words "Fuzzy Output Weighing". We make this distinction to emphasize that in our work we introduce processes that are distinctly different from the processes applied by Mendel's approach. Furthermore, as was pointed out, our structures are different from Mendel's, e.g., the structure of fuzzy rules and the structure of "Fuzzy Functions" are different. In addition, in Mendel's approach one generally obtains membership functions from experts, whereas in Fuzzy Functions strategies the membership values are obtained through the use of fuzzy clustering methods, which are designed to structure these fuzzy functions. We have introduced the details of each T1IFF system component separately in the previous chapter. The question we are concerned with is: why is it necessary to identify uncertainties in fuzzy function systems? Quite often, the knowledge that we use to construct the fuzzy functions is uncertain. T1IFF systems are not able to deal with fuzzy function uncertainties directly. Such uncertainties lead to fuzzy functions whose membership values are
uncertain. Type-2 fuzzy sets, on the other hand, can be very useful for identifying an inexact membership function of a fuzzy set. In this respect, they can be used to handle fuzzy function uncertainties and measurement uncertainties while determining membership values. In a T1IFF system, as shown in Figure 5.1, crisp input numbers are mapped onto fuzzy sets using fuzzy clustering methods, e.g., Improved Fuzzy Clustering (IFC). The fuzzy sets are represented by membership values in each cluster. We can easily list some possible uncertainties of T1IFF systems here: the inputs to T1IFF systems prior to fuzzification may be uncertain, and the parameters of the fuzzy clustering method may be uncertain. In particular, among the most important uncertain parameters of the improved fuzzy clustering method are: (i) the number of clusters representing the local fuzzy functions, (ii) the degree of overlap between clusters, (iii) the type of similarity measure, and, most importantly, (iv) the type of membership value calculation equation, and (v) the structure of the local models, viz., the interim fuzzy function parameters.
Fig. 5.1 Type-1 Improved Fuzzy Functions Systems. ‘*’ are Mendel’s terminology which correspond to terminology used for fuzzy function strategies in this book.
The inference (reasoning) process in Figure 5.1 determines the computation of the improved fuzzy functions. It handles to what degree the fuzzy functions are activated and how they are combined. That is, the types of functions depend on the problem domain. For classification models, classifier functions such as logistic regression are used, whereas for prediction (regression) domains regression type functions such as ordinary least squares are implemented. The inference mechanism of T1IFF systems is simpler than that of Type-1 FRB systems, since aggregation and implication operators are not required during inference. Other features of T1IFF are as follows. One fuzzy output from each fuzzy function is obtained. The fuzzy outputs can be continuous valued numbers for prediction problem domains or posterior probabilities, e.g., between 0 and 1, for classification problem domains. T1IFF systems can identify a different fuzzy function structure for each local structure (cluster). Even within one cluster, the choice of an optimum function structure may be uncertain due to parameter uncertainties. Here, the degree of membership affects the activation and the structure of a fuzzy function.
In many applications, a crisp output value must be obtained at the end of the modeling process. The output processor in Figure 5.1 converts the fuzzy output set into a crisp output. Several different defuzzification methods can be used to accomplish this. For Mamdani type fuzzy systems [Mamdani, Assilian, 1974], where the output is a fuzzy set, a defuzzification method such as center of gravity is used in order to obtain a crisp value. In T1IFF systems, since one single output from each fuzzy function is obtained, a defuzzification method is not required. Another point one should consider in T1IFF systems is that some of their essential system parameters (such as the number of clusters, the degree of fuzziness, and the parameters and structure of the fuzzy functions of each cluster) are provided by the designer before model execution, which potentially introduces various sorts of uncertainties. We used the exhaustive search method to optimize these parameters over given lists of their values. For continuous valued parameters such as the fuzzifier, m>1.001, we use a crisp (discretized) list of values such as m = {1.1, 1.3, …, 2.6, …}. Unless one uses an infinite number of values for continuous valued parameters, capturing the globally optimal solution is not certain. In fact, this may increase the processing time, which is not something we want to increase in knowledge extraction processes. Since type-1 fuzzy sets are used in these models, most of the uncertainties cannot be captured, and this may hamper the possibility of developing an optimal model. To capture some of the uncertainties of system models, as well as to improve the performance of the earlier T1IFF approaches, in this section a new Type-2 Improved Fuzzy Function System (T2IFF) is introduced. The topology of the new approach is shown in Figure 5.2. It is very similar to T1IFF systems; however, in the new type-2 fuzzy system, the fuzzifier component, i.e., the degree of fuzziness of the structures, identifies type-2 fuzzy sets, and a new case-based type reduction method reduces the type of the fuzzy sets down one level, from two to one, right at the beginning of the inference method. The rest of the components in the framework are similar to those of the T1IFF systems. The fuzzifier of T2IFF systems maps the crisp inputs onto fuzzy sets of type-2. The distinction between type-1 and type-2 systems can be identified based on the nature of the fuzzy membership values. In this book, the fuzzy membership values of T2IFF systems are expressed with interval type-2 fuzzy sets.
“RULES”* Fuzzy Functions
“FUZZIFIER”* Improved Fuzzy Clustering
“OUTPUT PROCESSOR”* Fuzzy Output Weighing
Crisp Output Type-0
Input Fuzzy Sets of Type-2
Case-Based Type-Reducer*
Input Fuzzy Sets of Type-1
“INFERENCE”
Fuzzy Output Set of Type-1
Fig. 5.2 Type-2 Improved Fuzzy Functions System. * indicate difference between Mendel’s terminology and Fuzzy Function terminology.
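The exhaustive search over discretized parameter lists described above can be sketched as follows. This is a minimal illustration, assuming hypothetical helpers `fit_t1iff` (builds a T1IFF model for a given parameter combination) and `validation_error` (scores it); neither is part of the published method.

```python
from itertools import product

# Hedged sketch of the exhaustive parameter search described above.
# fit_t1iff and validation_error are hypothetical placeholders, not
# functions defined by the T1IFF method itself.

def exhaustive_search(train, valid, fit_t1iff, validation_error):
    m_grid = [1.1, 1.3, 1.5, 1.8, 2.1, 2.6]   # discretized fuzzifier values, m > 1.001
    c_grid = [2, 3, 4, 5]                     # candidate numbers of clusters
    best = None
    for m, c in product(m_grid, c_grid):
        model = fit_t1iff(train, m=m, n_clusters=c)
        err = validation_error(model, valid)
        if best is None or err < best[0]:
            best = (err, m, c, model)
    return best  # lowest validation error and the parameters that achieved it
```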
In analogy to the structure of the Type-2 Fuzzy Logic System [Mendel, 2001], we identify five components in the Type-2 Improved Fuzzy Functions (T2IFF) strategy as follows: (1) "Fuzzifier" in place of "Improved Fuzzy Clustering (IFC)", (2) "Rules" in place of "Fuzzy Functions", (3) "Inference", (4) "Case-Based Type-Reducer", and (5) "Output Processor" in place of "Fuzzy Output Weighing". After the fuzzification step, the interval type-2 membership values are reduced down to type-1 by the new Case-Based Type-Reduction method. This method is based on a search algorithm, which does not require complex fuzzy operators as the other type reduction methods of standard Type-2 Fuzzy Logic Systems based on Fuzzy Rule Bases do [Karnik et al. 1999; Mendel 2001]. After the type of the fuzzy sets is reduced down to type-1, the rest of the steps of the new T2IFF system are similar to the T1IFF system. In this work, we introduce two new architectures of structure identification for the Type-2 Improved Fuzzy Functions (T2IFF) systems, which identify interval type-2 fuzzy functions. Therefore, we present them as two separate system modeling tools (strategies):
• Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF),
• Evolutionary Design of Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF).
The two new T2IFF strategies are structurally different from each other in the search method used to identify globally optimal solutions. Structure identification of the new DIT2IFF is based on a supervised Improved Fuzzy Clustering (IFC) method. Uncertainties are identified based on the learning parameters of the IFC and the fuzzy function structures, and the parameters are optimized by an exhaustive search. On the other hand, EDIT2IFF has a three-phase structure identification method, which initially identifies the optimum uncertainty interval of the membership values, the interval fuzzy functions, and the optimum values of the rest of the inference parameters by implementing genetic algorithms. Then, using these optimum parameters, the new system implements the DIT2IFF system structure based on the identified uncertainty interval. EDIT2IFF is proposed to reduce the number of iterations required for structure identification of DIT2IFF by optimizing the system parameters automatically with a stochastic search algorithm, as sketched below. The fact is that there are an infinite number of discrete type-1 membership values within an interval type-2 fuzzy set. Even if one tries to choose the optimum ones from within this interval, it would either take too long to find them or there is a chance of getting stuck at a local minimum and never achieving the best model at all. With the implementation of evolutionary search algorithms, we try to identify the uncertainty interval that encapsulates the optimum DIT2IFF model. There is still a chance of getting stuck at a local minimum with evolutionary methods; however, with more iterations and larger populations, this can be avoided to a certain degree. EDIT2IFF still requires fewer iterations than DIT2IFF, as will be demonstrated in the last section of this chapter. An evolutionary model is capable of shifting, locating, and possibly narrowing the interval to find the optimum interval based on the given data and interval-valued parameters.
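A minimal sketch of such an evolutionary parameter search is given below, assuming a generic real-coded genetic algorithm; the chromosome layout, selection and mutation schemes, and the hypothetical `fitness` function (validation error of the resulting model) are illustrative choices, not the exact EDIT2IFF design.

```python
import random

# Hedged sketch: a generic real-coded genetic algorithm searching for the
# uncertainty-interval parameters [m_lower, m_upper]. The fitness function
# (validation error of the model built with these bounds) is a placeholder.

def evolve(fitness, pop_size=30, generations=50, bounds=(1.1, 3.0)):
    lo, hi = bounds
    # each individual encodes a fuzzifier interval [m_lower, m_upper]
    pop = [sorted(random.uniform(lo, hi) for _ in range(2)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness)          # lower validation error is better
        parents = scored[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]           # arithmetic crossover
            child = [min(hi, max(lo, g + random.gauss(0, 0.05)))  # Gaussian mutation
                     for g in child]
            children.append(sorted(child))
        pop = parents + children
    return min(pop, key=fitness)  # best [m_lower, m_upper] found
```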
It should be pointed out that the type-2 fuzzy system modeling tools in this chapter apply the Improved Fuzzy Clustering (IFC) methodology. In addition to IFC, the Fuzzy C-Means (FCM) clustering algorithm is also used in the structure identification method of these type-2 fuzzy system modeling tools to build two new strategies:
• Discrete Interval Type-2 Fuzzy Functions (DIT2FF),
• Evolutionary Design of Discrete Interval Type-2 Fuzzy Functions (EDIT2FF).
The common characteristic of these strategies is that they implement the FCM clustering algorithm [Bezdek, 1981a] instead of the IFC clustering methods. In Chapter 4, we presented the differences between the application of FCM and IFC methods to type-1 fuzzy functions approaches. These differences also apply to the type-2 fuzzy functions methods of this chapter. Thus, we will only present details of the type-2 fuzzy functions using the Improved Fuzzy Clustering (IFC) method. In this work, we not only deal with regression models, where we estimate continuous fuzzy functions, but also extend the two novel T2IFF methods to classification problems, denoting them Discrete Interval Type-2 Improved Fuzzy Functions for Classification (DIT2IFF-C) and Evolutionary Design of Discrete Interval Type-2 Improved Fuzzy Functions for Classification (EDIT2IFF-C). The only difference between the DIT2IFF/EDIT2IFF models for regression and the DIT2IFF-C/EDIT2IFF-C models for classification is the way the output variables are represented. DIT2IFF and EDIT2IFF models approximate a continuous output variable via regression surfaces in each local domain (cluster). On the other hand, DIT2IFF-C and EDIT2IFF-C models estimate decision surfaces for each cluster to discriminate two separate classes of input data points denoted by their dichotomous output variable. The extensions of type-2 fuzzy functions methods for classification problem domains are also investigated. Hence, Discrete Interval Type-2 Fuzzy Functions for Classification (DIT2FF-C) and Evolutionary Design of Discrete Interval Type-2 Fuzzy Functions for Classification (EDIT2FF-C) are two methodologies that implement the FCM clustering method instead of the IFC method. Thus, in this chapter we will only give the details of the type-2 improved fuzzy function strategies for classification using the IFC method, since the differences between them in regression problem domains also apply in classification problem domains. In what follows, we present the uncertainty concept using fuzzy theory and preview different types of uncertainty modeling methods. We then present the novel uncertainty modeling approach, namely the Discrete Interval Type-2 Improved Fuzzy Functions system, DIT2IFF. We introduce the case-based type reduction method. Finally, we present the evolutionary design of the discrete interval type-2 improved fuzzy function system, EDIT2IFF, which is capable of automatically identifying the uncertainty bounds of DIT2IFF system parameters.
5.2 Uncertainty

Uncertainty is one of the key characteristics of any mathematical system model. It has a pivotal role in any effort to maximize the usefulness of system models [Klir and Yuan, 1995]. Uncertainty becomes very valuable when analyzed in
combination with different characteristics of models: in general, identifying uncertainty serves to reduce complexity and increase performance. The term 'modeling uncertainty' denotes the effort of identifying uncertainties when building a system model. The challenging part of modeling uncertainty is determining the optimal level of allowable uncertainty. In this work, this allowable uncertainty will be identified automatically by evolving stochastic models of the T1IFF approach. In many research papers on uncertainty modeling and books on the mathematics of uncertainty, it is generally agreed that Zadeh's seminal paper of 1965 is the starting point of the modern concept of uncertainty. In his paper, Zadeh [1965] introduced a theory whose objects are fuzzy sets, whose boundaries are not precise and for which membership is not a matter of confirmation but rather a matter of degree. Let x be an object in fuzzy set A. The proposition "x is a member of A" is expressed with a degree to which x is actually a member of fuzzy set A. The capabilities of fuzzy sets provide not only meaningful and powerful representations of measurement uncertainties, but also meaningful representations of vague concepts expressed in natural language.

Fig. 5.3 Different Types of Uncertainty: randomness, fuzziness, possibility, ambiguity, probability, imprecision, vagueness, and similarity
Among the many types of uncertainty shown in Figure 5.3, two categories emerge quite naturally: vagueness and ambiguity. Vagueness is usually defined as the difficulty of making sharp and precise distinctions in the world. Ambiguity, on the other hand, is defined as a one-to-many relation in which the choice among many alternatives is left unspecified. Each of these concepts relates to fuzzy set concepts differently. Vagueness is connected with fuzziness, for which fuzzy set theory provides a basic mathematical framework; the concept of the fuzzy measure provides the underlying framework for dealing with ambiguity. Hence, fuzzy measures and fuzzy sets reflect two fundamentally different types of uncertainty, vagueness and ambiguity, and they provide frameworks for each.
Mathematical models are simplified representations of the phenomena being studied, and a key aspect of the modeling process is the careful choice of model assumptions. The structure of the mathematical models employed to represent systems is often a key source of uncertainty arising from these assumptions. In this context, different uncertainty modeling methods are summarized and categorized into four groups:

1. Uncertainty in natural language processing. The meanings of words in natural language that correspond to fuzzy sets can differ, since the same words may be used with different meanings. This phenomenon brings about the notion of computing with words [Zadeh, 1996]. In addition, difficulty arises when one wants to model with words such as "most" or "popular", whose meanings are vague, and uncertainty in decision-making arises from non-uniform data characteristics such as noisy data.

2. Uncertainty defined in the breakdown of classical equations. More explicitly, as proposed by Turksen [1986, 1989], the representation of a combined linguistic expression of linguistic values with linguistic connectives should not be mapped arbitrarily to one of the two fuzzy canonical forms, but to the "interval" generated by both fuzzy canonical forms. This is required because Turksen proved that the equivalence DNF(.)≡CNF(.) breaks down in fuzzy theory, i.e., FDCF(.)≠FCCF(.) (FCCF: Fuzzy Conjunctive Canonical Form; FDCF: Fuzzy Disjunctive Canonical Form) for certain families of t-norms and t-conorms; more specifically, he showed that the containment between the two fuzzy canonical forms should be represented as FDCF(.)⊆FCCF(.) for Archimedean t-norms and t-conorms.

3. Uncertainty defined with type-2 fuzzy sets, which handles uncertainty about the meanings of words. All the uncertainty about the words can be captured by blurring the boundaries of a type-1 membership function into a footprint of uncertainty, which is then called a type-2 membership function [Mendel and Karnik, 1998; Mendel et al., 2006; Mendel and Liang, 2000; Klir and Folger, 1988; Klir and Wierman, 1998; etc.].

4. Uncertainty defined with the changing values of parameters (attributes) of fuzzy systems, e.g., degree of freedom, number of clusters, alpha-cut, fuzzy function structures, etc. Turksen [2001; 2002], Uncu and Turksen [2007], Uncu et al. [2004a,b], Uncu [2003], and Ozkan and Turksen [2004, 2007] identified other sources of uncertainty and introduced type-2 fuzziness analysis, which can explain uncertainty about the parameters of fuzzy systems, especially the unknown interval values of the degree of fuzziness of the Fuzzy c-Means (FCM) clustering algorithm. In [Ozkan and Turksen, 2007], the lower and upper boundaries of the level of fuzziness parameter of the FCM clustering method are shown to be [1.4, 2.6] based on numerical analysis. In addition to investigating the boundaries of the level of fuzziness of system models, in this work an additional source of parameter uncertainty is used, namely the structure uncertainty of the fuzzy functions [Turksen, 2006; Turksen and Celikyilmaz, 2006;
Celikyilmaz and Turksen, 2007g, 2007h]. This uncertainty emerges from the identification of different membership value transformations in approximating local fuzzy functions. In this chapter, the last two categories are explained in more detail. For the first two categories, references on computing with words [Zadeh, 1996; 2001; Turksen, 2002; 2007] are suggested.
5.3 Conventional Type-2 Fuzzy Systems

Research on type-2 fuzzy systems is a growing area today, and more publications appear in recent conferences than ever before. This section briefly introduces well-known type-2 fuzzy systems and their recent advancements, which were mentioned in the last two categories of uncertainty modeling approaches in the previous subsection. These systems are based on fuzzy rule base approaches and implement type-2 fuzzy sets (type-2 fuzzy sets and operations are reviewed in Chapter 2). This subsection presents a brief description of the most commonly implemented and applied type-2 fuzzy systems, presented in a series of papers by Mendel, Karnik, and Liang [Mendel and Karnik, 1998; Liang and Mendel, 2000; Mendel et al., 1999; Mendel and Liang, 2000; Mendel, 2001; 2003]. Additionally, some recent developments on type-2 fuzzy systems using discrete interval type-2 fuzzy sets, e.g., [Uncu and Turksen, 2007], are reviewed, and comparative analyses between the proposed and earlier discrete type-2 fuzzy systems are discussed. It should be recalled that there are two types of type-2 fuzzy sets, generalized type-2 and interval type-2 fuzzy sets, as presented in Chapter 2. Type-2 fuzzy systems are categorized based on these two different types of fuzzy sets. Hence, 'Generalized Type-2 Fuzzy Systems' is the general name given to type-2 fuzzy systems whose antecedent and/or consequent parts of the rule base implement a generalized type-2 fuzzy set, while interval type-2 fuzzy systems use interval type-2 fuzzy sets.
5.3.1 Generalized Type-2 Fuzzy Rule Bases Systems (GT2FRB)

A generalized Type-2 Fuzzy Rule Base (GT2FRB) [Mendel and Karnik, 1999] is an extended version of the Type-1 Fuzzy Rule Base structure, which was presented in Chapter 2. The GT2FRB is represented by:

$$\tilde{R}_i: \text{IF } \operatorname*{AND}_{j=1}^{nv} \left( x_j \in X_j \text{ isr } \tilde{A}_{ij} \right) \text{ THEN } y \in Y \text{ isr } \tilde{B}_i \tag{5.1}$$

Let u and w represent primary membership values, and f(u) and f(w) their secondary membership values. In (5.1), Ã_ij is the linguistic label associated with the jth input variable in the ith rule, represented by a type-2 membership function $\tilde{\mu}_i(x_j): X_j \to \int_{u \in J_{x_j}} f_{x_j}(u)/u,\ u \in J_{x_j},\ J_{x_j} \subseteq [0,1]$, which is depicted in Figure 5.4. $\tilde{B}_i$ is the linguistic label associated with the output variable in the ith rule, with type-2 membership function $\tilde{\mu}_i(y): Y \to \int_{w \in J_y} f_y(w)/w,\ w \in J_y,\ J_y \subseteq [0,1]$.
Fig. 5.4 Example of a full type-2 membership function. The shaded area is the 'Footprint of Uncertainty' (FOU); the amplitudes of the sticks are the secondary membership values.
Each combination (x, u, f(u)) denotes an embedded set, and there may be an infinite number of embedded sets for the universe of discourse X and J_x. The operations of GT2FRB are very much the same as the Type-1 FRB operations; however, in GT2FRB systems, type-2 fuzzy operations are used. This introduces an additional step, type reduction, as shown in Figure 5.5.

Fig. 5.5 The structure of Generalized Type-2 Fuzzy Rule Base Systems (GT2FRB): fuzzification and aggregation of antecedents (identification of degree of fire) map inputs to fuzzy input sets; rules, implication, and aggregation of the deduced output fuzzy sets produce fuzzy output sets; type reduction reduces type-2 to type-1, and defuzzification reduces type-1 to a type-0 crisp output
The following type-2 fuzzy rule base operations summarize definitions extracted from [Mendel, 2001] using PRODUCT, MIN or MAX connectives via the Mamdani approach. In general, however, other t-norms and t-conorms can also be used. From (5.1), the antecedent type-2 fuzzy set Ã_i is determined as follows:

$$\tilde{A}_i = \bigcap_{j=1}^{nv} \tilde{A}_{ij}, \quad \forall i = 1,\dots,c \tag{5.2}$$

In (5.2), the membership value of an object x is calculated by

$$\tilde{\mu}_i = \sqcap_{j=1}^{nv}\; \tilde{\mu}_i(x_j) \tag{5.3}$$
It should be noted that there is an independence assumption in type-2 fuzzy system models, in that each type-2 fuzzy set associated with each antecedent variable is defined separately in each rule. The meet operator ⊓, defined in Chapter 2, is used to combine type-2 fuzzy sets. Consequently, the relation between inputs and output in each rule $\tilde{R}_i$ is determined by the ith type-2 fuzzy relation as follows:

$$\tilde{\mu}_{\tilde{R}_i}(x,y) = \tilde{\mu}_{\tilde{A}_i}(x) \to \tilde{\mu}_{\tilde{B}_i}(y) = \left( \sqcap_{j=1}^{nv}\; \tilde{\mu}_{\tilde{A}_{ij}}(x_j) \right) \sqcap\; \tilde{\mu}_{\tilde{B}_i}(y) \tag{5.4}$$
Equation (5.4) is an extension of equation (2.80) to type-2 fuzzy sets. From (5.4), the output fuzzy set of each rule, $\tilde{B}_i^*$, is calculated using Generalized Modus Ponens as follows:

$$\tilde{B}_i^* = \tilde{A}_i' \circ \left( \tilde{A}_i \to \tilde{B}_i \right) \tag{5.5}$$

For a given observation x′, (5.5) is represented with membership functions by:

$$\tilde{\mu}_i^*(y) = \sqcup_x \left[ \sqcap \left( \tilde{\mu}'(x'),\; \tilde{\mu}_{\tilde{R}_i}(x,y) \right) \right] = \sqcup_x \left[ \tilde{\mu}_{\tilde{R}_i}(x,y) \right], \quad \forall i = 1,\dots,c \tag{5.6}$$

Since the given observation x′ is a crisp number, (5.6) reduces to $\sqcup_x [\tilde{\mu}_{\tilde{R}_i}(x', y)]$. Finally, the output fuzzy sets of all rules i, i=1,…,c, are aggregated to obtain one fuzzy set:

$$\tilde{\mu}^*(y) = \sqcup_{i=1}^{c}\; \tilde{\mu}_i^*(y) \tag{5.7}$$
Mendel and Karnik's GT2FRB strategy is an extension of the Type-1 FRB using type-2 fuzzy sets and the extension principle proposed by Zadeh [1975a]. GT2FRB strategies [Mendel, 2001] can be computationally challenging when the number of variables and the number of data vectors are large. The output set corresponding to each rule of the type-2 fuzzy system, as shown in (5.1), is a type-2 fuzzy set. After the implication step in (5.6), defuzzification takes place. Defuzzification of a type-2 fuzzy set consists of two stages:
• type-reduction, which creates a type-1 fuzzy set, and
• defuzzification of the resultant type-1 fuzzy set.
The second stage, defuzzification, is much easier to implement than the first, which is more problematic. Type-reduction relies on the concept of embedded type-2 sets. A type-reducer combines all the type-2 output fuzzy sets, just as a type-1 defuzzifier does, and then performs a centroid calculation on the combined type-2 fuzzy set (similar to the center of gravity of type-1 fuzzy sets, as presented in Chapter 2). This yields a type-1 fuzzy set; the operation is therefore named "type-reduction". Next, we present the most common type reduction method. The challenge with the type-reduction method is that the
centroid and membership value computations have to be repeated for every (x, u, f(u)) pair, and there can be an infinite number of membership functions. In experiments, these intervals can be discretized into embedded sets; however, the degree of discretization is unknown. Therefore, computational complexity can be high for generalized type-2 fuzzy sets if the discretization step is very small, i.e., precision is high, which results in too many embedded sets. This would increase the complexity of the system modeling strategy. Most recently, to reduce some of the complexity of type-2 fuzzy logic systems, a geometric approach to implementing GT2FRB was proposed in [Coupland and John, 2007]. Although it presents promising results, it is still at the development stage, and the secondary membership values are still determined by experts. A challenging part of generalized type-2 fuzzy systems is the uncertainty in the determination of secondary membership values. For instance, the primary membership values can be determined from a given dataset by fuzzy clustering methods. Even the footprint of uncertainty can be defined by varying the fuzziness parameter of the FCM clustering algorithm [Hwang, Rhee, 2007; Ozkan and Turksen, 2007]; however, to date there is no algorithm that can automatically extract the secondary membership functions from the data. Initial studies on the automatic determination of full type-2 membership values, i.e., the secondary membership values, are presented in [Celikyilmaz and Turksen, 2008a]; they use fuzzy regression clustering methods, which are rather simple compared to earlier type-2 fuzzy systems, to identify secondary membership functions. To cope with these computational complexities of generalized type-2 fuzzy systems, many researchers today use interval type-2 fuzzy structure identification and inference methods instead of GT2FRB.
5.3.2 Interval Valued Type-2 Fuzzy Rule Bases Systems (IT2FRB)
In the previous section, the generalized Type-2 Fuzzy Rule Base (GT2FRB) proposed by Karnik and Mendel [Mendel and Karnik, 1998; Karnik et al., 1999] was presented. Karnik and Mendel pointed out the complexity of GT2FRB systems; thus, Liang and Mendel proposed a special case of GT2FRB systems in [Liang and Mendel, 2000] in which only interval type-2 fuzzy sets are used. In an analogous manner to the GT2FRB systems, the Interval Valued Type-2 Fuzzy Rule Base (IT2FRB) systems considered only crisp MIN, PRODUCT and MAX connectives; in general, however, other t-norms and t-conorms can also be used. Here we give a brief definition of interval type-2 fuzzy sets as explained in Chapter 2. Let Ã be a linguistic label with a type-2 membership function on the universe of discourse of base variable x, $\tilde{\mu}_i(x_j): X_j \to \int_{u \in J_{x_j}} f_{x_j}(u)/u,\ u \in J_{x_j},\ J_{x_j} \subseteq [0,1]$. In order to consider $\tilde{\mu}_{\tilde{A}}(x)$ an interval valued type-2 membership function, the following condition has to be satisfied:

$$f_x(u) = 1, \quad \forall x \in X,\ u \in J_x,\ J_x \subseteq [0,1] \tag{5.8}$$
Let the upper, $\mu^U_{\tilde{A}}(x)$, and lower, $\mu^L_{\tilde{A}}(x)$, membership values of a base variable be defined as follows:

$$\mu^U_{\tilde{A}}(x) \in J^U_x, \quad \mu^L_{\tilde{A}}(x) \in J^L_x \tag{5.9}$$

The upper and lower bounds are denoted by $J^U_x$ and $J^L_x$ [Liang and Mendel, 2000]. Then, each data vector is assigned a secondary membership value of 1 between the upper and lower boundaries:

$$\mu_{\tilde{A}}(x): X \to 1/u, \quad u \in \left[ \mu^L_{\tilde{A}}(x), \mu^U_{\tilde{A}}(x) \right] \tag{5.10}$$

As in the GT2FRB approach, we define the interval type-2 fuzzy rule base structure (IT2FRB) as follows:

$$\tilde{R}_i: \text{IF } \operatorname*{AND}_{j=1}^{nv} \left( x_j \in X_j \text{ isr } \tilde{A}_{ij} \right) \text{ THEN } y \in Y \text{ isr } \tilde{B}_i \tag{5.11}$$

where Ã_ij is the linguistic label associated with the jth antecedent in the ith rule, with type-2 membership function $\tilde{\mu}_i(x_j): X \to 1/u,\ u \in [\tilde{\mu}^L_i(x_j), \tilde{\mu}^U_i(x_j)]$. In the general case, each ith rule is represented by:

$$\tilde{R}_i: \text{IF } x \in X \text{ isr } \tilde{A}_i \text{ THEN } y \in Y \text{ isr } \tilde{B}_i \tag{5.12}$$

where Ã_i is the linguistic label associated with the aggregated antecedents in the ith rule, with type-2 membership function $\tilde{\mu}_i(x): X \to 1/u,\ u \in [\tilde{\mu}^L_i(x), \tilde{\mu}^U_i(x)]$. Aggregated antecedent membership values for each rule are calculated by

$$\tilde{\mu}^L_i(x) = T_{j=1}^{nv} \left( \tilde{\mu}^L_i(x_j) \right), \quad \tilde{\mu}^U_i(x) = T_{j=1}^{nv} \left( \tilde{\mu}^U_i(x_j) \right) \tag{5.13}$$

In (5.13), T denotes the t-norm connective. Interval valued type-2 membership functions further reduce (5.6) as follows:

$$\tilde{\mu}^*_i(y): Y \to 1/w, \quad w \in \left[ \tilde{\mu}^{*L}_i(y), \tilde{\mu}^{*U}_i(y) \right] \tag{5.14}$$

In (5.14), $[\tilde{\mu}^{*L}_i(y), \tilde{\mu}^{*U}_i(y)]$ are the lower and upper membership functions that define the boundaries of the interval valued type-2 model output fuzzy set $\tilde{\mu}^*_i(y)$. Let T denote the t-norm connective; then, for each given data object, the interval model output membership function in the ith rule is calculated as follows:

$$\tilde{\mu}^{*L}_i(y) = T \left( \tilde{\mu}^L_i(x), \tilde{\mu}^L_i(y) \right), \quad \tilde{\mu}^{*U}_i(y) = T \left( \tilde{\mu}^U_i(x), \tilde{\mu}^U_i(y) \right) \tag{5.15}$$

In order to obtain one single type-2 membership function of the final model output fuzzy set, $\tilde{\mu}^*(y)$, the output type-2 membership functions of the rules are aggregated as follows:

$$\tilde{\mu}^{*L}(y) = S_{i=1}^{c^*} \left( \tilde{\mu}^L_i(y) \right), \quad \tilde{\mu}^{*U}(y) = S_{i=1}^{c^*} \left( \tilde{\mu}^U_i(y) \right) \tag{5.16}$$

In (5.16), S denotes the t-conorm, which was taken as the MAX operator in [Liang and Mendel, 2000]. In order to obtain a crisp value, type-reduction can be applied on $[\tilde{\mu}^{*L}_i(y), \tilde{\mu}^{*U}_i(y)]$ as explained in the previous section, with the single distinction that the secondary membership functions only take on the value 1. Thus the output of type-reduction applied on $[\tilde{\mu}^{*L}_i(y), \tilde{\mu}^{*U}_i(y)]$, $\tilde{y}^*$, is an interval whose lower and upper bounds are denoted $y^{*L}$ and $y^{*U}$ respectively, i.e., $\tilde{y}^* \in [y^{*L}, y^{*U}]$. If the goal is to obtain a crisp output, Liang and Mendel proposed to calculate the model output, $y^*$, as follows:

$$y^* = (y^{*L} + y^{*U})/2 \tag{5.17}$$
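The interval computations in (5.13)-(5.17) are simple enough to sketch end-to-end. The following is a minimal illustration, assuming min for the t-norm T, max for the t-conorm S, and precomputed lower/upper antecedent and consequent membership values at a fixed output point; the array names and numbers are illustrative, not part of the IT2FRB formulation.

```python
import numpy as np

# Hedged sketch of interval type-2 inference, eqs. (5.13)-(5.17), with
# T = min and S = max. mu_ant_* hold per-rule, per-antecedent membership
# bounds; mu_con_* hold per-rule consequent bounds at one output point y.

mu_ant_L = np.array([[0.3, 0.5], [0.6, 0.2]])   # shape: (rules, antecedents)
mu_ant_U = np.array([[0.5, 0.7], [0.8, 0.4]])
mu_con_L = np.array([0.4, 0.3])                 # one value per rule
mu_con_U = np.array([0.6, 0.5])

fire_L = mu_ant_L.min(axis=1)                   # (5.13): lower firing levels
fire_U = mu_ant_U.min(axis=1)                   # (5.13): upper firing levels

out_L = np.minimum(fire_L, mu_con_L)            # (5.15): lower rule outputs
out_U = np.minimum(fire_U, mu_con_U)            # (5.15): upper rule outputs

agg_L, agg_U = out_L.max(), out_U.max()         # (5.16): aggregate with MAX

# After type reduction yields an interval [y_L, y_U], (5.17) takes the midpoint.
y_L, y_U = 1.2, 1.8                             # illustrative type-reduced bounds
y_crisp = (y_L + y_U) / 2                       # (5.17)
```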
5.3.3 Most Common Type-Reduction Methods

The output fuzzy set corresponding to each rule of a Generalized or Interval Type-2 Fuzzy Rule Base system is a type-2 fuzzy set. In the general case, the type-reducer combines the type-2 output fuzzy sets to perform a centroid calculation. As a result, the type of the fuzzy set is reduced from type-2 to type-1, and the resulting type-1 fuzzy set is called the "type-reduced" set. Here we introduce the type reduction method for generalized and interval type-2 fuzzy systems. A generalized type-2 fuzzy set, Ã, has primary membership values (type-2 membership functions), $\mu_{\tilde{A}}(x)$, i.e., $\tilde{A} = \{((x,u), \mu_{\tilde{A}}(x,u)) \mid \forall x \in X,\ \forall u \in J_x \subseteq [0,1]\}$, in which $0 \le \mu_{\tilde{A}}(x,u) \le 1$; these are mapped to a secondary membership function, which maps values in [0,1] to values in [0,1]. $J_x$ is the primary membership function of x, and the amplitude of a secondary membership function is called a secondary grade, $f_x(u)$, where the u's are the primary membership values. The type-2 fuzzy set is represented by:

$$\tilde{A} = \int_{x \in X} \left[ \int_{u \in J_x} f_x(u)/u \right] \Big/ x \tag{5.18}$$
In (5.18), $J_x \subseteq [0,1]$, $x \in X$, $u \in [0,1]$, $f_x(u) \in [0,1]$ [Mendel et al., 2006]. Hence, each datum x has many primary membership values, u, each of which has a secondary degree of membership, as shown in Fig. 2.9, that assigns the degree to which u belongs to x. The interval type-2 fuzzy set, Ã, on the other hand, has primary membership values $\mu_{\tilde{A}}(x)$ that are mapped to a secondary membership function which maps values in [0,1] to values in {0,1}. Let the domain of the secondary membership function be denoted by $J_x$; then

$$\tilde{A} = \int_{x \in X} \left[ \int_{u \in J_x} 1/u \right] \Big/ x \tag{5.19}$$

Interval type-2 membership functions are thus a restricted version of generalized type-2 fuzzy sets, in which the secondary membership grade is always 1. Only the boundaries of the membership values are defined for each datum, and these form the footprint of
uncertainty, as shown in Fig. 2.11. This restriction allows interval type-2 fuzzy sets to be processed much more quickly than generalized type-2 fuzzy sets. Interval type-2 fuzzy sets are bounded by upper and lower type-1 fuzzy sets. Type reduction takes a type-2 fuzzy set and reduces its type down to a type-1 fuzzy set; type-reducing an interval type-2 fuzzy set results in a crisp interval or discretized values. Next, type reduction for both kinds of type-2 fuzzy sets is briefly reviewed. The most commonly used type-reduction method for type-2 fuzzy rule base systems is the 'centroid type-reducer'. In this section, we review only this one type-reducer in order to show how a well-known type reduction can be applied; more detailed information about other type-reducers can be found in [Mendel, 2001]. A general type reduction works as follows. Type-2 fuzzy sets can be represented as a collection of embedded type-2 fuzzy sets [Mendel et al., 2006]. An embedded set is contained in a type-2 set in such a way that for every value of x there is one value of u (primary membership), with the secondary membership grade determined by f(u). Each of these embedded type-2 fuzzy sets has a centroid that can be calculated in different ways; one common method is the centroid method. Each of these centroid values provides a point in the domain of the type-reduced set. The centroid of a type-1 fuzzy set A, whose domain $x \in X$ is discretized into N points, $x_1, \dots, x_N$, is given by:

$$c_A = \frac{\sum_{k=1}^{N} x_k \mu_A(x_k)}{\sum_{k=1}^{N} \mu_A(x_k)} \tag{5.20}$$
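As a reference point for the equations that follow, (5.20) amounts to a one-line weighted average; a minimal sketch:

```python
import numpy as np

# Centroid of a discretized type-1 fuzzy set, eq. (5.20):
# a membership-weighted average of the domain points.
def centroid(x, mu):
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    return (x * mu).sum() / mu.sum()

# e.g., a triangular set peaking at 2.0
print(centroid([1.0, 1.5, 2.0, 2.5, 3.0], [0.2, 0.6, 1.0, 0.6, 0.2]))  # -> 2.0
```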
Similarly, the centroid of a type-2 set Ã, $\tilde{A} = \{((x,u), \mu_{\tilde{A}}(x,u))\}$, whose x domain is discretized into N points, so that $\tilde{A} = \sum_{i=1}^{N} [\int_{u \in J_{x_i}} f_{x_i}(u)/u]/x_i$, can be defined using the Extension Principle as follows:

$$C_{\tilde{A}} = \int_{\theta_1 \in J_{x_1}} \cdots \int_{\theta_N \in J_{x_N}} \left[ f_{x_1}(\theta_1) \star \cdots \star f_{x_N}(\theta_N) \right] \Big/ \frac{\sum_{k=1}^{N} x_k \theta_k}{\sum_{k=1}^{N} \theta_k} \tag{5.21}$$

In (5.21), $C_{\tilde{A}}$ is a type-1 fuzzy set, ★ represents a general t-norm (minimum or product), and ∫ is the union operator. Every combination $\theta_1, \dots, \theta_N$ and its associated secondary grade $f_{x_1}(\theta_1) \star \cdots \star f_{x_N}(\theta_N)$ forms an embedded type-2 set of Ã, and we can express them as follows:

$$a(\theta) = \frac{\sum_{k=1}^{N} x_k \theta_k}{\sum_{k=1}^{N} \theta_k}; \quad b(\theta) = f_{x_1}(\theta_1) \star \cdots \star f_{x_N}(\theta_N) \tag{5.22}$$
The computation of $C_{\tilde{A}}$ involves computing the tuple (a(θ), b(θ)) as many times as possible, possibly infinitely many times. A special sequence of computations to obtain $C_{\tilde{A}}$ for discretized intervals is as follows (a sketch is given after this list):
• Discretize the x domain into N points, $x_1, \dots, x_N$.
• Discretize each $J_{x_k}$ (the primary membership of $x_k$) into a suitable number of points, say $M_k$.
• Enumerate all the embedded type-1 sets: there will be $\prod_k M_k$, k=1…N, embedded type-1 fuzzy sets.
• Compute the centroid using (5.21), i.e., compute the tuples $(a_l, b_l)$, $l = 1, 2, \dots, \prod_k M_k$, where $a_l$ and $b_l$ are given in (5.22). In this case, $p = \prod_k M_k$, k=1…N.
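A brute-force sketch of this enumeration for a toy discretization is shown below; it is only practical for very small N and M_k, which is exactly the combinatorial blow-up discussed in the text. The t-norm is taken as the product, one of the choices named for (5.21); all values are illustrative.

```python
from itertools import product

# Hedged sketch: brute-force centroid type reduction, eqs. (5.21)-(5.22).
# x: domain points; J: discretized primary memberships per point;
# f: secondary grades aligned with J. All values are toy numbers.
x = [1.0, 2.0, 3.0]
J = [[0.2, 0.4], [0.5, 0.7], [0.3, 0.6]]          # theta_k candidates per x_k
f = [[0.8, 1.0], [1.0, 0.9], [0.7, 1.0]]          # f_{x_k}(theta_k) per candidate

type_reduced = {}                                  # a(theta) -> b(theta)
for combo in product(*[range(len(j)) for j in J]): # one embedded set per combo
    theta = [J[k][i] for k, i in enumerate(combo)]
    a = sum(xk * t for xk, t in zip(x, theta)) / sum(theta)   # (5.22): centroid
    b = 1.0
    for k, i in enumerate(combo):                  # (5.22): product t-norm of grades
        b *= f[k][i]
    type_reduced[a] = max(type_reduced.get(a, 0.0), b)  # union keeps max grade

print(sorted(type_reduced.items()))                # the type-reduced (type-1) set
```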
A simpler choice is to implement interval type-2 fuzzy sets and use a type-reducer for interval type-2 fuzzy sets. The generalized centroid of an interval type-2 fuzzy set Ã over the domain X is given by:

$$GC_{\tilde{A}} = \int_{\theta_1 \in J_{x_1}} \cdots \int_{\theta_N \in J_{x_N}} \frac{\sum_{k=1}^{N} x_k \theta_k}{\sum_{k=1}^{N} \theta_k} = [C_l, C_r] \tag{5.23}$$

In (5.23), $J_{x_N}$ is the domain of the primary membership at $x_N$ and $x \in X$. Since the result is a crisp interval, the type-reduced set only needs two end-points to define it, $C_{left}$ and $C_{right}$, i.e., $C_l$, $C_r$. Briefly, the goal in type reduction is to identify the embedded sets that represent the centroid of a given set. With generalized type-2 fuzzy sets, the abundance of embedded sets makes type-reduction by far the most computationally expensive stage of the inference process. In this work, we solve this problem with a different type-reducer based on a semi-parametric "case-based" method. The IT2FRB methods presented above are the most commonly used uncertainty modeling tools based on fuzzy rule base methodologies. One of the challenges of these methodologies is that they require quite a lot of parameter initializations. Generally, these system models are subjective, since expert knowledge is required to decide on the shapes of the fuzzy sets, e.g., triangular, Gaussian, etc., their parameters, the number of rules, and so on. These models also assume that antecedent variables are separable and neglect their interactivity, which increases the number of fuzzy operators needed to conduct fuzzification. Recent investigations mostly lean towards new approaches to find different shapes of fuzzy sets and to further reduce the complexity and subjectivity of type-2 fuzzy systems. One of the recent investigations [Uncu, 2003; Uncu and Turksen, 2007], to be discussed next, implements a discrete interval type-2 fuzzy rule base algorithm, DIT2FRB, in which the interval membership functions are obtained from the fuzzy c-means (FCM) clustering method and the consequents are represented by linear functions. The discretization concept in the DIT2FRB strategy is similar to that of the new type-2 strategy of this work.
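For interval sets, the endpoints $C_l$ and $C_r$ in (5.23) are usually obtained with the iterative procedure of Karnik and Mendel [Karnik et al., 1999] rather than by enumeration. A minimal sketch of that procedure is given below, assuming the domain points are sorted in ascending order; the variable names and toy data are ours.

```python
import numpy as np

# Hedged sketch of the Karnik-Mendel iterations [Karnik et al., 1999] for the
# endpoints [C_l, C_r] of eq. (5.23). x must be sorted ascending; lo/hi are the
# lower and upper primary membership bounds at each x_k.

def km_endpoint(x, lo, hi, right=True):
    x, lo, hi = map(np.asarray, (x, lo, hi))
    theta = (lo + hi) / 2.0                        # start from midpoints
    c = (x * theta).sum() / theta.sum()
    while True:
        k = np.searchsorted(x, c)                  # switch point around c
        # right endpoint: lower weights left of the switch, upper right of it
        # (and vice versa for the left endpoint)
        theta = np.where(np.arange(len(x)) < k,
                         lo if right else hi,
                         hi if right else lo)
        c_new = (x * theta).sum() / theta.sum()
        if np.isclose(c_new, c):
            return c_new
        c = c_new

x = np.array([1.0, 2.0, 3.0, 4.0])
lo = np.array([0.2, 0.5, 0.3, 0.1])
hi = np.array([0.4, 0.9, 0.7, 0.3])
C_l = km_endpoint(x, lo, hi, right=False)
C_r = km_endpoint(x, lo, hi, right=True)
print(C_l, C_r)   # the type-reduced interval [C_l, C_r]
```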
5.3.4 Discrete Interval Type-2 Fuzzy Rule Bases (DIT2FRB)

Mendel identifies the type-2 fuzzy set parameters associated with each variable in each rule mostly using supervised learning methods. Turksen [1999] introduced the concept of implementing the Fuzzy c-Means (FCM) [Bezdek, 1981a] clustering algorithm to identify type-2 fuzziness in system models. In [Kilic, 2002; Uncu et al., 2004a; Uncu and Turksen, 2007], the FCM clustering algorithm is used to identify the structure of the system models. In [Kilic, 2002], a new type-2 FCM algorithm is constructed using
upper and lower values of the overlapping constant of the clusters. Generally, these studies assume that the selection of the degree of fuzziness parameter, m, of the FCM clustering algorithm is the only source of uncertainty, and they identify embedded type-1 fuzzy sets for each m to represent a Discrete Interval Valued Type-2 Fuzzy Rule Base System (DIT2FRB). Exhaustive search methods are used to optimize the parameters of the fuzzy rule base structures. Here we briefly explain the most recent version of these type-2 studies [Uncu and Turksen, 2007]. Let $m^r$ be the rth level of fuzziness, $m^r \in \{m^1, \dots, m^M\}$, where M is the number of disjoint m values. Thus, the rth embedded type-1 FRB structure is identified by using $m^r$ as the learning parameter. $\mu_{A^r}$ represents the membership function obtained using the $m^r$ value. Each fuzzy rule represents one embedded type-1 fuzzy set $A^r$. They construct a Takagi-Sugeno (TS) rule base structure, represented by $\tilde{R}_i^r$, i=1,…,c*, as follows:

$$\tilde{R}_i^r: \text{IF } x \in X \text{ isr } \tilde{A}_i \text{ THEN } \tilde{y}_i = \tilde{a} x^T + \tilde{b}_i \tag{5.24}$$

or, more specifically,

$$\tilde{R}_i^r: \text{IF } x \in X \text{ isr } A_i^r \text{ THEN } y_i^r = a_i^r x^T + b_i^r \tag{5.25}$$
In (5.25), r=1…M, and $a_i^r x^T + b_i^r$ gives the regression coefficients associated with the ith rule of the rth embedded type-1 fuzzy system. Thus, the problem of building DIT2FRB systems is reduced to finding type-1 fuzzy rules. They also implement Mizumoto [1989] type fuzzy rule base system models instead of Takagi-Sugeno-Kang models; Mizumoto fuzzy rule base systems require a defuzzification step to obtain a crisp output from the output fuzzy sets. They assume interactivity between antecedents, so they assign one joint input membership function to the complete antecedent part of each rule using the Fuzzy c-Means (FCM) clustering algorithm.

Fig. 5.6 DIT2FRB inference strategy of [Uncu, 2003; Uncu and Turksen, 2007]: a new data vector is fuzzified against the discrete type-2 fuzzy rule base (identification of degree of fire), type reduction selects an embedded type-1 system via the m-lookup table, implication and aggregation of the deduced model outputs for each rule follow, and defuzzification yields the deduced output
Their algorithm initially captures one output value for each possible ⟨m^r, c⟩ pair by using the min-max fuzzy inference method. For each datum, $x_k$, k=1,…,n, the ⟨m^r, c⟩ that yields the minimum prediction error is retained in an m-lookup table as follows:

$$r_k^c = \left( q \;\Big|\; \left( y_k - \hat{y}_k^{q,c} \right)^2 = \min_{r=1}^{M} \left( y_k - \hat{y}_k^{r,c} \right)^2 \right), \quad q = 1,\dots,M,\; k = 1,\dots,n,\; c = c_{min},\dots,c_{max} \tag{5.26}$$
Here $c_{min}$ and $c_{max}$ are the bounds on the number of clusters over which the algorithm is iterated. The $m^r$ value that yields the minimum error for the kth data vector is selected as the best m value for that data vector and retained in the lookup table. When a new testing vector, x′, is introduced, the first step is type reduction, as shown in Figure 5.6. They identify the closest training data vector $x_{k'}$ based on Euclidean distance as follows:

$$x_{k'} = \left( x_k \;\Big|\; d(x_k, x') = \min_{k=1}^{n} d(x_k, x') \right), \quad k = 1,\dots,n \tag{5.27}$$
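The type-reduction step of (5.27) is essentially a nearest-neighbor lookup. A minimal sketch, assuming a precomputed lookup table mapping each training index to its best m value and hypothetical per-m embedded models:

```python
import numpy as np

# Hedged sketch of the DIT2FRB type-reduction lookup, eq. (5.27):
# pick the best m of the nearest training vector, then infer with the
# embedded type-1 model trained with that m. `models` is a placeholder
# dict {m: embedded_type1_model_with_predict}.

def infer(x_new, X_train, m_lookup, models):
    d = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    k_star = int(np.argmin(d))                    # closest training vector
    m_star = m_lookup[k_star]                     # its best m from the table
    return models[m_star].predict(x_new)          # embedded type-1 inference
```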
The model output of a given observation x′ using the optimum embedded type-1 fuzzy system model is determined by the value of the m parameter of $x_{k'}$ in the lookup table. After selecting the best parameter, the inference is the same as type-1 fuzzy rule base inference, as explained in Chapter 2. Depending on the rule base structure identified, e.g., Takagi-Sugeno, Mizumoto, etc., the reasoning calculations differ. A detailed mathematical background of the algorithm can be found in [Uncu, 2003; Uncu and Turksen, 2007]. The challenging part of the DIT2FRB method is determining the initial parameters, e.g.:
• the M value used to discretize the membership interval into discrete membership values,
• the bounds on the number of clusters, $c_{min}$ and $c_{max}$,
• the type of the fuzzy operators, MIN, MAX, PRODUCT, etc., used in the fuzzy rule base structure.
With DIT2FRB, Uncu and Turksen [2007] identify an embedded Type-1 Fuzzy Rule Base (FRB) for each selected m to represent a Discrete Interval Type-2 FRB. Their fuzzy rule base is essentially a collection of embedded type-1 fuzzy systems. They assume interactivity between antecedent variables, i.e., the antecedent fuzzy sets are not independent of each other, and only one membership function is defined for the entire antecedent part of each rule, which combines the fuzzification and antecedent aggregation steps into one and reduces the amount of work. Their inference mechanism does not require a defuzzification step when first-order Takagi-Sugeno-Kang (TSK) fuzzy systems [Takagi and Sugeno, 1995; Sugeno and Kang, 1996] are used; only a type reduction is conducted using an additional parameter from the learning stage, the m-lookup table. Inference is conducted by assigning one embedded Type-1 FSM to each vector. In a more recent study, Garibaldi and Ozen [2007] investigated the process of modeling the variation in human decision-making by applying a fuzzy expert system. They introduced a non-stationary fuzzy inference approach. Their
underlying idea is that any conventional type-1 fuzzy expert system always provides the same output(s) when supplied with the same input(s). However, experts do not always make the same decisions; even one expert can make a different decision from one day to the next. Considering that expert decisions are vague, they designed a method which accounts for these variations in the outcome of the system. Type-1 fuzzy systems are deterministic and do not allow this variation, since given the same inputs the outputs are the same. So, in [Garibaldi and Ozen, 2007], such vagueness in measurements is investigated by initially perturbing the membership function parameters slightly and then constructing various type-1 fuzzy expert systems, combining the fuzzy output and defuzzification to get a crisp output. They do not utilize type-2 fuzzy sets in their models. Hwang and Rhee [2007] focus on the uncertainty associated with the fuzzifier parameter, i.e., the level of fuzziness parameter denoted by m, that controls the amount of fuzziness of the final fuzzy partition in the fuzzy c-means (FCM) clustering algorithm [Bezdek, 1981a]. To design and manage the uncertainty of the fuzzifier parameter, they extend a pattern set to identify interval type-2 fuzzy sets using two fuzzifiers, which create the boundaries of the footprint of uncertainty (FOU) via the interval-valued fuzzifier. Then, they incorporate this interval type-2 fuzzy set into FCM clustering to observe the effect of managing the uncertainty from the two fuzzifiers. They also provide some solutions to type-reduction and defuzzification (i.e., cluster center updating and hard-partitioning) in the FCM clustering algorithm. This study is another approach to finding the uncertainty interval of membership functions based on FCM clustering parameters, namely the fuzzifier, m. It should be pointed out that there are other parameters in fuzzy systems that should be considered to obtain the unknown uncertainty interval. This chapter presents a new approach that focuses on these topics using the new Improved Fuzzy Clustering methodology. The new type-2 fuzzy system modeling tools are inspired by the latter type-2 fuzzy rule base systems, which identify the uncertainty in fuzzy systems based on fuzzy clustering parameters. Next, we introduce the new Type-2 Improved Fuzzy Functions strategy based on discrete interval valued type-2 fuzzy membership values.
5.4 Discrete Interval Type-2 Improved Fuzzy Functions

The type-2 fuzzy systems that are closest to the type-2 improved fuzzy functions, e.g., [Uncu et al., 2004; Uncu and Turksen, 2007; Hwang and Rhee, 2007], demonstrated better performance than conventional type-1 and type-2 fuzzy systems in terms of modeling error. In these studies, it is argued that the computational complexity is reduced to a degree, i.e., fewer fuzzy operators, an easier type-reduction method, etc., compared to the previous interval type-2 fuzzy systems of [Mendel and Karnik, 1999]. Nevertheless, these approaches are based on fuzzy rule base structures. We replace these fuzzy rule base structures with "Fuzzy Functions" to improve performance and require even fewer fuzzy operations and operators, as explained in Chapter 4. Furthermore, Uncu and Turksen [2007] and Hwang and Rhee [2007] only deal with the uncertainty of system models using
one of the learning parameters of the FCM clustering method, namely the fuzziness constant (m) of the FCM clustering algorithm, leaving out other important sources of uncertainty in fuzzy systems. Since these systems employ Takagi-Sugeno-Kang rule base structures [Takagi and Sugeno, 1995; Sugeno and Kang, 1996], a general function structure is used for each fuzzy rule. Employing different mathematical models for each rule, viz. fuzzy functions, in structure identification, as opposed to building one general structure as in the latter fuzzy rule base systems or in type-1 improved fuzzy functions (T1IFF) strategies, can also potentially identify some of the uncertainties in structure identification. Hence, in this work we propose a new uncertainty-modeling framework for T1IFF systems to identify uncertainties that may arise from two separate concerns:
• selection of the learning parameters of the proposed system,
• determination of the structure of the fuzzy functions.
5.4.1 Background of Type-2 Improved Fuzzy Functions Approaches
During structure identification of the type-1 improved fuzzy functions (T1IFF) system, each data vector is assigned a membership value in each fuzzy set. Improved membership values are obtained through the fuzzy membership functions of the IFC method. We assume interactivity between the input variables (antecedents), so for each rule (cluster) there is only one type-1 membership function, identified via the IFC algorithm (the membership values obtained from the clustering method are not mapped onto each individual variable; only one input membership function is identified to define the fuzzy sets of the antecedents). The challenge of working with T1IFF models is that, since type-1 fuzzy sets express the belongingness of an object, $x_k$, to a fuzzy set A by a crisp membership value, $\mu_A$, they cannot capture the uncertainties in identifying membership functions, nor the uncertainties in determining the structure of fuzzy functions. In order to remedy this problem, we propose to implement type-2 fuzzy sets in fuzzy functions systems. Next, we present the underlying representation of interval membership functions in fuzzy functions systems. Generally, the Fuzzy c-means (FCM) clustering algorithm is employed to identify fuzzy sets during structure identification of recent fuzzy systems based on fuzzy rule bases [Emami et al., 1998; Uncu and Turksen, 2007; Delgado et al., 1997; Hwang and Rhee, 2007] as well as fuzzy systems based on fuzzy functions [Turksen, 2008; Turksen and Celikyilmaz, 2006; Celikyilmaz, 2005]. In Chapter 3, we introduced Improved Fuzzy Clustering (IFC), which can re-shape membership values in order to predict the output variables in local models. One of the common parameters of the FCM clustering and IFC methods is the degree of fuzziness, m, viz. a constant representing the degree of overlap of the identified clusters. The fuzziness parameter has been investigated by researchers because it is believed to have the capability to identify the uncertainty intervals of the membership functions of any
given system [Ozkan and Turksen, 2007]. Nonetheless, there is also conflicting research on the fuzziness of clustering methods. In [Yu et al., 2004], a method for choosing the optimum weighting exponent is investigated around the mid-points of the clusters; they proved that m=2 is not a reasonable heuristic rule for the FCM clustering algorithm. In [Ozkan and Turksen, 2007], the upper and lower bounds of the degree of fuzziness are approximated to be [m-lower=1.4, m-upper=2.6] by investigating the FCM model behavior around the mass center for large m values and around the cluster centers for smaller m values. They suggest that this fuzziness interval represents the uncertainty interval of fuzzy systems. The interval is investigated and identified via a Taylor expansion of the actual membership functions; nonetheless, further experimental analysis is required to prove the performance of this interval in system modeling practice. In a similar study, Celikyilmaz and Turksen [2008b] investigated whether similar boundaries could be identified for different fuzzy clustering methods as well. On top of the uncertainties induced by the determination of the level of fuzziness parameter, uncertainties in fuzzy function systems may also arise from the determination of the structure of the local fuzzy functions. By the structure of the fuzzy functions, we generally mean the different forms of membership value transformations used to identify the local system fuzzy functions. A unique property of the fuzzy function modeling strategy is that membership values, $\mu_i(x)$, i=1,…,c, and their transformations are introduced into the local functions as additional predictors. The way these membership values are used to map the given input dataset to a feature domain depends on the user, and one may use many different variations of membership values, e.g., power transformations, $(\mu_i(x))^p$, p>0, logarithmic transformations, $\ln((1-\mu_i(x))/\mu_i(x))$, etc. These power sets of membership values help to identify the structure of the fuzzy functions. Hence, the optimum membership forms with which to structure the fuzzy function of a cluster are an unknown parameter prior to model execution. In brief, there are uncertainties in identifying membership values, as well as the fuzzy functions that are formed based on these membership value calculation equations. Therefore, in this chapter, in order to identify these uncertainties, we introduce a new architecture with interval valued type-2 fuzzy sets and define the upper and lower boundaries of their interval. Hence, one of the ways we identify the interval valued type-2 fuzzy sets is by analyzing different structures of fuzzy functions. It should be emphasized that fuzzy clustering algorithms are used to generate the interval valued type-2 membership value sets. Then, one defines a membership function via a curve-fitting technique, and this is used symbolically as the membership function to represent the scatter of the membership value set. In this work, we use fuzzy clustering methods, e.g., the Fuzzy C-Means [Bezdek, 1981a] or the Improved Fuzzy Clustering [Celikyilmaz and Turksen, 2007b] algorithm, to obtain membership functions. In the following, when we use the term membership function, we mean the membership value calculation equation of the corresponding fuzzy clustering algorithm. As pointed out in the previous section, in modern practice interval type-2 fuzzy logic systems are the most efficient uncertainty modeling methods because it is easier to conduct type-reduction with them. Secondary membership functions are not required in interval valued type-2 fuzzy functions. The membership values of
interval valued type-2 fuzzy sets have secondary membership values equal to 1, and one tries to identify the upper and lower bounds of these membership functions to form an interval. Thus, interval type-2 fuzzy sets, Ã, map the domain of the base variable onto membership values in the interval [0,1] as follows:

$$\mu_{\tilde{A}}(x): x \to 1/u, \quad u \in \left[ \mu^L_{\tilde{A}}(x), \mu^U_{\tilde{A}}(x) \right] \tag{5.28}$$
In (5.28), $\mu^L_{\tilde{A}}(x)$ and $\mu^U_{\tilde{A}}(x)$ are the lower and upper membership functions, and u is the primary membership value assigned to each datum x. The most common way to fuzzify a given observation in fuzzy system modeling approaches is via a membership function. The shape of the membership function is generally selected a priori, and a surface or a curve is fitted, depending on whether the antecedent is represented by a single multi-dimensional fuzzy set or by an aggregation of fuzzy sets over each input dimension. Hence, the membership of a crisp observation, x′, in a type-1 fuzzy set A, $\mu_A(x')$, is a function of x′, and $p^M$ is the set of parameters that mathematically defines the fitted membership curve. Thus $\mu_A(x')$ is formulated as

$$\mu_A(x') = f(x'; p^M) \tag{5.29}$$
Let A be a linguistic label on the universe of discourse of base variable x with a type-1 membership function characterized by a set of membership function parameters $p^M = (p_1^M, p_2^M, \dots, p_{np}^M)$, where np is the number of parameters used to characterize the membership function and $p_t^M$ is a membership function parameter whose domain is denoted by $P_t^M$, i.e., $p_t^M \in P_t^M$. If we assume that all $p_t^M$ are independent parameters, the domain of the set of membership function parameters, $P^M$, can be defined as the Cartesian product of the domains of its elements, i.e., $P^M = P_1^M \times P_2^M \times \dots \times P_{np}^M$. For instance, the rth specific combination of parameters, $p^{M,r}$, is defined as $p^{M,r} = (p_1^{M,r}, p_2^{M,r}, \dots, p_{np}^{M,r})$. Thus, one can define $P^M$ as $P^M = \{p^{M,r}\}$, r=1,…,nr, where nr is the number of all possible values that $p^{M,r}$ can take. If the number of all possible values that $p_t^M$ can take is denoted by $|P_t^M|$, then $N^M = \prod_t |P_t^M|$, t=1,…,np. The challenge is that if $P_t^M$ has a continuous domain, then there will be infinitely many $p^{M,r}$. Based on this information, one can define the primary, interval valued, and discrete interval valued type-2 membership functions as follows.

Definition 5.1. Primary Membership Function: The type-2 membership function for a given parameter combination $p^{M,r}$ is defined as the rth primary membership function of the interval type-2 fuzzy membership function $\mu_{\tilde{A}}(x)$, and it is denoted by $\mu_A^r(x)$.

Definition 5.2. Interval Type-2 Membership Function: If the primary membership functions of an interval type-2 membership function can be characterized by the set of parameters $p^M = (p_1^M, p_2^M, \dots, p_{np}^M)$ and $P^M$ is a "continuous" domain, then the interval valued type-2 membership function is the collection of primary membership functions determined for all possible values of $p^{M,r} \in P^M$.
Definition 5.3. Discrete Interval Type-2 Membership Function: If the primary membership functions of an interval type-2 membership function can be characterized by the set of parameters $p^M = (p_1^M, p_2^M, \dots, p_{np}^M)$ and $P^M$ is a "discrete" domain, then the interval valued type-2 membership function is the collection of primary membership functions determined for all possible values of $p^{M,r} \in P^M$. Therefore, the membership function of a discrete interval type-2 fuzzy set A can be reformulated as follows:

$$\mu_{\tilde{A}}(x) = \{\mu_A^r(x)\}, \quad r = 1,\dots,nr \tag{5.30}$$

In (5.30), $\mu_A^r(x)$ is the rth primary membership function of a datum x, $\mu_{\tilde{A}}(x)$ is the collection of all the primary membership functions, which define an interval, and nr is the total number of possible values of $p^M$. We explain the idea with an example. Let a Gaussian membership function (5.31) be defined with two parameters, the mean, m, and the standard deviation, σ. Let the mean be a constant and the standard deviation a variable. The primary membership functions of an interval type-2 fuzzy set formed with a Gaussian membership function are then defined as having a fixed mean, m, and an uncertain standard deviation parameter, σ, with $|P_\sigma| = nr$, the number of possible values that the standard deviation parameter may take. The rth primary membership function of the discrete interval valued type-2 fuzzy set A, $\mu_A^r(x)$, for this particular Gaussian membership function is defined as:

$$\mu_A^r(x) = \exp \left( -\frac{1}{2} \left( \frac{x - m}{\sigma^r} \right)^2 \right), \quad \forall \sigma^r \in \{\sigma^1, \dots, \sigma^{nr}\},\ r = 1,\dots,nr \tag{5.31}$$
Since membership values lie within the [0,1] interval, a type-2 membership function for a given datum x will, over all parameter combinations, have a maximum and a minimum membership value, whose parameters define the upper and lower limits of the parameter combination, respectively. In this example, let $\sigma^L$ and $\sigma^U$ represent the lower (L) and upper (U) limits of the standard deviation parameter, which yield the minimum and maximum membership values of a given data vector in fuzzy set A. Then the boundaries of the discrete interval valued type-2 membership function can be defined using the upper and lower values of the uncertain standard deviation parameter, σ, as follows:

$$\overline{\mu}_A^r(x) = \mu(x, mean, \sigma^U), \quad \underline{\mu}_A^r(x) = \mu(x, mean, \sigma^L), \quad r = 1,\dots,nr \tag{5.32}$$

In (5.32), $\overline{\mu}_A^r(x)$ is the upper primary membership function, which uses the upper limit of the uncertain standard deviation parameter, $\sigma^U$, and $\underline{\mu}_A^r(x)$ is the lower primary membership function, which uses the lower limit, $\sigma^L$. Figure 5.7 displays the graph of the membership
function of a given variable, x, i.e., x = {0, 0.01, …, 10}, with mean=5 and uncertain standard deviation σ = {1, 1.01, …, 2}. The upper and lower boundaries of the interval membership function in Figure 5.7 are drawn in bold.

Fig. 5.7 Interval type-2 membership function with fixed mean and uncertain standard deviation. The shaded area contains an infinite number of membership values.
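A short sketch that reproduces the construction behind Figure 5.7 is given below; the grids for x and σ follow the values quoted above, and the plotting calls are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hedged sketch of the discrete interval type-2 Gaussian membership function
# of eqs. (5.31)-(5.32): fixed mean, uncertain standard deviation.
x = np.arange(0, 10.01, 0.01)
sigmas = np.arange(1.0, 2.01, 0.01)          # the uncertain parameter grid
mean = 5.0

# one embedded (primary) membership function per sigma value
primaries = [np.exp(-0.5 * ((x - mean) / s) ** 2) for s in sigmas]

upper = np.max(primaries, axis=0)            # boundary from sigma^U = 2.0
lower = np.min(primaries, axis=0)            # boundary from sigma^L = 1.0

plt.fill_between(x, lower, upper, alpha=0.3) # footprint of uncertainty
plt.plot(x, upper, 'k', x, lower, 'k')       # bold upper/lower boundaries
plt.show()
```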
Each point within the uncertainty interval denotes a membership value obtained from one embedded membership function, which represents a type-1 fuzzy set. In the literature, this interval is called the 'footprint of uncertainty' [Mendel, 2001], as shown in Figure 5.7. One of the aims of the methods proposed in this work is to identify this footprint of uncertainty. The latter representation of discrete interval type-2 fuzzy sets will be the foundation of the Discrete Interval Valued Type-2 Improved Fuzzy Functions (DIT2IFF) approach, to be introduced next. We start with the identification of the uncertainty interval of fuzzy functions based on the parameters of the Fuzzy c-Means (FCM) clustering method; we then move on to the IFC method to represent the uncertainty interval of the improved fuzzy function systems.
Representation of Type-2 Fuzzy Sets Using the FCM Clustering Method

The fuzzy function systems implement the FCM [Bezdek, 1981a] clustering method and then identify local fuzzy functions using the membership values obtained from FCM. The parameters of the membership value calculation equation of the FCM clustering algorithm are: the number of clusters in the system, c; the degree of fuzziness parameter (weighting exponent) m, which defines the degree of overlap between the clusters; and the cluster centers defined for each m, $\upsilon^{x,m} \in X \subset \Re^{nv}$, where nv is the number of input variables in the system. We may represent part of the learning parameters of the type-2 fuzzy functions with the FCM clustering algorithm to build models for Discrete Interval Type-2 Fuzzy Functions (T2FF) strategies in the following way. Let $p^L$ represent the learning parameters of the FCM clustering algorithm, $p^L = \{m, \upsilon^{x,m}, c\}$, where L implies learning; the membership function of the FCM clustering algorithm for cluster i can be defined as follows:
$$\mu_i(x) = f(x, m, \upsilon^{x,m}, c) \tag{5.33}$$

Using the parameters defined in (5.33), the membership value of any kth input vector, $x_k$, k=1,…,n, in the ith cluster is calculated with the membership value calculation equation as follows:

$$\mu_{ik}(x) = \left( \sum_{j=1}^{c} \left( \frac{d \left( x_k, \upsilon_i^{x,m} \right)}{d \left( x_k, \upsilon_j^{x,m} \right)} \right)^{\frac{2}{m-1}} \right)^{-1} \tag{5.34}$$

In (5.34), $d(x_k, \upsilon_i^{x,m})$ is the distance defined as:

$$d \left( x_k, \upsilon_i^{x,m} \right) = \left( x - \upsilon_i^{x,m} \right) A \left( x - \upsilon_i^{x,m} \right)^T \tag{5.35}$$
The norm matrix A in (5.35) is equal to the covariance inverse of the training matrix. Since each input variable is normalized to mean=0 and standard deviation =1, A is the identity matrix, d(xk,vi) is the Euclidean distance. Therefore the membership function of the FCM clustering algorithm is determined by the combination of parameters, i.e., (m,υX,m,c). It should be noted that, based on the FCM clustering iterative algorithm, the cluster centers are calculated as follows:
$$\upsilon_i^{x,m} = \frac{\sum_{k=1}^{n} \left(\mu_{ik}\right)^m x_k}{\sum_{k=1}^{n} \left(\mu_{ik}\right)^m}, \quad \forall i = 1, \ldots, c \qquad (5.36)$$
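A minimal Python sketch of the membership and cluster center calculations in (5.34)–(5.36) is given below, assuming normalized inputs so that the norm matrix A is the identity and the distance is Euclidean; variable names are illustrative, not from the original text.

    import numpy as np

    def fcm_memberships(X, centers, m):
        # Eq. (5.34): reciprocal of summed distance ratios, power 2/(m-1);
        # A = identity, so the distance in Eq. (5.35) is Euclidean.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x c
        d = np.fmax(d, 1e-12)                       # guard against zero distances
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratio.sum(axis=2)              # mu[k, i]

    def fcm_centers(X, mu, m):
        # Eq. (5.36): weighted means with weights mu^m.
        w = mu ** m                                 # n x c
        return (w.T @ X) / w.sum(axis=0)[:, None]   # c x nv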
In the cluster center calculation function in (5.36), 〈m, x, μ_i〉 are the input variables and the cluster centers are the output variables. They are measured values and therefore not parameters of the FCM clustering. Cluster centers are, however, affected by different values of m, since m is one of the input parameters of the cluster center function in (5.36). For the FCM clustering algorithm, one still needs to determine the optimum number of clusters, c*, and the optimum degree of fuzziness, m*, before calculating cluster centers; (*) indicates the optimum value of the corresponding parameter. Here we assume that the optimum number of clusters is identified by a cluster validity measure or by an exhaustive search based on the system's performance measure [Turksen and Celikyilmaz, 2006]. Therefore, c* will be treated as a fixed variable of the FCM clustering algorithm. From the latter discussion, one may conclude that the only uncertain parameter left to identify the uncertainty interval is the degree of fuzziness, m. Thus, the domain of the learning parameters, P^L, of the FCM clustering method is defined over the changing values of the m parameter as

P^L = {c*, υ_{X,r} | m_r}, r = 1, …, nr    (5.37)
where nr is the number of different values of the degree of fuzziness parameter of the FCM. The parameter list P^L in (5.37) indicates that the parameter c* is fixed, the degree of fuzziness parameter, m, is kept uncertain, and υ_{X,r} is measured during the FCM clustering process based on the changing values of m. These
parameters make up a part of the learning parameters of the type-2 fuzzy functions approaches. The effect of choosing the right m value on the cluster centers and membership values can be demonstrated with an example. Here, we used an artificial sparse dataset of 100 data samples including one input variable, x, and a single output variable, y. This dataset was introduced in [Celikyilmaz and Turksen, 2007a] and is depicted in Figure 5.8. The dataset is clustered into 2 clusters, i.e., c* is fixed to c=2, using m_r = {1.2, 1.5, 1.8, 2.0, 2.5, 3.2}, nr = 6, representing six different m values.
Fig. 5.8 Artificial dataset with one input variable (x1) and a single output (y)
For each x vector in the dataset, x = {x1, x2, …, x100}, a membership value is calculated for each cluster i, i=1,2, using the membership value calculation equation of FCM clustering as follows:
$$\mu_{ik}(x) = \left( \sum_{j=1}^{c} \left( \frac{d\left(x_k, \upsilon_i^{x,m_r}\right)}{d\left(x_k, \upsilon_j^{x,m_r}\right)} \right)^{2/(m_r-1)} \right)^{-1}, \quad \forall i=1,2,\; r=1,\ldots,6,\; k=1,\ldots,100 \qquad (5.38)$$
In (5.38), υ_i^{x,m_r} is the cluster center of cluster i calculated using the degree of fuzziness parameter m_r. For each discrete m_r, one can identify one discrete membership function; the membership values of each cluster form scatter clouds, and the membership function is the idealized distribution of these membership values for a particular cluster. Hence, interval type-2 membership values of the FCM clustering algorithm can be constructed by:
$$\tilde{\mu}_i(x) = \left\{ \mu_i^r(x) \right\}, \quad \forall i = 1,2,\; r = 1,\ldots,6 \qquad (5.39)$$
The membership values of a particular cluster for each discrete m_r are shown in Figure 5.9 and Figure 5.10, bounded by the degree of fuzziness interval [m_L, m_U] = [1.2, 3.2]. A natural outcome of this definition is that one can identify upper and lower membership values, μ̄_i^r(x) and μ̲_i^r(x), for each cluster i,
eventually identifying the uncertainty interval of the membership value calculation equation. In Figure 5.10 we observe the changes in the cluster center locations of cluster 1 with respect to changes in the m values. One can thus define an interval of membership values based on the m values, i.e., [m-lower, m-upper]. In this example the m interval is determined by the maximum and minimum of the m_r values.
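As a hedged sketch of the discretization in (5.38)–(5.39), the fragment below clusters the data once per m_r and collects the resulting membership bands; fcm_fit is a hypothetical routine returning FCM cluster centers, and fcm_memberships is the helper sketched earlier.

    # Illustrative only: embedded FCM models, one per discrete m value.
    m_values = [1.2, 1.5, 1.8, 2.0, 2.5, 3.2]            # m_r, nr = 6
    bands = []
    for m in m_values:
        centers = fcm_fit(X, c=2, m=m)                   # assumed FCM routine
        bands.append(fcm_memberships(X, centers, m))     # mu^r(x), Eq. (5.38)

    bands = np.stack(bands)                              # nr x n x c
    upper = bands.max(axis=0)                            # upper membership bound
    lower = bands.min(axis=0)                            # lower bound of the Eq. (5.39) interval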
Fig. 5.9 Membership values for cluster 1 using the artificial dataset determined by changing m values
Fig. 5.10 Changing cluster center locations of cluster 1 based on changing m values. The square dots in bold indicate the cluster centers for changing m values.
Representation of Type-2 Improved Fuzzy Sets Using Improved Fuzzy Clustering
In the previous section, the representation of the uncertainty interval of the membership functions of the standard Fuzzy c-Means (FCM) clustering algorithm was presented based on discretization of the level of fuzziness parameter, m. This section presents the representation of the uncertainty interval when the new improved fuzzy clustering (IFC) method is applied for the determination of fuzzy functions. The IFC method has a unique membership value calculation equation that captures improved membership values to reduce the error of the local fuzzy functions, as follows:
$$\mu_{ik}^{imp} = \left( \sum_{j=1}^{c} \left( \frac{\left(d_{ik}\right)^2 + \left(y_k - h_i(\tau_{ik}, \hat{w}_i)\right)^2}{\left(d_{jk}\right)^2 + \left(y_k - h_j(\tau_{jk}, \hat{w}_j)\right)^2} \right)^{1/(m-1)} \right)^{-1} \qquad (5.40)$$
In (5.40), h_i(τ_i | ŵ_i) is the interim fuzzy function of cluster i, used to approximate an output value, ŷ_k, from the input matrix, τ_i, which is composed of the membership values and/or their user-defined transformations, e.g., τ_i′ = [(μ_i^imp) ((μ_i^imp)^m) (e^{μ_i^imp})]. The ŵ_i are the interim fuzzy function parameters of the membership value calculation equation of the IFC method. For instance, when LSE is used to approximate the interim fuzzy functions, the latter sample fuzzy function structure, τ_i′, in which three different membership value transformations are used to form the interim fuzzy functions (nm=3), takes the form:
$$h_i(\tau_i, \hat{w}_i) = \hat{w}_{0,i} + \hat{w}_{1,i}\,\mu_i^{imp} + \hat{w}_{2,i}\left(\mu_i^{imp}\right)^m + \hat{w}_{3,i}\,e^{\mu_i^{imp}} \qquad (5.41)$$

$$\tau_i \in \mathbb{R}^{n\times(nm+1)} = \begin{bmatrix} 1 & \mu_{i,1}^{imp} & \left(\mu_{i,1}^{imp}\right)^m & e^{\mu_{i,1}^{imp}} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \mu_{i,n}^{imp} & \left(\mu_{i,n}^{imp}\right)^m & e^{\mu_{i,n}^{imp}} \end{bmatrix}, \quad \hat{w}_i \in \mathbb{R}^{(nm+1)\times 1} = \left[\hat{w}_{0,i}\;\; \hat{w}_{1,i}\;\; \hat{w}_{2,i}\;\; \hat{w}_{3,i}\right]^T$$
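A minimal Python sketch of (5.40)–(5.41) follows, assuming the three sample transformations named above and an LSE fit for the interim functions; it illustrates the membership update, not the full IFC iteration.

    import numpy as np

    def interim_matrix(mu_i, m):
        # Eq. (5.41): tau_i = [1, mu, mu^m, exp(mu)] for one cluster.
        return np.column_stack([np.ones_like(mu_i), mu_i, mu_i ** m, np.exp(mu_i)])

    def improved_memberships(d, y, h, m):
        # Eq. (5.40): d[k, i] = distances to cluster centers,
        # h[k, i] = interim fuzzy function outputs h_i(tau_ik, w_i).
        num = np.fmax(d ** 2 + (y[:, None] - h) ** 2, 1e-12)   # n x c
        ratio = (num[:, :, None] / num[:, None, :]) ** (1.0 / (m - 1.0))
        return 1.0 / ratio.sum(axis=2)                          # mu_imp[k, i]

    # One LSE step for the interim function of cluster i (sketch):
    # tau = interim_matrix(mu[:, i], m)
    # w_hat, *_ = np.linalg.lstsq(tau, y, rcond=None)           # interim parameters
    # h[:, i] = tau @ w_hat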
It is to be noted from the new IFC membership value calculation equation in (5.40) that, during optimization, the error between the actual output, y_k, and the estimated output, ŷ_k, obtained from the approximated interim fuzzy functions of each cluster, h_i(⋅), is used as an additional input variable of the new membership value calculation equation of the IFC algorithm. The input variables of these fuzzy functions, h_i(⋅), are membership values and their user-defined transformations. One should decide on the type of membership value transformation used to identify the behavior of the output variable; these eventually constitute the structure of the local fuzzy functions. When the IFC method is used during the structure identification of the local fuzzy functions, an additional parameter is introduced, denoted with τ (tau): an interim matrix that represents the structure of the interim fuzzy functions constructed during the IFC algorithm, as shown in the sample in (5.41). This introduces an additional uncertainty during the extraction of the membership values, since the error of the interim fuzzy functions of each cluster is an input variable of the new membership value calculation equation. During the IFC algorithm, we assume that the interim fuzzy functions h_i(τ_i | ŵ_i) all have the same structure in an experimental trial. Let τ represent the structure of the input interim matrix composed of membership values and their possible transformations, i.e., squared membership values, m-powered membership values, logarithmic transformations of the membership values, etc., which form the structure of the interim fuzzy functions during the IFC algorithm. Since different types of membership values make up the structure of the τ-matrix (in short, τ), and membership values are affected by the degree of fuzziness
variable m, each different structure of τ for each m_r is represented as τ_s ⊂ ℜ^{nif}. Here, nif, the number of interim fuzzy functions, is the number of different types of interim fuzzy functions composed of different membership value forms; r represents each discrete value of the m variable, r=1,…,nr, and s represents each different interim fuzzy function, s=1,…,nif, as shown in Table 5.1. For instance, one interim fuzzy function might use only membership values, while another might use membership values and their logarithmic transformations. One should identify the optimum structure that improves the performance of the model; hence, there is an uncertainty in identifying the interim fuzzy function structures. The domain of learning parameters, P^L, of the new membership value extraction function of the Improved Fuzzy Clustering algorithm can be re-defined as

P^L = {P_m^L, P_τ^L} = {c*, υ_{X,r,s} | m_r, τ_s},  r = 1,…,nr,  s = 1,…,nif    (5.42)

Table 5.1 Improved Fuzzy Functions Parameter Representation

  Parameter       Name                                                          Range
  m_r ∈ P_m^L     Level of fuzziness                                            r = 1,…,nr
  τ_s ∈ P_τ^L     Interim fuzzy function matrix, composed of only membership    s = 1,…,nif
                  values and/or their different transformations
  Φ_ψ ∈ P_Φ^L     (System) fuzzy function matrix, composed of original input    ψ = 1,…,nf
                  variables, membership values and/or their transformations
In (5.42), P_m^L indicates the domain of parameters identified by the level of fuzziness, m, and P_τ^L indicates the domain of parameters identified by the different interim matrix structures. Let nr represent the number of discrete values of the level of fuzziness of the IFC, and nif the number of different structures of interim fuzzy functions. The parameter list P^L in (5.42) indicates that the parameters c* and υ_{X,r,s} are fixed, while two parameters, namely the degree of fuzziness parameter, m_r, and the structure of the input matrix, τ_s, are kept uncertain. If the Cartesian product of the discrete values of these two parameters, 〈m_r, τ_s〉, forms each tth IFC model, then there are t=1,…,nr×nif different IFC models, nr×nif being the total number of embedded models; at this point only two types of parameters are used to describe the interval valued type-2 improved membership values with the IFC method.
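The enumeration of embedded IFC models over the Cartesian product 〈m_r, τ_s〉 in (5.42) is mechanical; the sketch below uses illustrative parameter lists of our own choosing.

    from itertools import product

    m_values = [1.3, 1.7, 2.0, 2.4]                      # nr = 4 discrete fuzzifiers
    tau_structures = ["mu", "mu, mu^m", "mu, exp(mu)"]   # nif = 3 interim structures

    # Each tuple <m_r, tau_s> defines one embedded IFC model, t = 1..nr*nif.
    ifc_models = list(product(m_values, tau_structures))
    assert len(ifc_models) == len(m_values) * len(tau_structures)   # 12 models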
Up to this point, we have presented the representation of the discrete parameters of interval valued type-2 membership functions, due to uncertainties in forming type-1 improved membership functions. There are other uncertainties that one should take into account when building models using type-1 fuzzy function systems. As mentioned earlier, based on the structure of type-1 improved fuzzy functions (T1IFF) systems, local system fuzzy functions are identified for each cluster after the structures of the fuzzy sets are captured. The difference between the "Interim Fuzzy Functions", h_i(τ_i | ŵ_i), and the "System Fuzzy Functions" (in short, the "Fuzzy Functions"), f_i(Φ_i | Ŵ_i), presented in Chapter 4, should be stated at this point. The "Interim Fuzzy Functions" are used to approximate the membership values in the Improved Fuzzy Clustering (IFC) algorithm. The parameters of the interim fuzzy functions of cluster i, ŵ_i, are identified using the matrix, τ_i, which is composed of the membership values and/or their transformations, evaluated at each step of the IFC algorithm, excluding the original input variables. These functions are called interim functions since they are only used for finding the improved membership values. Then, after the improved membership values, μ_i^imp(x), are obtained as a result of the IFC algorithm, the parameters of the "System (Local) Fuzzy Functions" (in short, "Fuzzy Functions"), denoted with Ŵ_i, are approximated for each cluster using the matrix, Φ_i (phi), which is composed of the nv candidate input variables, x={x1,x2,…,x_nv}, the membership values of the corresponding cluster, and their possible user-defined transformations. For instance, a fuzzy function of cluster i, f_i(Φ_i | Ŵ_i), can be identified using a sample input matrix structure, Φ_i′, e.g., Φ_i′ = [(μ_i^imp) ((μ_i^imp)^m) (e^{μ_i^imp}) (x1) (x2) (x3)], where there are nm=3 different membership value transformations and nv=3 different input variables. When LSE is used to approximate the fuzzy functions, f_i(Φ_i | Ŵ_i), using the Φ_i′ matrix, the fuzzy functions, their coefficients, and the input matrix used to identify these parameters are denoted as follows:
$$f_i(\Phi_i, \hat{W}_i) = \hat{W}_{0,i} + \hat{W}_{1,i}\,\mu_i^{imp} + \hat{W}_{2,i}\left(\mu_i^{imp}\right)^m + \hat{W}_{3,i}\,e^{\mu_i^{imp}} + \hat{W}_{4,i}\,x_1 + \hat{W}_{5,i}\,x_2 + \hat{W}_{6,i}\,x_3 \qquad (5.43)$$

$$\Phi'_i \in \mathbb{R}^{n\times(nm+nv+1)} = \begin{bmatrix} 1 & \mu_{i,1}^{imp} & \left(\mu_{i,1}^{imp}\right)^m & e^{\mu_{i,1}^{imp}} & x_{1,1} & x_{1,2} & x_{1,3} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & \mu_{i,n}^{imp} & \left(\mu_{i,n}^{imp}\right)^m & e^{\mu_{i,n}^{imp}} & x_{n,1} & x_{n,2} & x_{n,3} \end{bmatrix} \quad (nm=3,\; nv=3)$$

$$\hat{W}_i \in \mathbb{R}^{(nm+nv+1)\times 1} = \left[\hat{W}_{0,i} \cdots \hat{W}_{nm,i}\;\; \hat{W}_{nm+1,i} \cdots \hat{W}_{nm+nv,i}\right]^T$$
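The construction of Φ′_i in (5.43) and the LSE estimate of Ŵ_i can be sketched as below; the transformation list mirrors the sample in the text and is otherwise an assumption.

    import numpy as np

    def system_matrix(mu_i, X, m):
        # Eq. (5.43): Phi_i = [1, mu, mu^m, exp(mu), x1, ..., x_nv].
        return np.column_stack([np.ones_like(mu_i), mu_i, mu_i ** m, np.exp(mu_i), X])

    def fit_fuzzy_function(mu_i, X, y, m):
        # Least-squares estimate of the (nm + nv + 1) coefficients W_i
        # of one cluster's system fuzzy function.
        Phi = system_matrix(mu_i, X, m)
        W_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return W_hat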
A unique property of the type-1 improved fuzzy functions (T1IFF) strategies is that, in approximating fuzzy functions, the membership values are used to map the input space onto a feature space by serving as additional predictors, and different functions can be defined as a result. Based on the type and number of different forms of membership values, the matrix used to identify the fuzzy functions, Φ_ψ (psi), ψ=1,…,nf, can take on nf different function types, which is an additional parameter to determine when optimizing the model parameters. Note that the interim fuzzy functions and the system fuzzy functions of each cluster of a particular fuzzy functions model may have different membership value transformations, which is a unique property presented in this chapter. Each set of parameters can form one fuzzy function structure for each cluster i, i=1,…,c*, c* being the optimum number of clusters. Consequently, the domain of
the T1IFF can be represented with the learning parameters of the membership functions as well as the structure of the fuzzy functions, as shown in Table 5.1, by:

P^L = {P_m^L, P_τ^L, P_Φ^L} = {c*, υ_{X,r,s} | m_r, τ_s, Φ_ψ},  r = 1,…,nr,  s = 1,…,nif,  ψ = 1,…,nf    (5.44)
Let P_Φ^L be the domain of different local improved fuzzy function structures that one can define, and let q represent a particular improved fuzzy function structure of type-1. Then there would be q = 1,…,nr×nif×(nf)^{c*} different type-1 improved fuzzy functions (T1IFF) models within an interval valued type-2 improved fuzzy functions system. In short, we identify the uncertainties of T1IFF models using the m, τ, and Φ parameters to define the uncertainty interval of membership values and fuzzy functions. These parameters denote, in sequence: each level of fuzziness (m_r), r=1,…,nr; each interim matrix (τ_s), s=1,…,nif, identifying the different interim fuzzy functions; and each input matrix (Φ_ψ), ψ=1,…,nf, defining the (local) improved fuzzy functions, keeping the rest of the parameters constant. In the next section, an example of these parameters of the type-2 fuzzy system modeling strategy will be explained with the help of Figure 5.12.
5.4.2 Discrete Interval Type-2 Improved Fuzzy Functions System (DIT2IFF)

In this section, we present the Discrete Interval Type-2 Improved Fuzzy Functions system (DIT2IFF), which is based on the new Improved Fuzzy Clustering (IFC) method. The Discrete Interval Type-2 Fuzzy Functions (DIT2FF), which implements the standard FCM clustering method, is a variant of DIT2IFF. Here, we will only give the background of the DIT2IFF systems; the difference between the two fuzzy functions strategies, which implement the FCM clustering and IFC methods respectively, is explained in Chapter 4 in detail. Hence, we present a detailed representation of interval valued type-2 improved fuzzy functions in this section. The new DIT2IFF system introduces some unique properties that no other type-2 fuzzy system modeling tool implements. Nonetheless, some of the ideas are inspired by earlier type-2 fuzzy systems, the closest one being the type-2 fuzzy rule base approach, in short DIT2FRB [Uncu et al., 2004a; Uncu and Turksen, 2007], which implements discretization of membership values to represent embedded fuzzy rules. In Table 5.2, some of the major differences are listed to emphasize the uniqueness of the proposed DIT2IFF compared to the DIT2FRB approach. Next, we present the components of the new structure identification and reasoning methods of the proposed type-2 fuzzy systems approach in detail.
I. Structure Identification of the Proposed DIT2IFF

Let m_r be the rth level of fuzziness, m_r ∈ {m_1,…,m_nr}, where nr is the number of discrete m values, e.g., m_r ∈ {1.1, 1.3,…, 3.5}, and let τ_s be the sth interim fuzzy function structure, τ_s ∈ {τ_1,…,τ_nif}, where nif is the number of different interim fuzzy function structures used to build nif different IFC models. Examples of different interim fuzzy functions can be
identified with different interim matrices, denoted with τ_s ∈ {τ_1=[(μ_i)], ..., τ_nif=[(μ_i^imp) ((μ_i^imp)^{p>0}) (ln(1−μ_i^imp)/μ_i^imp)]}, or as shown in (5.41). Let t represent each IFC model. Then one can define t=1,…,(nrif = nr×nif) different embedded improved fuzzy clustering (IFC) models, each represented with a tuple* of 〈c, m_r, τ_s〉, r=1,…,nr, s=1,…,nif. In addition, a list of local fuzzy function structures is represented with Φ_ψ, ψ=1,…,nf. A fuzzy function of a particular cluster is represented with an improved fuzzy function structure, Φ_i^ψ, from within the list of improved fuzzy functions. Hence, in one DIT2IFF model, two different clusters may or may not have the same structures, i.e., Φ_{i=1}^ψ ≠ Φ_{i=2}^ψ is possible. A sample fuzzy function of a cluster i using the LSE approximators is shown in (5.43). Here, ψ represents a single fuzzy function structure, which comprises membership values and their possible user-defined transformations as additional input variables.

Table 5.2 Differences between the DIT2IFF and the Earlier Type-2 Fuzzy System Modeling Approach
Discrete Interval Type-2 Fuzzy Rule Bases approach (DIT2FRB):
• Based on fuzzy rule bases.
• Implements FCM clustering during structure identification.
• For each cluster, a linear hyperplane is approximated.
• The structure of the linear functions is identical in each cluster.
• Membership values are used as the weighing parameter in the defuzzification method during inference.
• Implements a parametric inference method to estimate membership values.
• Identifies the uncertainty interval of the system model based on a user-defined discretized fuzzifier parameter, m.

Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF):
• Based on the fuzzy functions approach.
• Implements FCM clustering or IFC during structure identification.
• Linear or non-linear models can be applied to any cluster.
• Different fuzzy function structures can be used across clusters.
• Membership values are used as additional predictors to approximate local functions, as well as the weighing parameter in finding crisp outputs during inference.
• Implements a semi-parametric inference method to estimate membership values.
• Identifies an uncertainty interval of the system model based on a user-defined discretized fuzzifier parameter, m, different types of interim fuzzy function structures, τ, and local fuzzy functions and their parameters.
* In this work, a tuple represents a finite sequence of different parameters with no particular order. A tuple containing n objects is also represented as an "n-tuple".
Initially, the embedded IFC models form the interval membership values as described in the following.

Step 1: Clustering with IFC
IFC clustering is applied to partition the training data into c clusters for each parameter set 〈c, m_r, τ_s〉. As a result, c cluster centers, represented with υ_i^{c,r,s}(x|y)=(υ_{i,1}^{c,r,s}, υ_{i,2}^{c,r,s},…,υ_{i,nv}^{c,r,s}, υ_{i,nv+1}^{c,r,s}), and membership values, μ_i^{c,r,s}(x|y), are captured in the input-output (x|y) space. The implementation of IFC in T1IFF strategies for a set of variables, 〈c, m_r, τ_s〉, is presented in Chapter 4, Section 4.4. Next, the membership values of each training data vector of the input space, μ_i^{c,r,s}(x), are calculated using the input matrix, x, and the cluster centers, υ_i^{c,r,s}(x)=(υ_{i,1}^{c,r,s}, υ_{i,2}^{c,r,s},…,υ_{i,nv}^{c,r,s}), as input parameters of the membership value calculation equation in (5.40). Figure 5.11 depicts the uncertainty interval of a membership function induced by the changing values of m_r and τ_s. It should be recalled that these membership functions are idealized representations obtained after curve fitting over the scatter diagram of membership values. The bounded interval in Figure 5.11 comprises discrete membership functions (of a particular cluster i), one for each parameter set obtained from the Cartesian product of the discrete values, nr×nif. Hence, for each datum x, e.g., x=7.5, t=1,…,nr×nif different membership values can be defined within this interval. Each one of these t membership values is represented with μ_i^{c,r,s}(x), associated with the tth fuzzy clustering model using tuples of 〈c, m_r, τ_s〉.
Fig. 5.11 Uncertainty interval of a cluster represented with membership functions of 3-tuples, 〈c, m_r, τ_s〉. c is the number of clusters, m_r defines the level of fuzziness, and τ_s represents the interim fuzzy function structures of the Improved Fuzzy Clustering (IFC).
Step 2: Approximation of Improved Fuzzy Functions
Once the membership values are obtained from the IFC method, nf different improved fuzzy functions, each represented with Φ_i^ψ, ψ=1,…,nf=nif, i=1,…,c, are approximated for each cluster using a list of fuzzy function structures represented
with Φ_ψ ∈ {Φ_1,…,Φ_nf}, ψ=1,…,nf, i=1,…,c. It should be noted that, since the same list of membership value transformations is used to identify the interim and local fuzzy functions, the total number of different IFC models and the number of fuzzy functions of each cluster are equal to the Cartesian product of the parameters, i.e., nr×nif = nr×nf. The local fuzzy functions are formed by using different membership value transformations in each local fuzzy function. Here, ψ represents a fuzzy function structure. Thus, for each tth embedded IFC model, viz. each set of 〈c, m_r, τ_s〉 parameters using improved membership values, μ^{r,s}(x), nf different system fuzzy functions, denoted with f_i^{r,s,ψ}, i=1,…,c, are identified for each cluster. It should be recalled that τ_s denotes a different membership value transformation matrix than Φ_ψ: τ_s identifies the interim fuzzy functions of the IFC clustering algorithm, whereas Φ_i^ψ identifies one fuzzy function structure that may or may not pick up the same membership value transformations used in the τ_s matrix as additional input variables along with the original input variables. Therefore, different indices are listed for each of the matrix structures, i.e., the interim matrices, τ_s, and the local input matrices, Φ_ψ. Thus, from the pool of different membership value transformations defined at the beginning of the structure identification method, the IFC model uses the Cartesian product of these membership values to identify different interim fuzzy functions in order to build t different IFC models, and then the same pool of membership value transformations is used to approximate t different fuzzy functions for each cluster. The novelty of fuzzy functions systems is that a different dataset is structured for each cluster i by using the improved membership values, μ^{r,s}(x), and/or their transformations as additional predictors to the original input matrix, as shown in Figure 4.6. It should be emphasized again that the membership value transformations used to construct the Φ_i^ψ matrix need not be the same as those used to construct the τ_s matrix. Hence, each cluster may have a different structure of fuzzy functions. For each tth IFC model denoted with 〈c, m_r, τ_s〉 tuples, one can define different models using different fuzzy function structures in each cluster from a list of fuzzy function structures. Figure 5.12 depicts an example topology for the structure identification of the DIT2IFF system using the following parameter list:
m_r ∈ {m_1, m_2, m_3}, nr=3; τ_s ∈ {τ_1, τ_2}, nif=2; and Φ_ψ ∈ {Φ_1, Φ_2}, nf=2. The parameter m_r can be assigned any real (scalar) value, i.e., 1 < m_r < ∞. The matrix structures τ_s, such as in (5.41), and Φ_ψ, such as in (5.43), can also be formed with any possible transformation of membership values to convert them into a higher dimensional feature space. For this example there are only two different types of fuzzy functions identified, and the interim and local fuzzy functions are approximated using these structures. Each τ_s represents one interim fuzzy function structure to be used to build the tth IFC model, a total of 2 different interim fuzzy function structures; it is composed of the membership values and their transformations only. As a result of fuzzification, {(nr=3)×(nif=2)}=6 separate IFC models are identified, t=1,…,6, where each discrete IFC model is denoted with t.
At this stage, the optimum number of clusters, c*, is identified based on the new cluster validity index or evolutionary algorithms (to be discussed in the next section); therefore c* is a known variable. Each Φ_i^ψ represents one (system) fuzzy function structure to be used to identify the local fuzzy functions of each cluster. Each
Fig. 5.12 A general topology of the DIT2IFF structure identification. The shaded fuzzy functions represent the fuzzy functions of the corresponding cluster. The output value of kth data point, yk, is calculated using each output obtained from these fuzzy functions.
local fuzzy function of each cluster, f_i^{r,s,ψ}, is identified using the matrix Φ_i^ψ, which is interpreted as a mapping of the original input space, ℜ^{nv}, onto a higher dimensional feature space, ℜ^{nv+nm}, viz. x → Φ_i^ψ(x, τ^{r,s}), identified by the user. The parameters of an optimum regression function are sought in this new space. For each tth IFC model, t=1,…,nif×nr (in our example, 6 different IFC models can be built), nf different local fuzzy functions, f_i^{r,s}(Φ_i^ψ), can be approximated for each cluster i, ψ=1,…,nf. If there are c* clusters and nf different fuzzy functions can be identified for each cluster, then there can be p=1,…,(nf)^{c*} different embedded type-1 improved fuzzy function models identified for each tth IFC model. In Figure 5.12, Φ^p = {Φ_1^1, Φ_2^2, …, Φ_{c*}^1} represents the structure of the fuzzy functions of one embedded T1IFF model obtained using the tth IFC model outputs, e.g., membership values and parameters. Each Φ_i^ψ in Φ^p represents the fuzzy function structure denoted with ψ, ψ=1,…,nf, for a cluster i. In the example above, there are two different fuzzy function structures to choose from to identify the interim fuzzy functions as well as the local fuzzy functions of
each cluster. We can summarize the discretization process of structure identification as follows:

• There would be t=1,…,(nr×nif) different IFC models, one for each discrete m_r, r=1,…,nr, and τ_s, s=1,…,nif.

• There would be p=1,…,(nf)^{c*} different embedded T1IFF models that one could identify for any tth IFC model denoted with tuples of 〈m_r,τ_s〉, r=1,…,nr, s=1,…,nif. Each cluster chooses one structure or another to approximate a function; hence, a Cartesian product of models is built. For instance, for cluster 1 in Figure 5.12, the local fuzzy function structure takes on the membership values and transformations identified with the Φ_{i=1}^{ψ=1} matrix, while for cluster 2 the second type of fuzzy function structure is used to construct the input matrix, Φ_{i=2}^{ψ=2}. The fuzzy function structures of all the clusters together identify one typical structure, represented with p, p=1,…,(nf)^{c*}.

• Overall, there can be a total of nif×nr×(nf)^{c*} different embedded T1IFF models, each denoted by q=1,…,nif×nr×(nf)^{c*}, to define the uncertainty interval using type-1 membership values as well as the fuzzy functions, using the tuples 〈c*,m_r,τ_s,Φ^p〉. If, say, c*=4 clusters are identified, then there would be a total of (nif=2)×(nr=3)×(nf=2)^{c*=4}=96 different embedded T1IFF models, q=1,…,96, within the uncertainty interval identified by the 4-tuple representation 〈c*,m_r,τ_s,Φ^p〉, i.e., c* is the number of clusters, m_r is the level of fuzziness parameter, τ_s is the interim fuzzy function structure, and Φ^p is the set of local fuzzy function structures identified separately for each cluster, Φ^p := {Φ_i^ψ}, p=1,…,(nf)^{c*}, ψ=1,…,nf, i=1,…,c*; this count is checked in the snippet below.
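The count of embedded models can be verified directly; the snippet below reproduces the 96-model example.

    nif, nr, nf, c_star = 2, 3, 2, 4
    n_ifc = nr * nif                    # embedded IFC models: 6
    n_t1iff_per_ifc = nf ** c_star      # T1IFF models per IFC model: 16
    print(n_ifc * n_t1iff_per_ifc)      # 96 embedded T1IFF models in total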
During the structure identification of the DIT2IFF system, one fuzzy function, f_i(Φ_i^{r,s,ψ}), is approximated for each cluster i for each set of 〈c*,m_r,τ_s〉 to approximate the output values of each vector in each cluster. Each embedded T1IFF model of a DIT2IFF system may have different function structures in each cluster based on this architecture. It should be noted that in our earlier T1IFF models, each cluster takes on the same function structure. The strength of employing different structures in the clusters of DIT2IFF models will be discussed in the next section. We also use a pre-determined α-cut>0 value to eliminate data points with μ_ik < α-cut that do not affect the decision surfaces of the corresponding clusters. It should be emphasized that, in the new structure identification method, the fuzzy functions can be estimated using any fitting model, either non-complex linear models like least squares or more complex non-linear models like probabilistic support vector machines with radial basis kernel functions. For instance, a fuzzy function, f_i(Φ_i^{r,s,ψ}), of a particular cluster i can be represented with a hyperplane in ℜ^{nm+nv+1} using membership values, their transformations, and the original
input variables, using a linear function as shown in (5.43), or, in a more formal form, ŷ_{i,k}^{r,s,ψ} = f_i^{r,s,ψ} = (Φ_i^{r,s,ψ})^T Ŵ_i^{r,s,ψ}, where Ŵ_i^{r,s,ψ} = [Ŵ_{0,i}^{r,s,ψ} Ŵ_{1,i}^{r,s,ψ} … Ŵ_{(nm+nv),i}^{r,s,ψ}] defines the coefficients of the fuzzy functions. If SVR is used to formulate the fuzzy functions, the function to capture the estimated output is as follows:
$$\hat{y}_{ik}^{r,s,\psi} = f_i^{r,s,\psi}\!\left(\Phi_{ik}^{r,s,\psi}, \alpha_i^{r,s,\psi}, \alpha_i^{*\,r,s,\psi}\right) = \sum_{j=1}^{n}\left(\alpha_{ij}^{r,s,\psi} - \alpha_{ij}^{*\,r,s,\psi}\right) K\!\left(\Phi_{ij}^{r,s,\psi}, \Phi_{i,k}^{r,s,\psi}\right) + b_{i,k}^{r,s,\psi} \qquad (5.45)$$

In (5.45), (α_{ik}^{r,s,ψ}, α_{ik}^{*r,s,ψ}) are the Lagrange multipliers of the SVM model of
cluster i, and K(·) is the kernel function used to transform the Φ_i^{r,s,ψ} matrix of cluster i. As a result, different clusters would have different fuzzy functions to approximate local input-output relations. In different clusters, different fuzzy function models may produce better results based on a pre-defined performance measure. For instance, one specific model can reduce the error better than the others for a specific m value; in another cluster, different fuzzy functions for different fuzziness levels could be preferable. Details of this concept will be presented in the next section, on the advantages of uncertainty modeling. One needs to determine the best fuzzy function structure in each cluster separately. Therefore, we estimate output values for each combination of the embedded T1IFF models and choose the optimum model for each training vector separately. This is achieved by identifying the best parameters for each training sample and retaining them in matrices entitled the m-collection, τ-collection, and Φ-collection tables, which include the optimum parameters of each training vector. These collection tables can be constructed in the following way. Let ŷ_k^q, q=1,…,nif×nr×(nf)^{c*}, be the embedded model output value of the kth data point using different fuzzy functions, f_i^{r,s,ψ}, in each cluster i. The fuzzy function structures of the qth embedded model are represented by Φ^p={Φ_{i=1}^ψ,…,Φ_{i=c*}^ψ}, where Φ_i^ψ represents each local fuzzy function structure, ψ=1,…,nf, and p, p=1,…,(nf)^{c*}, represents one T1IFF model structure identified based on each improved fuzzy clustering model denoted with 〈m_r,τ_s〉 parameters; m_r is the level of fuzziness, r=1,…,nr, τ_s is the IFC interim fuzzy function structure, s=1,…,nif, and c* represents the optimum number of rules. One fuzzy function, f(Φ_i^ψ) or f(Φ_i^{r,s,ψ}), is approximated for each cluster i for each set of 〈c*,m_r,τ_s〉 to approximate the output values of each vector in each cluster. The embedded model output ŷ_k^q is calculated by the fuzzy weighted average method using tuples 〈c*,m_r,τ_s,Φ^p〉 as follows:
$$\hat{y}_k^q = \frac{\sum_{i=1}^{c^*} \mu_{ik}^{r,s}\, y_{ik}^{r,s,\psi}\!\left(\Phi_i^\psi\right)}{\sum_{i=1}^{c^*} \mu_{ik}^{r,s}}, \qquad \Phi^p = \left\{\Phi_{i=1}^{\psi=1}, \Phi_{i=2}^{\psi=2}, \ldots, \Phi_{i=c^*}^{\psi=1}\right\} \qquad (5.46)$$

In (5.46), for each cluster i, a different fuzzy function structure, Φ_i^ψ, can be identified from a pool of ψ=1,…,nf different fuzzy function structures, each one of
which can be defined such as in (5.43), given the list of membership value transformations. y_{ik}^{r,s,ψ} represents the model output of the kth data vector in the ith cluster obtained from these fuzzy functions of each cluster. Hence, the overall fuzzy function structure of an embedded T1IFF model identified for the tth IFC model, represented with tuples 〈m_r,τ_s〉, is represented with the matrix array Φ^p={Φ_{i=1}^ψ,…,Φ_{i=c*}^ψ}, which defines a different matrix for each cluster to identify the fuzzy functions, f_i(Φ_i^{r,s,ψ}). Thus, for each embedded type-1 fuzzy function model, T1IFF, of a DIT2IFF system, represented with q, q=1,…,nif×nr×(nf)^{c*}, one output value for each kth data point, ŷ_k^q, can be obtained, k=1,…,n. One can identify the optimum output for each data point k from among these embedded model outputs, namely the one that gives the least prediction error. Hence, for each data point k in the dataset, we measure the squared error between the actual and predicted outputs obtained from each qth embedded model, each of which is represented with tuples 〈c*,m_r,τ_s,Φ^p〉, Φ^p:={Φ_i^ψ}, p=1,…,(nf)^{c*}, ψ=1,…,nf, i=1,…,c*. Then the optimum parameter set for each data point k is determined with the following equation:
$$\arg\min_{q}\left(y_k-\hat{y}_k^{q}\right)^2 \in \left\{\, q \;\middle|\; \forall q',\; \left(y_k-\hat{y}_k^{q}\right)^2 \le \left(y_k-\hat{y}_k^{q'}\right)^2 \right\} \qquad (5.47)$$
As a result, for each kth training vector, the optimum output, ŷ_k^q, among the q=1,…,nif×nr×(nf)^{c*} different embedded models is identified. The parameters of this optimum model are retained in collection tables, one row for each data point k (k=1,…,n):
$$mCol^{n\times 1} = \begin{bmatrix} m_1 \\ \vdots \\ m_n \end{bmatrix}, \quad \tau Col^{n\times 1} = \begin{bmatrix} \left(\tau_1,\hat{w}_1^*\right) \\ \vdots \\ \left(\tau_n,\hat{w}_n^*\right) \end{bmatrix}, \quad \Phi Col^{n\times c^*} = \begin{bmatrix} \left(\Phi_{1,1},\hat{W}_{1,1}^*\right) & \cdots & \left(\Phi_{c^*,1},\hat{W}_{c^*,1}^*\right) \\ \vdots & \ddots & \vdots \\ \left(\Phi_{1,n},\hat{W}_{1,n}^*\right) & \cdots & \left(\Phi_{c^*,n},\hat{W}_{c^*,n}^*\right) \end{bmatrix} \qquad (5.48)$$
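A hedged sketch of the per-point selection in (5.47) and the tables in (5.48): y_hat[q, k] holds the prediction of the kth training point by the qth embedded model, and params[q] is an assumed record of its tuple 〈c*, m_r, τ_s, Φ^p〉.

    import numpy as np

    def build_collection_tables(y, y_hat, params):
        # Eq. (5.47): for each training point, pick the embedded model
        # with the smallest squared error.
        best_q = np.argmin((y[None, :] - y_hat) ** 2, axis=0)
        # Eq. (5.48): retain the winning parameters, one row per point.
        m_col   = [params[q]["m"]   for q in best_q]
        tau_col = [params[q]["tau"] for q in best_q]
        phi_col = [params[q]["phi"] for q in best_q]   # one structure per cluster
        return m_col, tau_col, phi_col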
In the collection tables, as shown in (5.48), we also retain the interim fuzzy function parameters of the IFC method, ŵ_k^*, and the system local fuzzy function parameters, Ŵ_{i,k}^*, k=1,…,n, i=1,…,c*, to be used later during reasoning. These parameters, {ŵ_k^*, Ŵ_{i,k}^*}, are defined in detail in Chapter 4, during the structure identification of the T1IFF systems. In short, they represent function parameters, e.g., the coefficients of LSE regression functions, or the Lagrange multipliers, support vectors, or the confidence interval of the SVM method, etc. The difference between the T1IFF of Chapter 4 and the DIT2IFF of this chapter is that, in DIT2IFF systems, each cluster can have a different optimum local fuzzy function identified from an uncertain interval of fuzzy function structures. These optimum parameters are determined during the learning algorithm and retained in the collection tables. Then these optimum parameter collections, calculated in the structure identification stage of DIT2IFF, are used to infer the output values of each test case. The cluster center matrix for each 〈m_r,τ_s〉 is retained as well, to be used later to calculate the improved membership values of the testing cases during inference.
Equation (5.49) is a sample collection table structure for a two-input-variable training dataset, i.e., x∈X, x_k={x_{k,j}}, k=1,…,n, j=1,…,nv=2, e.g., x={x1,x2}, where the optimum c*=2.

$$mCol^{n\times 1} = \begin{bmatrix} m_1=1.3 \\ m_2=2.2 \\ \vdots \\ m_n=1.7 \end{bmatrix}$$

$$\tau Col^{n\times 1} = \begin{bmatrix} h_{i,1}(\tau_{i,1},\hat{w}_{i,1}^*) = \hat{w}_{0,i,1}^* + \hat{w}_{1,i,1}^*\,\mu_{i,1}^{m_1=1.3} + \hat{w}_{2,i,1}^*\,e^{\mu_{i,1}^{m_1=1.3}} \in h_1(\tau_1,\hat{w}_1^*) \\ \vdots \\ h_{i,n}(\tau_{i,n},\hat{w}_{i,n}^*) = \hat{w}_{0,i,n}^* + \hat{w}_{1,i,n}^*\,\mu_{i,n} + \hat{w}_{2,i,n}^*\,\mu_{i,n}^{m_n=1.7} + \hat{w}_{3,i,n}^*\,\ln\!\left(1-\mu_{i,n}^{m_n=1.7}\right) \in h_n(\tau_n,\hat{w}_n^*) \end{bmatrix} \;\text{(interim fuzzy function structures)}$$

$$\Phi Col^{n\times 2} = \begin{bmatrix} f_{1,1}(\Phi_{1,1},\hat{W}_{1,1}^*) & f_{2,1}(\Phi_{2,1},\hat{W}_{2,1}^*) \\ \vdots & \vdots \\ f_{1,n}(\Phi_{1,n},\hat{W}_{1,n}^*) & f_{2,n}(\Phi_{2,n},\hat{W}_{2,n}^*) \end{bmatrix} \;\text{(columns: fuzzy function structures of clusters 1 and 2)} \qquad (5.49)$$

e.g., for a two-input dataset where the number of clusters is c*=2:

$$f_{1,1}(\Phi_{1,1},\hat{W}_{1,1}^*) = \hat{W}_{0,1,1}^* + \hat{W}_{1,1,1}^*\,\mu_{1,1}^{m_1=1.3} + \hat{W}_{2,1,1}^*\,x_{1,1} + \hat{W}_{3,1,1}^*\,x_{1,2}$$
$$f_{2,1}(\Phi_{2,1},\hat{W}_{2,1}^*) = \hat{W}_{0,2,1}^* + \hat{W}_{1,2,1}^*\,\mu_{2,1}^{m_1=1.3} + \hat{W}_{2,2,1}^*\,x_{1,1} + \hat{W}_{3,2,1}^*\,x_{1,2}$$
$$f_{1,n}(\Phi_{1,n},\hat{W}_{1,n}^*) = \hat{W}_{0,1,n}^* + \hat{W}_{1,1,n}^*\,\ln\!\left(1-\mu_{1,n}^{m_n=1.7}\right) + \hat{W}_{2,1,n}^*\,x_{n,1} + \hat{W}_{3,1,n}^*\,x_{n,2}$$
$$f_{2,n}(\Phi_{2,n},\hat{W}_{2,n}^*) = \hat{W}_{0,2,n}^* + \hat{W}_{1,2,n}^*\,\mu_{2,n}^{m_n=1.7} + \hat{W}_{2,2,n}^*\,\ln\!\left(1-\mu_{2,n}^{m_n=1.7}\right) + \hat{W}_{3,2,n}^*\,x_{n,1} + \hat{W}_{4,2,n}^*\,x_{n,2}$$
In (5.49), the optimum number of clusters is set to c*=2 and {ŵ_k^*, Ŵ_{i,k}^*} indicate the optimum interim and system local fuzzy function parameters of the kth training vector. Each of these collection tables contains the same number of rows, each row indicating the optimum embedded fuzzy model, T1IFF, identified for the kth data vector in the corresponding row of the training matrix. For instance, the first row of each collection table, which indicates the optimum embedded T1IFF model parameters of the first training vector, is as follows: the fuzziness parameter of improved fuzzy clustering is m*=1.3; the interim fuzzy functions are identified with the interim matrix, τ_{i,k=1}∈τ_{k=1}, i.e., h_{i,1}(τ_{i,1}, ŵ_{i,1}^*) = ŵ_{0,i,1}^* + ŵ_{1,i,1}^* μ_{i,1}^{m_1=1.3} + ŵ_{2,i,1}^* e^{μ_{i,1}^{m_1=1.3}}, which takes on m-powered membership values and exponential transformations; and finally, the two fuzzy functions (there are only two clusters, c*=2, in this model) have the same structure, Φ_{1,1}=Φ_{2,1}, by chance, using the power transformation of the membership values, μ_1^{m_1=1.3}, and the original input variables, e.g.,

$$\Phi_1^p = \{\Phi_{1,1}, \Phi_{2,1}\}: \text{ "2-cluster model"}$$
$$f_{1,1}(\Phi_{1,1},\hat{W}_{1,1}^*) = \hat{W}_{0,1,1}^* + \hat{W}_{1,1,1}^*\,\mu_{1,1}^{m_1=1.3} + \hat{W}_{2,1,1}^*\,x_{1,1} + \hat{W}_{3,1,1}^*\,x_{1,2}$$
$$f_{2,1}(\Phi_{2,1},\hat{W}_{2,1}^*) = \hat{W}_{0,2,1}^* + \hat{W}_{1,2,1}^*\,\mu_{2,1}^{m_1=1.3} + \hat{W}_{2,2,1}^*\,x_{1,1} + \hat{W}_{3,2,1}^*\,x_{1,2} \qquad (5.50)$$
In (5.50), Φ_k^p holds the fuzzy function structure information of every cluster of the embedded T1IFF model picked for the kth training vector; viz. for each training data point one T1IFF model is chosen among the many embedded models within the identified interval of fuzzy functions. Thus each Φ_k^p defines the input matrices of each cluster i in feature space, e.g., Φ_{i,k},
which are identified by different forms of membership values, their transformations, and the function parameters, Ŵ_i, of each cluster. For instance, Φ_{i=1,k=1}, k=1,…,n, i=1,…,c*, represents the optimum matrix structuring the local fuzzy functions of the T1IFF model chosen as the optimum model structure for the first training vector in cluster 1, and Φ_{2,1} represents the matrix identifying the fuzzy functions of the first training vector in cluster 2. It should be recalled that the optimum interim and system local fuzzy function parameters, i.e., {ŵ_k^*, Ŵ_{i,k}^*}, can be identified with simpler regression functions such as LSE, as shown in (5.49). In such cases, these parameters are simply scalar values representing each coefficient, e.g., ŵ_{i,k}^* ∈ ŵ_k^* = {ŵ_{0,i,k}^*, ŵ_{1,i,k}^*,…, ŵ_{nm,i,k}^*}, ŵ_{j,i,k}^* ∈ ℜ, j=1,…,nm, and Ŵ_{i,k}^* = {Ŵ_{0,i,k}^*, Ŵ_{1,i,k}^*,…, Ŵ_{nm,i,k}^*,…, Ŵ_{nm+nv,i,k}^*}, Ŵ_{j,i,k}^* ∈ ℜ, j=1,…,nm+nv+1, k=1,…,n, i=1,…,c*. When non-linear regression methods such as support vector regression are utilized to identify the interim and system fuzzy functions, the parameters {ŵ_k^*, Ŵ_{i,k}^*} are not just scalar coefficients; they represent the optimum Lagrange multipliers, (α_{ik}^*, α_{ik}), the kernel type, K(·), for the non-
linear transformation of vectors, and additional parameters such as the optimum regularization parameter, Creg_k^*, and the optimum epsilon value, ε_k^*. Thus, the values that the collection tables take on depend on the type of fuzzy function approximation method used in the experimental trial. Details of these structures will be presented in the chapter on experiments (Chapter 6). Next, we explain the proposed inference method.
II. Inference Method of the DIT2IFF

The main objective of the proposed method is to find a crisp output for a given input data vector using the DIT2IFF model. During the structure identification step of the algorithm, discrete interval type-2 fuzzy sets are identified. Each
Fig. 5.13 DIT2IFF Inference Framework. Grey boxes depict enhanced/newly introduced operations compared to earlier DIT2FRB approach.
combination of parameters represents a single embedded T1IFF system. During the inference method, we start by reducing the interval-valued type-2 membership values down to a single membership value using a 'case-based type reduction' method. Figure 5.13 depicts the steps of the proposed inference structure.
Step 1: Case-Based Type Reduction

Case-based reasoning (CBR), an Artificial Intelligence (AI) problem-solving paradigm, involves the adaptation of old solutions to meet new demands, the explanation of new solutions using old instances (called cases), and reasoning from precedence to interpret new problems. It has a significant role in today's pattern recognition and data mining applications. In this work, this idea is used to reduce the type-2 fuzzy sets down to type-1 fuzzy sets, namely, the type reduction method. The type reduction method is an extension of the defuzzification procedure: in order to implement type-2 fuzzy systems, we need to compute the centroid of a type-2 fuzzy set. In this work, we only work with interval valued type-2 input fuzzy sets, such as in Figure 5.11. In the structure identification section, we identified discrete parameters that are used to construct a list of type-1 membership values. When all the different sets of parameters are used to draw a membership scatter diagram, they form interval membership functions, i.e., discrete interval valued type-2 fuzzy sets, as shown in Figure 5.11. In standard interval valued type-2 fuzzy logic systems, the first step is to identify the interval membership values. Type-2 operations are then applied on these interval valued type-2 fuzzy sets to construct the aggregation and implication operations. A type reducer is then applied on the output fuzzy set to reduce type-2 to type-1, followed by defuzzification to find a crisp output value, viz. reducing type-1 to type-0. In the new inference method, since the outputs are represented with fuzzy functions rather than output fuzzy sets, a defuzzification method is not required. However, we do need a type reduction to reduce the type-2 fuzzy sets down to type-1. Instead of working with the interval valued type-2 membership values during inference, we apply the new case-based type reduction method right at the beginning of the algorithm and work with type-1 fuzzy sets throughout the inference mechanism. It is this concept that separates the new inference mechanism from earlier type-2 inference methods, which implement fuzzy operators on type-2 fuzzy sets and only then reduce the type of the fuzzy sets down to type-1, requiring more calculations. After this point, since the type is reduced from type-2 to type-1, the computations are much easier than with type-2 fuzzy sets (operations with type-1 fuzzy sets involve crisp values, not intervals). The discrete nature of the fuzzy sets and fuzzy functions enables easy type reduction. The steps of the new inference methodology are sketched in Figure 5.14.
Fig. 5.14 Case-based type reduction for the inference method of the DIT2IFF system. The dark dots (●) represent the optimum membership values and fuzzy functions determined for each testing vector, l, l=1,…,nte.
It should be noted that a unique property of uncertainty modeling with the DIT2IFF strategy is that the uncertainty is identified not only by constructing interval valued membership values, but also by constructing interval valued local fuzzy functions. The optimum fuzzy function of each local model is uncertain; thus we approximate a list of fuzzy functions based on different forms of membership values, to define a footprint of uncertainty for each cluster. Each data point k would have a number of different crisp output values when different fuzzy functions from the fuzzy function interval are used. The aim is to select the optimum local fuzzy function that best fits the given data vector. There could be an unlimited number of functions, but we define only a list of (embedded) fuzzy functions within an uncertainty interval by discretization of the uncertainty interval, to allow easy computations with type-1 fuzzy functions and membership values. This structure is depicted in Figure 5.14. Suppose we are given the test data vectors with unknown output values, x^test, the training input data vectors with known outputs, (x_k, y_k), k=1,…,n, and their corresponding collection tables, {mCol^{n×1}, τCol^{n×1}, ΦCol^{n×c*}}. For each lth testing vector we execute the following. As the first step of the case-based type reduction method, we identify the training vector k′ nearest to the lth testing vector using the Euclidean distance measure as follows:
$$\arg\min_{x_{k'}} d\!\left(x_l^{test}, x\right) \in \left\{\, k' \;\middle|\; \forall x_k,\; d\!\left(x_l^{test}, x_{k'}\right) \le d\!\left(x_l^{test}, x_k\right),\; k,k'=1,\ldots,n \right\} \qquad (5.51)$$
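The search in (5.51) is a plain nearest-neighbor lookup; a minimal sketch:

    import numpy as np

    def nearest_training_index(x_test, X_train):
        # Eq. (5.51): index k' of the training vector closest to x_test.
        return int(np.argmin(np.linalg.norm(X_train - x_test, axis=1)))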
The new type reduction is based on a search method. For each testing datum, the closest training datum, x_k′, is identified based on the Euclidean distance. The optimum parameters, indicated with (*), used to identify the improved membership values and the structure of each fuzzy function in each cluster i, i=1,…,c*, are captured from the collection tables based on k′ as follows:
$$\mu_{il}^{imp}\!\left(x_l^{test}, m_l^*, \tau_l^*\right), \quad m_l^* \in mCol_{k'},\; \tau_l^* \in \tau Col_{k'}$$
$$f_{il}\!\left(m_l^*, \tau_l^* \,\middle|\, \Phi_{il}^*\right), \quad \Phi_{il}^* \in \Phi_i Col_{k'} \qquad (5.52)$$
In (5.52), m_l^* represents the optimum level of fuzziness identified from the m-Col collection table for the lth test vector, x_l^test, based on the closest training vector, x_k′. Similarly, τ_l^* represents the optimum interim fuzzy function structure, identified specifically for x_l^test from the τ-Col table based on x_k′; it holds the information about the membership values, their transformations, and their regression coefficients, e.g., as in equation (5.49). Analogously, Φ_il^* holds the optimum local fuzzy function structures, one for each cluster, i=1,…,c*, identified
specifically for x_l^test using the Φ-Col table based on x_k′; it holds the information about the membership values, their transformations, the input variables, and their regression coefficients, such as in equation (5.49). The optimum parameters for reasoning are thus determined from the optimum parameters of the nearest training vector, x_k′, in the collection tables. These parameters are then used in two separate places of the inference mechanism, as shown in Figure 5.14. The first is when identifying the membership value calculation equation parameters for IFC: from within the uncertainty interval of membership values, we identify the optimum membership values for each testing datum, x_l, based on the optimum 〈m_l^*,τ_l^*〉, in the first step of case-based type reduction. Next, the optimum fuzzy function parameters are identified among the many fuzzy function structures for each data point, l, from the uncertainty interval of fuzzy functions, as shown in Figure 5.14, based on the optimum parameters of the nearest training vector. In other words, the optimum fuzzy function parameters of a testing vector l are equal to the optimum fuzzy function parameters of the closest training vector, and these parameters are obtained from ΦCol^{n×c*}. As a result, for each testing vector, x_l, the optimum parameters of the embedded T1IFF models are identified based on the case-based type reduction method.

Step 2: Fuzzification
At this stage, since the type of the membership values is reduced from type-2 to type-1, we apply a method similar to the inference methodology of T1IFF, as described in ALGORITHM 4.5 in Chapter 4. Here we give a summary of these methods. In (5.52), m_l^* and τ_l^* represent the optimum values of the degree of fuzziness and interim fuzzy function parameters, such as in equation (5.49), of the testing vector l. The membership values of the lth testing vector are calculated using the IFC membership value calculation equation in (5.40), with the optimum parameter tuple 〈c*,m_l^*,τ_l^*〉.
In the first step, the κ training data samples nearest to the lth testing data sample, l=1,…,nte, are identified based on the Euclidean distance measure. Using the interim fuzzy function parameters, {τ_l^*, ŵ_l^*}, the improved membership values of the κ nearest training data samples are calculated using (5.40). From the improved membership values of these κ nearest training data samples, κ different vectors are formed to build the interim matrix of the κ data vectors, based on the membership value transformations identified by τ_l^*. As a result, an interim matrix for each cluster i, τ_i^* = [τ_{i1}^* … τ_{iκ}^*]^T, composed of κ vectors, is formed; its structure is similar to (5.41). This interim matrix is used to estimate the membership values of the corresponding testing sample, l. Thus, using the regression function parameters, ŵ_l^* = {ŵ_{1l}^*,…, ŵ_{il}^*,…, ŵ_{c*l}^*}, and the interim matrix structure, τ_i^*, i=1,…,c*, the interim fuzzy function model output values of each of the κ nearest training samples are calculated using h_i(τ_{iq}^*, ŵ_{il}^*), q=1,…,κ, i=1,…,c*. The next step, i.e., step 4 in ALGORITHM 4.5, is to measure the squared error between the actual and model output values obtained from the interim fuzzy functions of each cluster for these nearest κ data samples, i.e., SE_iq = (y_q − h_i(τ_{iq}^*, ŵ_{il}^*))², q=1,…,κ, i=1,…,c*, to be used to approximate the average SE_i for the lth test data sample. Next, the error values, SE_iq, are weighted with weight constants, η_q, which are the normalized distances of the κ training samples to the testing sample l. The average approximate squared error of the lth testing sample in the ith cluster is calculated as the weighted squared error, mSE_il, which is used in the new membership function, i.e., equation (5.40), to calculate the improved membership values of the testing samples, μ_il. The next step is finding the fuzzy model outputs.
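A hedged sketch of this κ-nearest error approximation follows; h_train is assumed to hold the precomputed interim fuzzy function outputs of the training samples, and the inverse-distance weighting is one plausible reading of the normalized weights η.

    import numpy as np

    def approx_cluster_errors(x_test, X_train, y_train, h_train, kappa=5):
        # h_train[k, i]: interim output h_i(tau_ik, w_i) of training sample k.
        dist = np.linalg.norm(X_train - x_test, axis=1)
        idx = np.argsort(dist)[:kappa]                       # kappa nearest samples
        w = 1.0 / np.fmax(dist[idx], 1e-12)                  # normalized-distance weights
        w /= w.sum()
        se = (y_train[idx, None] - h_train[idx, :]) ** 2     # SE_iq, kappa x c
        return (w[:, None] * se).sum(axis=0)                 # mSE_i, used in Eq. (5.40)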
Step 3: Identification of Fuzzy Model Outputs

The fuzzy model output value is calculated for the lth testing vector, one for each cluster i, i=1,…,c*, using linear functions such as LSE, as in (5.43), or non-linear methods such as SVM, as in (5.45). The optimum parameters of the input matrix, Φ_il^*, are captured from the collection table ΦCol^{n×c*} during the case-based type reduction step of the algorithm. Using these optimum parameters, the fuzzy model outputs are calculated for each testing vector l as follows:
$$\hat{y}_{1,l} = f_1\!\left(\Phi_1^*\right): \; x_l^{test} \rightarrow \Phi_{1l}^*\!\left(x_l^{test}, m_l^*, \tau_l^*\right)$$
$$\vdots$$
$$\hat{y}_{c^*,l} = f_{c^*}\!\left(\Phi_{c^*}^*\right): \; x_l^{test} \rightarrow \Phi_{c^*l}^*\!\left(x_l^{test}, m_l^*, \tau_l^*\right) \qquad (5.53)$$
A single crisp output value for the lth testing vector is obtained by the weighted fuzzy output method:
$$\hat{y}_l = \frac{\sum_{i=1}^{c^*} \mu_{il}\, \hat{y}_{i,l}}{\sum_{i=1}^{c^*} \mu_{il}} \qquad (5.54)$$
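As a quick sketch, the combination in (5.53)–(5.54) amounts to a fuzzy weighted average of the cluster outputs:

    import numpy as np

    def crisp_output(mu_l, y_hat_l):
        # mu_l[i]: improved membership of the test vector in cluster i;
        # y_hat_l[i]: that cluster's fuzzy function output, Eq. (5.53).
        return float(np.sum(mu_l * y_hat_l) / np.sum(mu_l))   # Eq. (5.54)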
In (5.54), μ_il represents the improved membership value calculated for the lth testing vector in cluster i. The new DIT2IFF strategy is presented above for regression problem domains. In this work, we also extended the proposed strategy to classification problem domains, denoted with DIT2IFF-C. For these models, Improved Fuzzy Clustering for classification problems (IFC-C) is implemented; the interim fuzzy functions of IFC-C and the system fuzzy functions are classification-type functions, e.g., logistic regression or support vector classification. The number of parameters and the structure identification and inference steps do not change when classification methods are used. There are many structural differences between the new inference method (Figure 5.13) and the inference methods of earlier approaches, DIT2FRB [Uncu and Turksen, 2007] (Figure 5.6) or [Hwang and Rhee, 2007]. To begin with, the new method is based on the type-2 interval valued fuzzy functions approach, whereas the earlier methods are based on type-2 fuzzy rule bases. The new type-2 inference approach replaces the implication and aggregation of the output fuzzy sets of the earlier DIT2FRB inference method with one single step, as described in Step 3; this is because the new method does not utilize fuzzy rule base (FRB) operations during inference. Type reduction is processed as the first step, and it is based on a search process, as opposed to the earlier type-2 fuzzy systems in [Mendel, 2001], which require complex fuzzy operators on type-2 fuzzy sets. Additionally, the new approach can identify the uncertainties based on a new additional tuple, 〈τ_s,Φ_ψ〉, which can improve the prediction performance of the system model in particular structures: each τ_s, s=1,…,nif, represents the interim matrix composed of membership values and their transformations, obtained at each step of the Improved Fuzzy Clustering (IFC) algorithm in order to identify the improved membership values with the interim fuzzy functions, h_i(τ_i, ŵ_i), and each Φ_i^ψ, ψ=1,…,nf, is the matrix of membership values, their transformations, and the original input variables used to identify the fuzzy functions of each cluster, f_i(Φ_i^ψ, Ŵ_i). No defuzzification operation is required. Next, we will demonstrate the importance of specification, viz. the identification of different parameters for each cluster, in the uncertainty modeling methods introduced by the new type-2 fuzzy system modeling approaches.
5.5 The Advantages of Uncertainty Modeling

In fuzzy set theory, uncertainty is represented by the graded boundaries of sets that contain imprecise characterizations of objects [Turksen, 1996]. Two different uncertainty concepts are found in the fuzzy functions approaches, depending on the structure of the membership functions that define the fuzzy sets and on the structure of the fuzzy functions. Hence, one needs to identify these uncertainties at the clustering level, i.e., different fuzzy function strategies are identified for each cluster, to capture the optimum structures of the local models.
An important problem in knowledge extraction based on clustering algorithms is deciding how to discover hidden structures that can identify different levels of information efficiently. In system modeling problems, there is a need to split the problem into a sequence of smaller, more manageable subtasks to improve efficiency [Pedrycz, 2004]. Fuzzy clustering methods play an important role in this sense, because the local fuzzy systems identified with fuzzy clustering methods determine the overall system model; thus, each fuzzy cluster is regarded as a fuzzy information unit. In this respect, we will demonstrate here how the new uncertainty modeling technique can help to improve information extraction with fuzzy clustering methods.

Interpretations of Hidden Structures in the DIT2IFF Approach
System models that contain different local structures can be identified with different forms of fuzzy functions, Φiψ, in each cluster i=1…c, e.g., equations (5.49) and (5.50). It should be recalled that different fuzzy function structures are constructed for different clusters by utilizing different forms of improved membership values as additional column vectors (attributes) to map the original space onto a higher dimensional feature space. Hence, the proposed interval type-2 inference mechanism of the DIT2IFF enables the employment of different fuzzy function structures (regression/classification models) for each cluster. The output value of each test vector is calculated by using the optimum parameters of the closest training vector from the collection tables, e.g., equations (5.49) and (5.50), as explained in the section "Inference Method of the New DIT2IFF". The latter phenomenon can be explained with a graphical example. In this example, we use 16 different values of the m parameter, mr∈{1.1,…,2.6}, r=1…16, and four different types of fuzzy functions, ψ=1…4, Φiψ∈{Φi1=[x μ*], Φi2=[x μ* (μ*)²], Φi3=[x μ* exp(μ*)], Φi4=[x μ* ln((1−μ*)/μ*)]}, where μ* denotes μ*r,s, to construct different structures in each cluster. Here we assumed that there is only one type of interim fuzzy function, such as in (5.41), to be used in the improved fuzzy clustering models, nif=1, i.e., τs∈{[μ*r,s]}; this time only membership values are used to construct the interim fuzzy functions for Improved Fuzzy Clustering (IFC). We applied the type-1 improved fuzzy functions (T1IFF) method for each tuple 〈mr=1..16, τs=1, Φiψ=1,…,4〉, i.e., a total of q=1,…,(nif=1)×(nr=16)×(nf=4)=64 embedded T1IFF models of each cluster i. Then, for a particular cluster i, we approximate a linear fuzzy function and find the fuzzy output value for the corresponding cluster using (5.53). Using the root mean square error (RMSE) as the performance measure, i.e.,

$$\mathrm{RMSE}_i=\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k^{actual}-\hat{y}_{ik}^{\,q}\right)^2},$$

we estimate one RMSE value for each qth embedded T1IFF model for that particular cluster i. We used linear regression to estimate the fuzzy functions. The graph in Figure 5.15 plots RMSE against the given m values. Each of the four lines indicates one type of fuzzy function, ψ=1…4. Each point is a different embedded T1IFF model output specific to the corresponding cluster. RMSE starts to increase for m values greater than 1.8. For this dataset, lower m values yield
better predictions. The fourth fuzzy function in Figure 5.15, defined as f(u, ln((1−u)/u), x), where u represents the membership values of the corresponding cluster, has the minimum error, which indicates that we can explain the output behavior with linear functions in the local models better than with the rest of the models when we use the non-linear logarithmic form of the membership values.
Fig. 5.15 RMSE values of improved fuzzy functions for a specific cluster i. Each line represents one type of fuzzy function, as denoted in the legend. u represents membership values.
This simple example points out that one can find optimum decision surfaces among a given list of parameters. In a type-2 fuzzy functions system model, using a wide range of different local fuzzy function models for each fuzzy cluster may increase the chances of capturing the optimum function to define its input-output relationship. In particular, different fuzzy function structures may affect the system performance differently in different parts of a given dataset. With the implementation of the DIT2IFF models, these structures can be identified. Then, in the inference method, the optimum fuzzy function structures specific to each data point are identified using a case-based reasoning method. It should also be noted that the earlier type-1 improved fuzzy functions (T1IFF) strategies identify one unique structure for each cluster. This example demonstrates graphically why one may get better results when applying type-2 fuzzy functions strategies instead of type-1 strategies. The challenge of the discrete interval valued type-2 improved fuzzy functions (DIT2IFF) strategy presented in the previous section is that it is tedious to identify the m intervals and the different fuzzy function structures: one must build numerous embedded T1IFF models for different values of these parameters. A sketch of this enumeration is given below.
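The following sketch illustrates the bookkeeping of the embedded-model enumeration just described. It is a minimal illustration under stated assumptions: the callable `run_t1iff` is a hypothetical placeholder (the name and signature are ours, not the authors') standing in for a full T1IFF training/prediction routine; only the 16x4 grid of 〈mr, Φiψ〉 combinations and the RMSE score are shown.

```python
import itertools
import numpy as np

M_VALUES = [round(1.1 + 0.1 * r, 1) for r in range(16)]   # 1.1, ..., 2.6

PSI_TRANSFORMS = {   # the four fuzzy function structures of Figure 5.15
    "f(u,x)":          lambda u: u[:, None],
    "f(u,u^2,x)":      lambda u: np.c_[u, u ** 2],
    "f(u,e^u,x)":      lambda u: np.c_[u, np.exp(u)],
    "f(u,logit(u),x)": lambda u: np.c_[u, np.log((1.0 - u) / u)],
}

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def evaluate_embedded_models(X, y, run_t1iff):
    """One RMSE per embedded model <m_r, tau_s=1, Phi_psi>: 16 x 4 = 64."""
    scores = {}
    for m, (name, transform) in itertools.product(M_VALUES,
                                                  PSI_TRANSFORMS.items()):
        y_hat = run_t1iff(X, y, m, transform)  # cluster-specific predictions
        scores[(m, name)] = rmse(y, y_hat)
    return scores
```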
In the next section, we introduce the evolutionary design of the DIT2IFF approach, which identifies the upper and lower boundaries of the interval membership functions automatically based on the learning parameters and fuzzy function structures. The proposed method of the next section is applied to both regression and classification problem domains. The new evolutionary approach resolves some of the difficulties of the DIT2IFF strategy.
5.6 Discrete Interval Type-2 Improved Fuzzy Functions with Evolutionary Algorithms
5.6.1 Motivation

In the earlier sub-section, we introduced a new interval valued type-2 improved fuzzy functions approach (DIT2IFF), which can identify system uncertainties based on learning parameters. Interval valued membership functions and fuzzy functions are discretized based on human expert knowledge. Each discrete function represents one embedded type-1 fuzzy function structure. To identify the optimum values of these uncertain parameters, we applied an exhaustive search method by defining different degrees of fuzziness values, mr, interim fuzzy function structures, {τs}, s=1,…,nif, e.g., equation (5.41), and (system) fuzzy function structures, Φp={Φ1ψ…Φc*ψ}, p=1…(nf)c*, ψ=1,…,nf, such as in (5.43) or in Figure 5.12, representing different membership values and their transformations to define a pool of function structures for each cluster i, Φiψ. The optimum parameters are determined based on sub-sampling cross validation, to be described in the chapter on experiments. The parameter set 〈c*,mr,τs,Φp〉, Φp={Φi=1ψ,…,Φi=c*ψ}, with the least model error on the validation data is retained as the optimum system parameter set. As its name implies, the exhaustive search tries out every possible set of parameters to identify the optimum one, e.g., equations (5.49) and (5.50). Although one could almost certainly find the global optimum of the objective function with exhaustive search, given a sufficient number of different parameter values, it would take a very long time. There is also the chance of getting stuck in a local minimum early in the prediction process if a sufficient portion of the parameter search space is not explored. Instead, one could use non-linear algorithms to search for the global optimum. Hence, among such search algorithms, a fast and effective stochastic search technique from the family of evolutionary algorithms, namely Genetic Algorithms (GA), is employed to find the optimum parameters of the DIT2IFF models. We try to identify the uncertainty interval of the membership values and the fuzzy functions automatically from the data. The Genetic Learning Process (GLP) identifies the uncertainty interval of membership values and fuzzy functions. Then, this interval is discretized based on different discrete fuzziness values of the fuzzy clustering method, {mr | r=1,…,nr}, and a list of fuzzy function structures, {τs,Φψ | s=1,…,nif, ψ=1,…,nf}. Using the discrete parameters and the interval type-2 membership values and fuzzy function parameters, the DIT2IFF method is applied. Implementing the GLP narrows down the search space of the DIT2IFF
method, where the parameters are assumed to represent the global optimum model. The three-phase model, shown in Figure 5.16, builds an evolutionary fuzzy model that can identify uncertainties automatically from the given data, in contrast to the type-1 improved fuzzy functions (T1IFF) and DIT2IFF methods of this work.
Fig. 5.16 The architecture of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions
In summary, all these new strategies are combined in a three-phase algorithm to build the Evolutionary Discrete Interval Valued Type-2 Improved Fuzzy Functions (EDIT2IFF) approach, which optimizes the parameters with genetic algorithms. The next section presents the architecture of the new system modeling approach.
5.6.2 Architecture of the Evolutionary Type-2 Improved Fuzzy Functions
The new Evolutionary Discrete Interval Valued Type-2 Improved Fuzzy Functions (EDIT2IFF) is an iterative hybrid system in which the structure is built and the parameters are tuned by a Genetic Learning Process (GLP) to determine the size and the structure of the clusters and fuzzy functions, the two fundamental concepts of system identification. The new approach, as depicted in Figure 5.16, comprises three fundamental phases based on the concept of cross validation:

- Phase 1: Determination of the optimum parameters using a genetic learning process: learning from training data, fitness evaluation with validation data, and identification of the uncertainty interval of the type-2 fuzzy sets and fuzzy function structures.
- Phase 2: Building DIT2IFF using the optimum parameters and the uncertainty intervals identified in the first phase.
- Phase 3: Inference on test data using the optimum parameters.
The new Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF) method represents two different sources of uncertainty:

- selection of the degree of fuzziness parameter, m, of Improved Fuzzy Clustering (IFC), by identifying upper and lower values of m,
- resolution of the fuzzy function structures: (1) the interim fuzzy functions, τs, of IFC clustering, which comprise the membership values and their transformations only, and (2) the (system) local fuzzy function structures, Φp={Φ1ψ…Φc*ψ}, p=1…(nf)c*, ψ=1,…,nf, which comprise the original input variables, the membership values, and their transformations, e.g., equation (5.43).
Phase 1: Learning Optimum Parameters with Genetic Algorithms: Genetic Learning Process (GLP)
Genetic Algorithms (GA) are the most widely known type of evolutionary algorithms [Eiben and Smith, 2003]. They are stochastic search techniques based on the principles of natural evolution: selection, genetic recombination, and mutation in a population that comprises potential individuals (solutions). GAs are most commonly used for searching for the optimum parameters of system models. They help to produce different search paths, whilst analyzing random solutions, in order to reduce the likelihood of being trapped in local minima. On top of defining the principal processes of a GA, one should define an initialization procedure and a termination condition to obtain a running algorithm. In this work, a GA is employed to find the optimum system parameters, the boundaries of the interval type-2 fuzzy membership values, and the structure of the fuzzy functions of the DIT2IFF method. The structure of each chromosome encodes the new EDIT2IFF model parameters, which are the parameters of the new IFC algorithm and the fuzzy function structures. Hence, each chromosome represents the parameters of a single DIT2IFF method whose parameters are uncertain. Recall that the DIT2IFF forms an interval that contains embedded type-1 improved fuzzy functions (T1IFF) system models, i.e., interval valued membership values and fuzzy function structures. In order to find the optimum interval that would contain the optimum discrete T1IFF models, we use evolutionary methods. Each chromosome is coded in such a way that a DIT2IFF can be identified with boundary values of its uncertain parameters. Thus, each chromosome contains an upper and a lower value of the level of fuzziness parameter to define the uncertainty interval of the membership values. Similarly, a long list of different fuzzy function structures is identified in order to define an interval for fuzzy function structures, so that the optimum structures can be selected from within this interval, which contains a list of the optimum fuzzy function parameters. In addition to the level of fuzziness and fuzzy function structure parameters, each chromosome has an identifier for the type of regression method, viz. SVM or LSE, used to approximate the fuzzy functions. Depending on the type of regression method used, the chromosome appends additional genes for the regression function parameters. Even though the new EDIT2IFF does not entail fuzzy rule base (FRB) structures, the learning approach resembles the Pittsburgh
approach [Ishibuchi et al, 1999], where each chromosome encodes the parameters of a different type-1 improved fuzzy functions (T1IFF) model.
Fig. 5.17 EDIT2IFF design – Phase 1. The dark dots (●) represent the tokens (each a gene) corresponding to the parameters specific to the function type; e.g., if SVM is chosen, these parameters would include the regularization constant (Creg), epsilon (ε), and kernel type (K(⋅)). Darker colored tokens represent control genes.
Inspired by the biological DNA structure, a hierarchical heterogeneous chromosome formulation of GA [Wang et al, 2005] is implemented as in Figure 5.17, where the genes of the chromosomes are classified into two different types based on their structures: parameter genes, which are real numbers, and control genes, which are binary codes. The simplest coding method, which first comes to mind, is to use a bit representation. Nowadays, it is generally accepted that it is better to encode numerical variables directly as integers or floating-point variables. In our genetic process, we used numerical values in the chromosomes and applied the corresponding genetic operators instead of encoding the values as bit strings. The structure of the genomes differs based on the fuzzy function approximation method (Figure 5.17). When a simple function approximation method, such as LSE, is used to implement the fuzzy functions, the parameter genes are composed of, in sequence, m-lower and m-upper∈[1.01, ∞], the number of clusters, c, the α-cut value, and the indicator that specifies which type of regression method is being used, e.g., "1" for LSE, "2" for SVM, "3" for ridge regression, etc. The control genes represent different forms
of membership functions to be used to structure the local fuzzy functions and the interim fuzzy functions of the IFC method. If a non-linear function estimation method, such as support vector machines, is used to construct non-linear fuzzy functions, three SVM parameters, Creg, epsilon, and kernel type, are used as additional tokens, i.e., genes (cells), in the chromosome. Creg is the regularization parameter that balances the SVM weight vector and the error margin, the two distinct terms to be minimized in the objective function, and epsilon (ε) is the error margin used to flatten the decision surface. Because different types of fuzzy function approximators are used, the gene length is dynamic and changes from one chromosome to another. The rest of the tokens represent different forms of membership values to be used when shaping the fuzzy functions. Among the many possible membership value transformations, in our models we used powers, exponential, sigmoid, and logistic transformations, etc., as additional inputs. The Genetic Learning Process (GLP) automatically determines which forms of membership values would be optimum for the particular system. Note that each chromosome represents the parameters of two separate individual type-1 improved fuzzy functions (T1IFF) models based on its m values, [m-lower, m-upper], each of which has the same parameters that identify the fuzzy function structures. However, each chromosome in the population has different parameters, m boundaries, and fuzzy function structures, e.g., linear or non-linear methods, so that one is able to capture a diverse population behavior. The initial interval valued type-2 fuzzy sets start with a large interval, and the GLP tries to find the optimum m-interval by constructing two separate T1IFF models for each chromosome using each of the m boundaries, i.e., m-lower and m-upper, as shown in Figure 5.18. The crossover and mutation operators are applied to randomly chosen chromosomes of the same type. The top-left graph in Figure 5.18 shows the membership values of a single chromosome. The shaded region of the membership values in the top-left figure indicates the uncertainty interval identified by the m-interval and the different types of interim fuzzy functions, each of which is identified by τs, s=1,…,nif. Each chromosome defines upper and lower boundaries based on the m variant of the IFC method. The shaded area represents any other membership function that can be extracted when we discretize the interval into type-1 membership functions by using any value from the [m-lower, m-upper] interval or any interim fuzzy function structure used in the IFC algorithm while shaping the membership functions. Hence, this figure represents an interval type-2 membership function identified by each chromosome. The algorithm starts with a large interval and searches for the optimum interval based on a performance indicator, e.g., root mean square error, mean absolute error, misclassification percentage, etc., of the T1IFF models. The performance measure is used as the fitness evaluation function of the GLP. The changes in m values and fuzzy function structures identify the uncertainty in shaping the membership values. One possible encoding of such a chromosome is sketched below.
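This sketch is an illustrative encoding only: the field names and layout are ours, chosen to mirror Figure 5.17, not the authors' implementation. Parameter genes hold numerical values; control genes hold binary flags.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chromosome:
    # parameter genes (numerical)
    m_lower: float            # lower degree of fuzziness, in [1.01, inf)
    m_upper: float            # upper degree of fuzziness
    n_clusters: int           # c
    alpha_cut: float          # membership threshold in [0, 1]
    approximator: int         # 1 = LSE, 2 = SVM, 3 = ridge regression, ...
    c_reg: float = 0.0        # SVM regularization constant (SVM only)
    epsilon: float = 0.0      # SVM error margin (SVM only)
    # control genes (binary)
    kernel_rbf: int = 0       # 0 = linear kernel, 1 = Gaussian RBF (SVM only)
    transform_flags: List[int] = field(default_factory=list)
    # one flag per candidate membership-value transformation
    # (power, exponential, sigmoid, logistic/logit, ...); a value of 0
    # means the corresponding transformation is not used in this model.
```

Because the SVM-specific genes only appear when `approximator == 2`, the effective gene length is dynamic, which is the property the text attributes to the heterogeneous chromosome.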
The lower two graphs in Figure 5.18 show the model output values obtained from different fuzzy functions. At the beginning of the GLP, a list of different fuzzy function structures is introduced as candidate structures to be identified by the chromosome structure. The fuzzy function structures optimized by the GLP refer to the optimum list of membership value transformations to be used to structure the interim fuzzy functions of the improved fuzzy clustering method (IFC). In addition,
Fig. 5.18 Genetic Learning Process – Phase 1 of the Evolutionary Interval Type-2 Fuzzy Functions models, identifying (top) interval valued membership functions and (bottom) the interval for fuzzy functions.
these optimum membership value transformations are later combined in every possible way, i.e., the Cartesian product of their values, to obtain the optimum combination for each cluster to identify the local fuzzy function parameters. The list of possible fuzzy function structures identifies some form of an interval of fuzzy function structures. If this interval can be optimized, then there will be fewer fuzzy functions to do reasoning with at the end of the first phase of the algorithm, as shown in the two lower-right graphs in Figure 5.18. In these two graphs we demonstrate the before and after conditions of the interval fuzzy functions. The optimized uncertainty interval will later be
used to identify the optimum DIT2IFF model. These optimized intervals are assumed to encapsulate the optimum model parameters, which are uncertain at this point. Therefore, we provide a wide range of values at the start of the GLP algorithm to enable the algorithm to search for the optimum intervals. The control genes in Figure 5.17 are composed of the kernel type, K(⋅), and the list of fuzzy function structures that identify the interim fuzzy functions and the local fuzzy functions, {τs,Φψ}. The GLP determines the list of membership value forms that have the most predictive capability for the behavior of the output variable within each local model. Hence, later in Phase 2, these membership function forms are used to identify the fuzzy function structure uncertainties. This step tries to capture the optimum interval by identifying which membership function would best suit the DIT2IFF models, reducing the list of possible values. The bounded uncertainty interval of membership functions, shown in the top graphs of Figure 5.18, indicates two separate T1IFF models based on the m-lower and m-upper parameters. These two membership functions have the same fuzzy function structures. Any membership function that could be extracted after discretization of this interval would indicate another membership function of an embedded T1IFF model. Different output values obtained from fuzzy functions of different structures are displayed in the lower graphs of Figure 5.18. The parameters that denote the shape of these functions are determined by the darker colored tokens of the chromosomes in Figure 5.17. In addition, the first token of this second part of the chromosome, the fuzzy function structure tokens, represents the kernel type parameter that defines the fuzzy function type. We used two separate kernel types when SVM is used to form the fuzzy functions: linear, K(xk,xj)=xkTxj, or the non-linear Gaussian radial basis function (RBF) kernel, K(xk,xj)=exp(−δ‖xk−xj‖²), δ>0. We tried different sets of models for linear LSE and non-linear SVM to measure their performance differences based on error reduction. The value 0 in a control gene token indicates that the corresponding membership value transformation will not be used in the model. The length of the fuzzy function structures, nm, viz. the collection of membership value forms used to shape the fuzzy functions, is determined prior to chromosome formation. As mentioned earlier, the length of the genes may differ based on the fuzzy function approximator type; therefore the genetic structure has a dynamic length. The chromosome in Figure 5.17 (bottom) is an example extracted from an iteration t of the GLP when support vector regression is used to approximate the fuzzy functions. The chromosome structure indicates that the m interval is [1.45, 1.75], the number of fuzzy functions is 3, the alpha-cut value is 0.1, and type=2, which means that SVM will be used to approximate the fuzzy functions. In turn, Creg=54.4 and ε=0.115, the kernel function indicated by the first control gene implies that a non-linear RBF kernel will be implemented (=1), and only the exponential transformation of the membership values will be used to shape the system fuzzy function parameters. The alpha-cut indicates that, in each cluster, the interim and local input matrices will be determined based on the constraint μi(x)>α-cut, e.g., equation (4.1), so that data vectors with improved membership values less than 0.1 will be discarded when approximating the interim and local fuzzy functions. A minimal sketch of this filtering step follows.
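As a small illustration of the alpha-cut constraint just described, the following hedged sketch filters the training vectors of one cluster before its interim and local fuzzy functions are approximated; the function name is ours.

```python
import numpy as np

def alpha_cut_filter(X, y, mu_i, alpha_cut=0.1):
    """Keep only the vectors whose improved membership in cluster i
    exceeds the alpha-cut, i.e., the constraint mu_i(x) > alpha-cut.

    X: (n, d) inputs, y: (n,) outputs, mu_i: (n,) improved memberships."""
    keep = mu_i > alpha_cut
    return X[keep], y[keep], mu_i[keep]
```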
The purpose of the new genetic learning process of the EDIT2IFF is to find the optimum m interval and the optimum list of membership value transformations to structure the fuzzy functions, i.e., {τs,Φψ}, such as in equations (5.41) and (5.43), where s and ψ represent different fuzzy function structures, together with the parameters and type of the fuzzy functions, e.g., Creg, ε, K(⋅), so that the optimum model can be captured. The strength of this approach is that each individual in the population can construct different structures, e.g., linear or non-linear, based on the fuzzy function approximation type. The algorithm determines the optimum structure through a probabilistic search and decides which type of function is better for a particular model. The initial population is randomly generated. The fitness function is defined based on the combined performance of the two type-1 improved fuzzy function (T1IFF) models, one for each m-bound, on the validation dataset (as shown in Figure 5.16). It is calculated with the defined performance indicator (PI), where the global minimum value of the PI is searched for selection purposes. Then, the surviving individual is selected using the fitness function:

$$PI_{pop} = PI_{pop}^{m\text{-}upper} + PI_{pop}^{m\text{-}lower} \qquad (5.55)$$
Here pop=1…population-size. The algorithm searches for the optimum model parameters and the m-bound such that the two T1IFF models constructed with the upper and lower boundaries of the degree of fuzziness variable have the minimum error. The algorithm starts with a large m-bound and gradually shifts to where PIpop is minimized. Different genetic operators are utilized for the parameter and control genes since they are real and binary numbers, respectively. For parameter genes, we used arithmetic and simple crossover and non-uniform and uniform mutation operators. For control genes, simple crossover and shift mutation operators are utilized. Tournament selection is used for population selection. An elitist strategy is employed to ensure that the fitness function decreases monotonically: the best candidate solution in each generation enters the next generation directly. The definitions of the genetic operators are described in Appendix C.3. The genetic learning process of the new EDIT2IFF, shown as Phase 1 in Figure 5.16, is displayed in Table 5.3. In Table 5.3, each chromosome in the gene pool, i.e., each individual model, is denoted chrpop, pop=1…total number of chromosomes; the parameters of the DIT2IFF models being optimized therefore take on this subscript. In sequence, for each chromosome pop, m-lowerpop and m-upperpop represent the lower and upper values of the fuzziness parameter, e.g., m-lowerpop=1.1 and m-upperpop=3.5, cpop represents the number of clusters, α-cutpop represents the alpha-cut ∈[0,1] used to eliminate anomalies of the membership values, and typepop represents the type of the function approximation method, e.g., LSE, SVM, etc.
Table 5.3 The steps of the Genetic Learning Process of EDIT2IFF

GA initializes chromosomes to form the initial population (g=0).
For each g = 1 … max-number-iterations {
    chrpop: the popth chromosome in the population, with parameters typepop, mpop-lower, mpop-upper, cpop, α-cutpop, Cregpop, εpop, Kernel-typepop {K(⋅)}, and the list of membership value transformations that would be optimum to construct the interim matrix, e.g., {τspop}, to identify the interim fuzzy functions, and the system input matrix, e.g., {Φppop={Φ1ψ,…,Φc*ψ}}, to identify the system fuzzy functions of each cluster.
    if chrpop has not been used in past iterations {
        - Compute Improved Fuzzy Clustering with the parameters from chrpop using the training data.
        - Approximate fuzzy functions, fipop(x, Φiψ), for each cluster i=1…cpop using the chrpop parameters.
        - Find the improved membership values of the validation data and infer their output values using each fuzzy function, fipop(x, Φiψ).
        - Measure the fitness value based on PIpop on the validation data.
    }
    GA generates the next population (g+1) by means of crossover and mutation operations.
    Next generation (g = g+1)
}
End
Additional function parameters, specific to the type of function approximation method used, identify the fuzzy function parameters, e.g., Cregpop, the regularization parameter, εpop, the error margin, and Kernel-typepop {K(⋅)}, representing the non-linear transformation of the dataset. The rest of the parameters of the chromosome are control genes, which only take on 0 or 1 to identify the types of membership value transformation used to identify the fuzzy functions, i.e., interim or system. Each chromosome includes the same list of membership value transformations, which usually contains a long list of different transformations so that the optimum can be identified from within it. Thus, the optimum interim matrix, {τ*pop}, used to identify the optimum interim fuzzy function parameters, ŵi*, i=1,…,c, which comprises the membership values and their transformations, as well as the system input matrix of each cluster i, i=1,…,c*, Φ*pop={Φ1ψ,…,Φc*ψ}, used to identify the local fuzzy function parameters, is identified from this list of possible transformations of the membership values. For example, a sample chromosome is shown at the bottom of Figure 5.17. If this were one of the optimum chromosomes with the best fitness value, then the optimum fuzzy functions would be identified from an optimized pool of membership value transformations defined with only the exponential transformation formula; i.e., it is the only token that takes on the value 1, which indicates that this particular membership value transformation should be used as an additional input to identify the interim fuzzy functions and the system fuzzy functions. If additional transformations were identified as '1', then in the analysis one would use every combination of the membership value transformations to identify as many different fuzzy functions as possible, defining an uncertainty interval of embedded type-1 improved fuzzy functions (T1IFF) models. The overall loop is sketched below.
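The following sketch compresses the loop of Table 5.3 and the fitness rule of equation (5.55) into code. It is a structural illustration only: `build_t1iff`, `tournament_select`, and `crossover_and_mutate` are hypothetical placeholders for the full T1IFF construction and the genetic operators of Appendix C.3.

```python
def fitness(chrom, train, validation, build_t1iff):
    # Equation (5.55): PI_pop = PI_pop^{m-upper} + PI_pop^{m-lower}.
    # The two T1IFF models share all genes except the fuzziness bound used.
    pi_lower = build_t1iff(chrom, m=chrom.m_lower,
                           train=train, validation=validation)
    pi_upper = build_t1iff(chrom, m=chrom.m_upper,
                           train=train, validation=validation)
    return pi_lower + pi_upper          # smaller PI is better

def glp(population, train, validation, build_t1iff,
        tournament_select, crossover_and_mutate, max_generations=100):
    for _ in range(max_generations):
        ranked = sorted(population,
                        key=lambda c: fitness(c, train, validation, build_t1iff))
        elite = ranked[0]                       # elitist strategy
        parents = tournament_select(ranked)     # tournament selection
        children = crossover_and_mutate(parents)
        population = [elite] + children[:len(population) - 1]
    return min(population,
               key=lambda c: fitness(c, train, validation, build_t1iff))
```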
Fig. 5.19 Decision surfaces obtained from the GLP for changing fuzzy function structures of chromosomes: (a) fuzzy function surface f(x,eμ), SVM with linear kernel (kernel token=0); (b) fuzzy function surface f(x,eμ), SVM with Gaussian kernel (kernel token=1). K(⋅)={linear or non-linear} and (mlow,mup,Creg,ε)={1.75,2.00,54.5,0.115}, c*=3. uclusi represents the improved membership values of the corresponding cluster i.
The purpose of the genetic learning process (GLP) is to identify the uncertainty interval of the type-2 fuzzy membership values and the list of possible structures of the fuzzy functions (as shown in Figure 5.18). The algorithm tries to find the optimum forms of membership values to construct the input matrices, {τ*,Φ*}, that identify the interim and system fuzzy functions, e.g., equations (5.41) and (5.43), together with the parameters and structure of the fuzzy functions (Creg, ε, K(⋅)), such that the estimated output of the optimum model is as close as possible to the identified system. In Figure 5.19, three different decision surfaces of a single input, x, single output, y, Z={x,y}, 100 data point, 3-clustered non-linear artificial dataset are shown. The upper and lower graphs are two models identified by two different chromosomes. In this setting, the only difference between the chromosomes of the two models shown in the upper and lower graphs is the kernel type token (token #7 in Figure 5.20), which determines the non-linearity of the fuzzy functions. In this sample, the interim matrix, τ, constructed to identify the interim fuzzy functions in the Improved Fuzzy Clustering (IFC) algorithm is defined by the exponential transformation of the membership values, eμ, as shown in Figure 5.20, as follows:

$$h_i(\tau_i,\hat{w}_i)=\hat{w}_{0,i}+\hat{w}_{1,i}\,e^{\mu_i^{imp}};\qquad
\tau_i\in\mathbb{R}^{n\times 2}=\begin{bmatrix}1 & e^{\mu_{i,1}^{imp}}\\ \vdots & \vdots\\ 1 & e^{\mu_{i,n}^{imp}}\end{bmatrix},\quad
\hat{w}_i\in\mathbb{R}^{2\times 1}=[\hat{w}_{0,i}\;\;\hat{w}_{1,i}]^{T} \qquad (5.56)$$
Here 'imp' indicates that the membership values are calculated with the IFC method. In addition, the input matrices, Φ={Φ11,Φ21,Φ31}, that formulate the system fuzzy functions of the corresponding clusters are identified using only the exponential transformation of the corresponding improved membership values, eμ, as additional parameters to the original input variables, the same for each cluster, as follows:
$$\hat{y}_{i}^{r,s,\psi}=f_i\!\left(\Phi_i^{\psi},\hat{W}_i\right)=\hat{W}_{0,i}+\hat{W}_{1,i}\,e^{\mu_i^{imp}}+\hat{W}_{2,i}\,x;\qquad
\Phi_i^{\psi}\in\mathbb{R}^{n\times 3}=\begin{bmatrix}1 & e^{\mu_{i,1}^{imp}} & x_1\\ \vdots & \vdots & \vdots\\ 1 & e^{\mu_{i,n}^{imp}} & x_n\end{bmatrix},\quad
\hat{W}_i\in\mathbb{R}^{3\times 1}=[\hat{W}_{0,i}\;\hat{W}_{1,i}\;\hat{W}_{2,i}]^{T} \qquad (5.57)$$
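To make the construction of the matrices in (5.56) and (5.57) concrete, the sketch below builds them for the exponential transformation and fits the coefficient vectors with ordinary least squares. This is a minimal illustration of the LSE case only, with function names of our own choosing; mu_imp holds the improved membership values of one cluster and x the single original input variable of this example.

```python
import numpy as np

def interim_matrix(mu_imp):
    """tau_i in (5.56): columns [1, e^mu]."""
    return np.c_[np.ones_like(mu_imp), np.exp(mu_imp)]       # (n, 2)

def system_matrix(mu_imp, x):
    """Phi_i in (5.57): columns [1, e^mu, x]."""
    return np.c_[np.ones_like(mu_imp), np.exp(mu_imp), x]    # (n, 3)

def lse_fit(A, y):
    """w-hat = argmin_w ||A w - y||^2 via ordinary least squares."""
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w_hat

# Usage for one cluster i:
#   w_hat = lse_fit(interim_matrix(mu_i), y)     # interim function h_i
#   W_hat = lse_fit(system_matrix(mu_i, x), y)   # system function f_i
```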
The upper and lower graphs of Figure 5.19 demonstrate two embedded T1IFF model decision surfaces obtained from the two different chromosomes of Figure 5.20, which are obtained from the first step of the re-shaping process of the membership values and fuzzy
functions of the proposed 3-phase EDIT2IFF. Each embedded T1IFF model obtained using the parameters denoted by the two different chromosomes has the same parameters except the kernel type, which determines the linear or non-linear property of the SVM when SVM is the model approximation function chosen by the genetic algorithm.
Fig. 5.20 Two different chromosomes from the GLP algorithm of the EDIT2IFF modeling approach applied to the artificial dataset. The dark colored token is the only difference between the two chromosomes. '1': Linear Kernel Model, '2': Non-Linear Kernel Model.
The top embedded model in Figure 5.19 uses the linear model, i.e., the 'Kernel Type' token of its chromosome is equal to 1, and the bottom figure is formed with a non-linear model, i.e., K(⋅)='2'. Furthermore, each embedded model has three cluster structures, which identify three different fuzzy functions, one for each of the three clusters. It should be noted that each of the three graphs in the top figure (as well as the bottom figure) contains two decision surfaces, which are formed using two different m values, i.e., m-lower and m-upper, identified by the corresponding chromosome (the first two tokens). If any other m′ value, m-lower<m′<m-upper, were used from within this m-bound, [m-lower, m-upper], then the decision surface obtained with this m′ parameter would lie between the two decision surfaces obtained using m-lower and m-upper, as shown in Figure 5.19, provided the rest of the parameters indicated by the corresponding chromosome are kept intact. In Phase 1, the GLP is employed based on T1IFF modeling to optimize the parameters. For each cluster, a different fuzzy decision surface is approximated based on the parameter and control genes of the chromosome structure, using the corresponding cluster's membership values as inputs. The GLP searches for the best fuzzy decision surfaces based on the parameters represented by each chromosome. The gap between the surfaces represents the uncertainty interval that the GLP tries to minimize based on the chromosome structures.

Phase 2: Structure Identification with Discrete Interval Type-2 IFF (DIT2IFF)
In Phase 1 of EDIT2IFF, the GLP captures the uncertainty interval by identifying:

- an optimum m interval, [m-low*, m-up*],
- a list of optimum membership value transformations, e.g., {(μ)^p, (e^μ)^p, ln((1−μ)/μ), …}, p>0, which are to be used to form different combinations of fuzzy function structures,
- the optimum list of function parameters, viz. any other fuzzy regression function parameters, e.g., Creg*, ε*, K(⋅), necessary for model execution.
These parameters are represented by the resulting chromosome with the best fitness, i.e., the minimum error. Using the identified uncertainty interval of the membership functions and the optimum values of the particular parameters, the new evolutionary method implements the discrete type-2 fuzzy functions method to do reasoning. Therefore, in this step, the uncertainty interval identified in the previous step, induced by the change in the degree of fuzziness and in the fuzzy function structures identified with the list of optimum membership value transformations, is discretized to find as many embedded T1IFF models as feasible. Here we apply the DIT2IFF; however, this time the uncertainty interval has been shifted towards where the optimum model parameters should reside. The DIT2IFF models of the previous section apply an exhaustive search method to identify the uncertainty interval and optimum parameters by starting with a large uncertainty interval of parameter values and a long list of fuzzy function structures, which takes a long time to converge. With the new approach, after Phase 1 we have a preconception about the boundaries of the parameters, in terms of where their optimum values should be searched for. The new approach has a unique property that we should emphasize once more in this step. Previous type-2 fuzzy logic systems [Mendel et al., 2006; Uncu et al, 2004a; Uncu and Turksen, 2007] construct a general fuzzy function structure for a system model and use this structure to construct each rule. A model that is represented with different fuzzy function structures (characteristics) for each cluster (rule) has a better chance of identifying the optimum model than a method which builds a model with a single function structure. In order to capture uncertainty, in the new model we utilize the best fuzzy function structures and fuzziness values at the cluster level, given that the cluster center representatives are kept the same. Hence, the algorithm captures the best local function structures based on the training and validation datasets and preserves them in a matrix (collection table) to be used in inference, e.g., equations (5.49) and (5.50). This way, the system model can employ different local fuzzy models. Identifying a list of best models, in other words an interval of possible solutions, instead of one particular optimum model, may increase the ability of the models to capture structure uncertainties in the T1IFF system. As a result of the genetic learning process in Phase 1, we obtain the optimum learning parameters, PL*={[m-lower*, m-upper*], c*, Creg*, ε*, K(⋅), the list of optimum membership value transformations}, from the surviving chromosome with the best PI value. It should be noted from PL* that the optimum fuzziness parameter, m, is defined as an interval, [m-lower*, m-upper*], which is used as an input of the DIT2IFF method in Phase 2 of the EDIT2IFF strategy. In addition, the optimum list of membership value transformations identified by the GLP is later used to identify the optimum interim fuzzy function parameters by constructing as many different
interim matrices, τs, s=1,…,nif, to build nif different IFC models, and as many local input matrices for each cluster, Φp={Φ1ψ,…,Φc*ψ}, ψ=1,…,nf, p=1…(nf)c*. Thus, the list of optimum membership value transformations is identified by the GLP to define the optimum uncertainty interval of the membership values and fuzzy functions. The rest of the parameters in PL* are crisp optimum values. In Phase 2, the optimum parameters denoted by PL* are used to build a DIT2IFF system model. One of the parameters used to define the uncertainty intervals, the interval identified by the optimum upper and lower values of the fuzziness variable, [m-lower*, m-upper*], is converted into a list of embedded membership values using {mr}, r=1,…,nr. In addition, all possible combinations of the list of optimum membership value transformations identified in the first phase of the algorithm are used to form the list of optimum fuzzy function structures for building the embedded models. For this discrete parameter set, we construct (1) an interim matrix, τs, to identify the interim fuzzy functions for the IFC clustering, which comprises the membership values and their transformations only, and (2) the local input matrix to identify the (system) local fuzzy function structures, Φp, which comprises the original input variables, the membership values, and their transformations. One can define different matrix structures to define the fuzzy functions for each cluster, Φiψ, using the list of possible fuzzy function structures. Then, the optimum local fuzzy function structures of the optimum DIT2IFF models are represented with Φp={Φ1ψ,…,Φc*ψ}, ψ=1,…,nf, p=1…(nf)c*, one for each cluster, i=1…c*, where Φiψ represents one form of fuzzy function structure used to identify the local fuzzy function of the ith cluster. Examples of different types of fuzzy functions are given in equations (5.49) and (5.50). For each discrete value of these parameters, 〈mr,τs,Φp〉, one embedded T1IFF model is constructed. Since some of the parameters are already optimized and the uncertain intervals of some parameters are reduced, there will be fewer discrete embedded T1IFF models during EDIT2IFF structure identification compared to the previous DIT2IFF models based on the exhaustive search method, in which one has to search all the combinations to find the optimum ones. In EDIT2IFF, the initial GLP step helps to eliminate unimportant values of some of the parameters. It should be pointed out that the second phase of EDIT2IFF is the same as the DIT2IFF strategy, except that the parameters are pre-determined in the first phase of the EDIT2IFF strategy. Figure 5.21 summarizes the identification of the optimum uncertainty interval of the membership values over the three phases of the EDIT2IFF model. The interval identified at the beginning of the algorithm is reduced in the first phase of EDIT2IFF; then, in the second phase of the algorithm, this interval is discretized, i.e., converted into a list of discrete membership values, to find the embedded type-1 improved membership values, i.e., the scatter clouds in the upper right graph. The gray area in the upper left graph indicates the many membership values that could be defined using the parameters, i.e., the discretized degree of fuzziness m-interval, determined by the upper and lower fuzziness degrees, and the list of different membership value transformations that identify the interim fuzzy functions.
The algorithm narrows down this uncertainty interval in the initial step and shifts it to where the optimum interval is expected to lie. In the second phase, the identified optimum values of the parameters are discretized to form an optimum interval for the membership values, as shown in the top right graph in Figure 5.21. In the
magnified view on the bottom graph, the interval valued membership values for a given data point, x′, are shown; they are identified with t=1,…,(nr×nif) discrete membership values in each cluster, each denoted μi(x′,c*,mr,τs), i=1,…,c*, r=1,…,nr, s=1,…,nif.
Fig. 5.21 Interval valued membership values of the evolutionary type-2 fuzzy functions strategy. The uncertainty interval is represented by the membership value dispersion induced by each tuple 〈mr,τs〉: (a) optimized uncertainty interval from the GLP, Phase 1 of EDIT2IFF; (b) discrete improved membership values, Phase 2 of EDIT2IFF; (c) magnified view of the different discrete membership values for any x′.
The identification of the optimum uncertainty interval of the fuzzy functions, shown in Figure 5.22, proceeds analogously to the uncertainty identification of the membership values explained above. At the start of the GLP process, a wide list of possible fuzzy function structures is introduced to the system. The genetic algorithm identifies the optimum fuzzy function structures by identifying the optimum forms of membership value transformations. Hence, this corresponds to identifying the optimum list of fuzzy function structures, viz., reducing the uncertainty interval of the fuzzy functions down to where the optimum values can be found, as shown in Figure 5.22.
This can mean identifying an optimum list of membership value transformations containing anywhere from a single transformation to all of the transformations. Each possible combination of membership values identifies a different fuzzy function structure, from which a different estimated output value can be extracted. Thus, this forms the uncertainty interval of the fuzzy functions, which includes the embedded fuzzy functions. During structure identification of the DIT2IFF system, one embedded fuzzy function, f(Φir,s,ψ), such as in (5.57), is approximated for each cluster i using each set 〈c*,mr,τs〉, producing as many output values as possible for each data point, as shown in Figure 5.22.
Fig. 5.22 Uncertainty interval of fuzzy function structures: (a) uncertainty interval represented by the different output values obtained from the list of fuzzy function structures induced by each tuple 〈mr,τs,Φp〉; (b) optimized uncertainty interval from the GLP, Phase 1 of EDIT2IFF; (c) magnified view of the different output values, ŷikr,s,ψ, for a specific x′ vector, obtained from the optimized list of fuzzy functions.
The top-left graph in Figure 5.22 represents the initial fuzzy functions at the start of the algorithm; they are the output values of the fuzzy functions obtained from each embedded model at the start of the GLP algorithm. The GLP identifies the optimum fuzzy function structures by identifying the optimum forms of membership value transformations to be used to approximate the local fuzzy
function structures. Hence, the algorithm narrows down this uncertainty interval in the initial step and shifts it to where the optimum interval is expected to lie by identifying selected fuzzy function structures, which may be the optimum ones, as shown in the top-right graph in Figure 5.22. In the second stage, the DIT2IFF method uses only these selected membership value transformations to identify the fuzzy functions and obtain different outputs for a given input data point. This way, the number of embedded fuzzy function models is reduced, and the system only deals with the candidate embedded models that were optimized in the first phase. The interval valued estimated output values of a given data point, x′, are displayed in the magnified view on the bottom graph of Figure 5.22. They are obtained from each local fuzzy function, identified with nf discrete fuzzy functions in a particular cluster and denoted ŷikr,s,ψ = fir,s,ψ(x′, c*, mr, τs, Φψ), i=1,…,c*, r=1,…,nr, s=1,…,nif, ψ=1,…,nf, e.g., such as in (5.49) and (5.50), and calculated using equations like (5.57). The fuzzy output value from each function is weighted with the corresponding membership values to calculate a single crisp output value using
$$\hat{y}_k^{\,q}=\frac{\sum_{i=1}^{c^*}\mu_{ik}^{r,s}\,\hat{y}_{ik}^{r,s,\psi}}{\sum_{i=1}^{c^*}\mu_{ik}^{r,s}} \qquad (5.58)$$
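Equation (5.58) amounts to a membership-weighted average of the local fuzzy function outputs. A minimal sketch for one data point k and one embedded model q = 〈mr, τs, Φψ〉 (the function name is ours):

```python
import numpy as np

def crisp_output(y_hat_clusters, mu_clusters):
    """Both arguments are arrays of shape (c_star,): local outputs and the
    improved membership values of the c* clusters for one data point."""
    return float(np.dot(mu_clusters, y_hat_clusters) / np.sum(mu_clusters))
```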
It should be noted from Figure 5.21 and Figure 5.22 that we deal with the membership value scatter matrices and obtain fuzzy function outputs in a scatter diagram to define the interval valued membership values and fuzzy functions for the identified optimum parameter lists. Based on the optimum performance measure, the optimum model parameters for each training data point are captured and retained in collection tables, such as in equations (5.49) and (5.50), to be used to infer the output values of testing and validation data samples as follows:
$$\arg\min_{q}\left(y_k-\hat{y}_k^{\,q}\right)^2\in\left\{q\;\middle|\;\nexists\,q':\left(y_k-\hat{y}_k^{\,q'}\right)^2<\left(y_k-\hat{y}_k^{\,q}\right)^2\right\} \qquad (5.59)$$
The rest of the structure identification is the same as the DIT2IFF structure identification method described in section 5.3. A sketch of the collection-table construction in (5.59) is given below.
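This is a hedged sketch of the selection rule in (5.59); the data layout is ours, for illustration. For each training point, the embedded model with the smallest squared error is recorded together with its parameter tuple.

```python
import numpy as np

def build_collection_table(y_true, y_hat_all, param_tuples):
    """y_true: (n,); y_hat_all: (n, Q) outputs of all Q embedded models;
    param_tuples: length-Q list of <m_r, tau_s, Phi_p> descriptors."""
    errors = (y_true[:, None] - y_hat_all) ** 2   # squared error per model
    best_q = errors.argmin(axis=1)                # equation (5.59)
    return [param_tuples[q] for q in best_q]      # one tuple per point
```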
Phase 3: Inference Method for Evolutionary Discrete Interval Type-2 IFF (EDIT2IFF)

The inference methodology of the new strategy is similar to the DIT2IFF approach. For each testing data vector, one crisp output value is obtained by applying the inference mechanism of the DIT2IFF method. The collection tables constructed in Phase 2 of the algorithm, e.g., equations (5.49) and (5.50), are used to infer crisp output values for testing cases.
One final note on the new type-2 models: the DIT2IFF, and its extension using a stochastic search technique, EDIT2IFF, are presented for regression type problem domains. We also adapted the two algorithms for classification problems:

- Discrete Interval Type-2 Improved Fuzzy Functions for Classification (DIT2IFF-C),
- Evolutionary Design of Discretized Interval Type-2 Improved Fuzzy Functions for Classification (EDIT2IFF-C).
The difference between the new type-2 strategies for the classification domain and the regression-domain strategies presented in this chapter is that the classification extensions are designed by changing the fuzzy function approximators into classification fuzzy functions, and the fitness evaluations into performance evaluation criteria for classification problems, such as the area under the ROC (receiver operating characteristic) curve (AUC) or the classification recognition percentage, to be discussed in the experiments section. In these classification extensions, we implemented the IFC-C classification method to find the improved membership values. Since the structure of the system modeling and inference modules is not affected by this change, we do not display these methodologies in detail. As the AUC criterion recurs in the experiments, a generic computation of it is sketched below.
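This is the standard rank-based (Mann-Whitney) formulation of AUC, not code from this work; ties in the scores are ignored for brevity.

```python
import numpy as np

def auc(y_true, scores):
    """y_true in {0, 1}; scores are the predicted class-1 degrees."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ranks
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    # Mann-Whitney U statistic of the positive class, normalized
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```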
5.6.3 Reduction of Structure Identification Steps of DIT2IFF Using the New EDIT2IFF Method
In the new discrete interval valued type-2 improved fuzzy functions (DIT2IFF) systems, the initial parameters are optimized with an exhaustive search based on the supervised learning method by iterating over the list of parameters: the degree of fuzziness (m), the number of clusters (c), the types of membership value transformations used to construct the matrices (τ, Φ) that identify the interim and local fuzzy functions, the alpha-cut constant, and the fuzzy function approximator parameters, e.g., when SVM is used, the regularization constant (Creg), kernel type, and error margin (ε). For each set of parameters, one T1IFF model is built, and the optimum set is determined based on cross validation analysis. At the beginning of the DIT2IFF algorithm, we implemented an exhaustive search method using T1IFF to optimize some of the parameters, i.e., the fuzzy function approximators' parameters and the number of clusters. Assuming each parameter has N different values (except the degree of fuzziness, which is set to m=2, and one set of fuzzy functions is used for the initial T1IFF models) and two different kernel types are iterated, the number of iterations of the initial exhaustive search is 2N⁵. Then, the boundaries of the degree of fuzziness, m, are identified by the user as [m-lower, m-upper], to be discretized in the search for the optimum values for each input data point. In addition, a list of possible structures of fuzzy functions is identified. Assuming the m interval is discretized into N values, there are N different fuzzy function structures, and the optimum values of the rest of the parameters from the initial T1IFF exhaustive search are used, the total number of times T1IFF is executed during structure identification is 2N⁵ + N^(2+c).
Table 5.4 Number of parameters of Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF)

Parameter | Number of discrete values
Initial parameter optimization based on exhaustive search using the T1IFF method:
  c: number of clusters | N
  m: degree of fuzziness | 1 (fixed at m=2)
  α-cut | N
  {τ, Φ}: structure of fuzzy functions | N
  Creg: regularization constant of SVR | N
  ε: epsilon constant of SVR | N
  Kernel type | 2
  Subtotal | 2N⁵
DIT2IFF optimization:
  Discrete values of m | N
  {τ}: interim fuzzy function types | N
  {Φ}: system fuzzy function types | N^c
  Total | 2N⁵ + N^(2+c)
On the other hand, for the proposed EDIT2IFF, let the total number of iterations of the genetic algorithm be N². In each iteration, one T1IFF is executed for each of the two child chromosomes of every crossover operation, and for the one child of the mutation of two selected parents. This is repeated for each m value in the chromosome, i.e., m-lower and m-upper. Roughly speaking, we set the number of iterations to 100, so the correspondence to the N different values of a T1IFF parameter listed in Table 5.4 is ~N². The population size is also set to N², since we use populations of 50-100 individuals in our experiments. Therefore, the initial number of iterations of EDIT2IFF is 2N² + 2·2N² = 6N². Then, using the optimum parameters, DIT2IFF is executed for the optimum m interval, which is discretized into N values. In addition, the combinations of the list of possible fuzzy function structures are used as separate N structures to execute the DIT2IFF method, even though there should be fewer than N values for these parameters, since the uncertainty interval is optimized in the genetic learning process of EDIT2IFF. Nonetheless, we converted the reduced uncertainty interval into the same number of discrete values. The total number of iterations of EDIT2IFF is shown in Table 5.5. It is evident from Table 5.4 and Table 5.5 that the number of iterations is reduced when EDIT2IFF is used instead of the DIT2IFF approach, i.e., 6N² + N^(2+c) < 2N⁵ + N^(2+c), which reduces to 3N² < N⁵ and holds for any integer N ≥ 2.
Table 5.5 The number of parameters of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF)

Parameter | Number of discrete values
Genetic Learning Process – Phase 1 of EDIT2IFF:
  Initial run: population size = N², with two T1IFF models per chromosome (one for each m value) | 2N²
  Secondary run: evaluation of 1 crossover and 1 mutation operation | 2N² × 2
  Subtotal | 6N²
DIT2IFF optimization:
  Discrete values of m | N
  {τ}: interim fuzzy function types | N
  {Φ}: system fuzzy function types | N^c
  Total | 6N² + N^(2+c)
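A quick numerical check of the totals in Tables 5.4 and 5.5 (c = 3 is an illustrative choice of ours):

```python
# The shared DIT2IFF term N^(2+c) appears in both totals and cancels in
# the comparison, so the gap is driven by 2N^5 versus 6N^2.

def dit2iff_runs(N, c=3):
    return 2 * N**5 + N**(2 + c)

def edit2iff_runs(N, c=3):
    return 6 * N**2 + N**(2 + c)

for N in (3, 5, 10):
    print(N, dit2iff_runs(N), edit2iff_runs(N))
# N=10: the 2N^5 = 200,000 exhaustive runs shrink to 6N^2 = 600.
```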
In the chapter on experiments, real datasets are used to measure the elapsed time of these two methods under the same initial parameter sets.
5.7 Summary

This chapter presented two novel Discrete Interval Type-2 Improved Fuzzy Functions strategies to identify uncertainties in system models. Structurally, the novel strategies are different from traditional type-2 fuzzy rule base approaches: they employ a new method to identify uncertainty in the learning parameters and function structures. Two different types of uncertainty are taken into consideration, namely the uncertainty in the selection of the improved fuzzy clustering parameters and the uncertainty in determining the mathematical model structure of each local fuzzy function. In the second novel strategy, the optimum parameters are captured and the uncertainty interval of fuzziness is identified with a genetic learning algorithm. This reduces the number of steps needed to identify the optimum parameters compared to the Discrete Interval Type-2 Improved Fuzzy Functions method based on exhaustive search. The heterogeneous, dynamic length chromosome structure makes it possible to optimize parameters with different domains in the same model, utilizing their cross-combination effects. Additionally, the new type-2 inference schema enables the employment of different membership functions and fuzzy function structures in different local structures, which helps to identify the uncertainty of the system model.
Chapter 6

Experiments
This chapter presents the results of experiments on benchmark and real life datasets to evaluate the performance of the proposed algorithms. The results are compared to those of other soft computing methods of system modeling. In this chapter, the performance of the proposed Fuzzy Functions approaches is analyzed against the performances of other well-known soft computing methods using several datasets. Our goal is to assess the prediction performance and robustness of the proposed methodologies on real life datasets under a variety of scenarios, by altering the system parameters using cross validation analysis.
6.1 Experimental Setup

6.1.1 Overview of Experiments

Information about the datasets used to test the performance of the proposed approaches against other well-known approaches is listed in Table 6.1. The datasets are classified into two parts based on their structures. Datasets 1 through 4 are regression type datasets, where the output variable has a continuous domain, y∈ℜ, and datasets 5 through 10 are classification datasets, where the output variable has a discrete domain. In these experiments, only binary classification datasets with a dichotomous output variable are used, i.e., y∈{0,1}. Dataset 3 of the regression type includes five different stock price datasets, which are analyzed differently from the rest of the regression datasets, Datasets 1, 2, and 4. In the next section, the sub-sampling cross validation method applied in each experiment is presented. Later, the performance measures listed in Table 6.1 are explained in more detail. The parameters are optimized with an exhaustive search method or with genetic algorithms, depending on the methodology used. Exhaustive search verifies all possible combinations of the optimized parameters, thus ensuring that the best possible solution will be found. Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They combine the survival-of-the-fittest rule with structured yet randomized information exchange. Genetic algorithms possess the best characteristics of other optimization methods, such as robustness and fast convergence, and do not depend on properties of the optimization criteria (for instance, smoothness).
Table 6.1 Overview of Datasets used in the experiments

No | Dataset                              | Type* | OBS**  | #Var§ | Training | Validation | Testing | Perform. Measure Usedð
1  | Friedman Artificial                  | R     | 9,791  | 5     | 500      | 250        | 9,000   | RMSE, MAPE, R²
2  | Auto-Mileage – UCI                   | R     | 398    | 8     | 125      | 45         | 100     | RMSE, MAPE, R²
3  | Stock Price Predict.: TD             | R     | 389    | 16    | 120      | 90         | 100     | RMSE, MAPE, RSTB, Ranking
   |   BMO                                | R     | 445    | 16    | 200      | 144        | 100     |
   |   Enbridge                           | R     | 445    | 16    | 200      | 144        | 100     |
   |   Loblaws                            | R     | 445    | 16    | 200      | 144        | 100     |
   |   Sun Life                           | R     | 445    | 16    | 200      | 144        | 100     |
4  | Desulphurization Process: Reagent1   | R     | 10,000 | 11    | 250      | 750        | 8,000   | RMSE, MAPE, R²
   |   Reagent2                           | R     | 10,000 | 11    | 250      | 750        | 8,000   |
5  | Liver Disorder – UCI                 | C     | 345    | 6     | 175      | 75         | 50      | Accuracy, ROC Curve/AUC, Several Ranking Methods
6  | Ionosphere – UCI                     | C     | 349    | 34    | 150      | 120        | 80      |
7  | Breast Cancer – UCI                  | C     | 277    | 9     | 130      | 70         | 50      |
8  | Diabetes – UCI                       | C     | 768    | 8     | 125      | 75         | 50      |
9  | Credit Scoring – UCI                 | C     | 690    | 15    | 150      | 75         | 50      |
10 | California Housing                   | C     | 20,640 | 9     | 500      | 500        | 12,600  |

The Training, Validation and Testing columns give the OBS used in the three-way cross validation.
* R: Regression, C: Classification Type Datasets. ** OBS: Total number of cases (i.e., instances, objects, data points, observations). § Var: Total number of attributes/features/variables in the dataset. ð Performance measures used to evaluate each model performance in comparative analysis. UCI: University of California, Irvine, Real Dataset Repository.
6.1.2 Three-Way Sub-sampling Cross Validation Method In this section, numerical examples are used to illustrate how the system-modeling methods are applied using the three-way cross validation method [Rowland, 2003]. In each experiment, the entire dataset is randomly separated into three parts: training, validation and testing. The training dataset is used to find the parameters of a model, the validation dataset is used to tune these parameters and find the optimum model, and the testing dataset is used to assess the performance of the optimum model (no tuning is done on the testing dataset). Figure 6.1 displays the general framework of the three-way cross validation method, which is used in building every algorithm presented in this work, i.e., both the proposed and the benchmark algorithms.
Fig. 6.1 General Framework of the Three-Way Cross Validation Method
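The random three-way partition of Figure 6.1 can be sketched in a few lines; the following Python illustration (our assumption of one reasonable implementation) returns index sets for the three parts:

import numpy as np

def three_way_split(n, n_train, n_valid, rng=None):
    """Randomly partition indices 0..n-1 into training, validation,
    and testing index arrays (testing gets the remainder)."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(n)
    train = perm[:n_train]
    valid = perm[n_train:n_train + n_valid]
    test = perm[n_train + n_valid:]
    return train, valid, test

# e.g., Friedman's dataset: 500 training, 250 validation, rest testing
# train_idx, valid_idx, test_idx = three_way_split(9791, 500, 250, rng=0)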
In Figure 6.2 the same three-way cross validation method is displayed, specific to the fuzzy function methodologies of this work. The three-way cross validation method applied to the stock price estimation datasets (Dataset #3 in Table 6.1) is slightly different from that applied to the rest of the datasets, especially in the construction of the testing dataset. Initially, the stock prices within the studied period are divided into two parts. Specific to the stock price estimation models, the data is not randomly divided, since these are time series data and the analysis requires continuity from one data vector to the next. The first period is used for constructing five different training and validation datasets. The last period, the final 100 trading days of each stock price, is used for testing purposes. An example of the sampling method is illustrated in Figure 6.3 using an artificial stock price dataset. Stock prices from the first part of the selected period are used for learning and optimizing the model parameters. Random samples are selected to construct the training and validation datasets at each repetition. The performance of the optimum model is evaluated on the last part of the time series, which we call the testing dataset.
Fig. 6.2 Three-way cross validation used in Fuzzy Functions approaches
[Figure 6.3 plots an artificial stock price series (closing price, roughly 55–75) against time (0–400 trading days), marking the learning and testing periods.]
Fig. 6.3 Schematic view of the Three-Way Cross Validation Process that is used for stock price estimation models
Each experiment is repeated k times, e.g., k={5,10,…}, by selecting different samples of different sizes from the pool of vectors to create the training and validation datasets (the process in Figure 6.1 is repeated k times). In Table 6.1, we display the number of instances of the training, validation and testing datasets which are used when building the models of each experiment. It should be stressed that, in order to make a fair comparison between the proposed and benchmark methods, the very same training, validation and testing datasets are used to build the models for each algorithm. In particular, the training, validation, and testing data samples that are used to evaluate the proposed fuzzy functions approaches are also used by the benchmark methods, e.g., SVM regression, ANFIS, DENFIS, etc., to learn, validate, and test their performance. Similarly, each algorithm evaluates its optimum parameters
using the same testing datasets. Let the performance of each method be represented with a tuple $\langle \overline{PM}, st_{PM} \rangle$, where $\overline{PM}$ represents the average of the performances obtained from the testing datasets over the k repetitions, and $st_{PM}$ is their standard deviation. The extraction of the values of the tuples for each methodology is shown in Table 6.2. Table 6.2 Calculation of the overall performance of a method based on three-way cross validation results. The overall performance is represented with the tuple $\langle \overline{PM}, st_{PM} \rangle$.
Cross Validation Repetition | Performance Measure obtained from Testing Dataset
1 | PM_1
2 | PM_2
⋮ | ⋮
k | PM_k

$\overline{PM} = \frac{1}{k}\sum_{j=1}^{k} PM_j, \qquad st_{PM} = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(PM_j - \overline{PM}\right)^2}$
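The tuple above is straightforward to compute; a minimal numpy sketch (our illustration; note the formula divides by k, i.e., the population standard deviation):

import numpy as np

def overall_performance(pm_per_repetition):
    """Return (mean, std) of a performance measure over the k
    cross-validation repetitions, matching the formulas above."""
    pm = np.asarray(pm_per_repetition, dtype=float)
    return pm.mean(), pm.std(ddof=0)  # ddof=0 -> divide by k

# e.g., overall_performance([0.94, 0.93, 0.95, 0.94, 0.94])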
6.1.3 Measuring Models’ Prediction Performance In the experiments, different types of evaluation criteria are used to analyze the prediction performances of the proposed fuzzy system modeling strategies in comparison to the benchmark methods, listed next. The type of evaluation criterion (performance measure) mostly depends on the structure of the system domain. Therefore, we separate the performance measures of the regression and classification problem domains. Additionally, a new performance measure is introduced specifically for stock-price estimation models. 6.1.3.1 Performance Evaluations of Regression Experiments Let y_k and ŷ_k∈ℝ represent the actual and model output values of a datum k, respectively. In this work, to evaluate the performances of each methodology, four different functions are used for the regression type datasets, where the observed output variable has a continuous domain, viz. y∈ℝ:
1. Root mean square error, $RMSE = \sqrt{\frac{1}{n}\sum_{k=1}^{n} (y_k - \hat{y}_k)^2}$
2. Mean absolute percentage error, $MAPE = \frac{1}{n}\sum_{k=1}^{n} \frac{|y_k - \hat{y}_k|}{y_k} \cdot 100$
3. Coefficient of determination, $R^2 = 1 - \frac{SS_E}{SS_T}$, where $SS_T = \sum_k (y_k - \bar{y})^2$ is the total sum of squares and $SS_E = \sum_k (y_k - \hat{y}_k)^2$ is the error sum of squares
4. Robust Simulated Trading Benchmark (RSTB) – to be explained later.

Here $y_k$ denotes the actual output and $\hat{y}_k$ the predicted output of datum k.
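A minimal Python sketch of the first three measures (our illustration):

import numpy as np

def regression_measures(y, y_hat):
    """RMSE, MAPE (in percent), and R^2 as defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    mape = np.mean(np.abs(y - y_hat) / y) * 100  # assumes y != 0
    ss_e = np.sum((y - y_hat) ** 2)              # error sum of squares
    ss_t = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1.0 - ss_e / ss_t
    return rmse, mape, r2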
RMSE: one of the most commonly used performance measures. It is useful for understanding the deviation between the predicted output and the actual observed output, and it is still widely accepted and used for performance evaluation. Recent publications, e.g., [Salakhutdinov et al. 2007], have shown the usage of RMSE as one of the acceptable performance measures. Thus, in the experiments, we used this measure where applicable. RMSE=0 means that the model output exactly matches the observed output.

MAPE: Mean Absolute Percentage Error is a commonly used statistical measure of goodness of fit in quantitative forecasting methods. It produces a measure of relative overall fit, normalized by the observed outputs. MAPE=0 means that the model output exactly matches the observed output.

R²: defined as the coefficient of determination. Regardless of the structure of a model, one can always compute the total variance of the dependent variable (total sum of squares, SST), the proportion of variance due to the residuals (error sum of squares, SSE), and the proportion of variance due to the regression model (regression sum of squares, SSR = SST − SSE). The ratio of the regression sum of squares to the total sum of squares, SSR/SST = 1 − (SSE/SST), explains the proportion of variance in the dependent variable (y) accounted for by a model; thus, this ratio is equivalent to R-square (0 ≤ R² ≤ 1, the coefficient of determination). This measure helps to evaluate how well the model fits the data. R²=1 indicates that the model can explain all the variability of the output variable, while R²=0 means otherwise.

RSTB: A new performance measure, the Robust Simulated Trading Benchmark, is introduced specifically for stock price prediction problems. In any model of a trading system, the main goal is to improve its profitability. A profitable prediction is a better prediction even if it is less accurate based on other criteria, e.g., accuracy in predicting the next-day direction of a stock. In [Deboeck, 1992] it was shown that a neural network that correctly predicted the next-day direction 85% of the time consistently lost money: although the system correctly predicted market direction, the prediction accuracy was low. Hence, the evaluation of trading models should not just be based on the predicted directions of stock price movements. In addition, as will be shown in the results of the stock price predictions in the next section, the accuracies of the benchmark methods are not always significantly different from one another, which makes it difficult to identify a single best model for the estimation of stock prices. Since the aim of stock trading models is to return a profit, profitability should be the performance measure. For these reasons, on top of the well-known performance
measures for regression models, here we introduce a new criterion, the Robust Simulated Trading Benchmark (RSTB), based on the profitability of the models that are used to predict the stock prices. The RSTB combines three different properties to form one performance measure, namely the market directions, prediction accuracy and robustness of the models. RSTB is driven by a conservative trading approach. The higher the RSTB, the better the profitability of the model. The details of the new RSTB are presented in the analysis of stock prices in section 6.4.4.
6.1.3.2 Performance Evaluations of Classification Experiments Classification type datasets used in this work have output variables of dichotomous structure, e.g., y∈{0, 1}. To evaluate the performance on the classification datasets, three different criteria are used, as follows:
• Accuracy,
• Area Under the ROC Curve (AUC),
• Ranking Methods: Average Rank (AR), Success Rate Ratio (SRR), Significance Win Ratio (SWR), Percent Improvement Ratio (PIR).
Accuracy: Classification accuracies are measured based on the contingency table shown in Table 6.3. Table 6.3 Contingency Table used to calculate accuracy
                | Predicted Positive | Predicted Negative
Actual Positive | True Positives     | False Negatives
Actual Negative | False Positives    | True Negatives
$\text{accuracy }(\%) = \frac{(\text{True Positives}) + (\text{True Negatives})}{\text{number of data } (n_d)} \qquad (6.1)$
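Equation (6.1) in code form (our sketch; the 0.5 threshold is only an example, since the text notes that the threshold itself is tuned during learning):

import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct predictions, Eq. (6.1):
    (TP + TN) / n_d, after thresholding the class probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return (tp + tn) / y_true.size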
The maximum accuracy that a test can have is 1; the minimum is 0. Ideally, we want a test to have an accuracy close to 1. Since the classification model outputs are probabilities, different threshold values (to discern between the two classes) are varied to obtain the optimum True Positives (TPs) and True Negatives (TNs) during the learning stage of each modeling approach. The threshold values identified during structure identification are then used during inference to estimate the class labels of the testing datasets. ROC: Receiver Operating Characteristics uses the prediction probabilities directly in the model performance evaluations. With most algorithms, such as logistic regression and support vector machines, we obtain prediction probabilities instead of prediction labels. Accuracy measures do not directly consider these probabilities. In addition, accuracy is not a good measure when there is a large difference between the number
of positive and negative instances. However, in many data mining applications, such as ranking customers, we need more than crisp predictions, i.e., we need the predicted probabilities. The probabilities show the true ranking, and this way the possible information loss due to discretization of the predicted output based on an unknown threshold is prevented. Thus, it is more appropriate to use a very common validation technique, the receiver operating characteristics (ROC) curve [Swets, 1995; Bradley, 1997], which uses the probabilities as inputs to evaluate models, instead of the accuracy measure. In [Huang and Ling, 2005], the performance of ROC analysis is discussed in comparison to the simple accuracy measure. It was mathematically shown that the area under the ROC curve (AUC), to be discussed next, should replace accuracy in measuring and comparing classification methods. Their argument originated from the fact that the accuracy and AUC measures obtained from the same methods were not always correlated. In the classification experiments of this work, we have come across the same situation, where the accuracy measures are not correlated with the AUC values. Even though we list both performance measures in the Appendix, the analysis for the classification datasets is based on the AUC performance measure. Fig. 6.4 A sample Receiver Operating Characteristic (ROC) curve
A sample ROC curve is shown in Figure 6.4. The idea behind the ROC curve is that one defines various possible cut-off points C_j and classifies each data vector with a probability higher than C_j as a potential success (positive) and lower than C_j as a potential failure (negative). Thus, at each cut-off point a hit rate (True Positive Rate), $TPR(C_j) = TP(C_j)/N_P$, is identified, where TP(C_j) is the number of correctly predicted positives at the given cut-off C_j and N_P is the total number of actual positive outputs. Also, a false alarm rate (False Positive Rate), $FPR(C_j) = FP(C_j)/N_N$, is defined, where FP(C_j) is the number of negative output instances which are incorrectly predicted as positive instances at a given cut-off C_j and N_N is the total number of negative instances. Thus, the ROC curve is the plot of the TPRs as a function of the FPRs, and the performance measure to evaluate is the area under the curve above the diagonal, as depicted in Figure 6.4. As the area under the ROC curve increases (towards the perfect classifier), the higher the AUC and the prediction power would be.
The TPR and FPR values used to obtain the ROC curve are calculated as follows. The scores produced by a classifier represent the probabilities of each datum belonging to one class (for binary classifiers, usually one class becomes the base class, e.g., y=1), and these are sorted in descending order. A threshold is to be determined in order to predict the class labels of each datum. By varying the threshold, different values for TPR and FPR are obtained; the ROC curve can be used to show the trade-off of errors at the different thresholds. Figure 6.5 shows an example of a ROC curve on a dataset of twenty instances. The instances, ten positive and ten negative, are also shown. In the table on the right-hand side of Figure 6.5, the instances are sorted by their scores (class probabilities), and each point in the ROC graph is labeled by the threshold that produces it. A threshold of +∞ produces the point (0,0). As we lower the threshold to 0.9, the first instance is classified positive and the rest are classified negative. At this point TPR=1/10=0.1 and FPR=0/10=0, yielding the point (0,0.1) on the ROC curve. As the threshold is lowered further, the TPR and FPR values are re-calculated to obtain a point for each possible threshold value. The ideal point on the ROC curve is (0,1), that is, all positive examples are classified correctly and no negative examples are misclassified as positive.
Fig. 6.5 The ROC “curve” created by varying the threshold values. The table at right shows 20 data points and the score (probability) assigned to each of them. The graph on the left shows the corresponding ROC curve, with each point labeled by the threshold that produces it.
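The threshold sweep just described, together with the trapezoidal area computation of Eq. (6.2) below, can be sketched as follows (our Python illustration):

import numpy as np

def roc_points(y_true, scores):
    """Sweep the threshold over the sorted scores and collect
    (FPR, TPR) points, as in the Figure 6.5 walk-through."""
    y_true = np.asarray(y_true)
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]  # threshold = +infinity -> point (0, 0)
    for c in sorted(np.unique(scores))[::-1]:
        pred = np.asarray(scores) >= c
        tpr.append(((pred == 1) & (y_true == 1)).sum() / n_pos)
        fpr.append(((pred == 1) & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Trapezoidal integration of the ROC curve (cf. Eq. 6.2)."""
    return np.trapz(tpr, fpr)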
AUC: By comparing ROC curves, one can analyze the classification performance differences between two or more classifiers. The higher the curve, that is, the nearer to the perfect classifier, the higher the accuracy would be. Sometimes the curve for one classifier is superior to that of another, that is, one curve is higher than the other throughout the diagram; therefore, a summary measure is given by the Area Under the ROC curve (denoted AUC ∈ [0,1]) [Breiman et al., 1984]. The curve that has a higher AUC is better than the one that has a smaller AUC. If two ROC curves intersect, which makes it hard to differentiate them visually, the AUC provides the comparison between the models. The simplest way to calculate the AUC is by trapezoid integration of the ROC curve, such as in Figure 6.5, as follows:
$AUC = \sum_{k=2}^{n_d} \left[ (1-\beta_k)\,\Delta\alpha_k + \frac{1}{2}\,\Delta(1-\beta_k)\,\Delta\alpha_k \right] \qquad (6.2)$
where $\Delta(1-\beta_k) = (1-\beta_k) - (1-\beta_{k-1})$ and $\Delta\alpha_k = \alpha_k - \alpha_{k-1}$. Here $\beta = 1 - TPR$, $\alpha = FPR$, and $n_d$ denotes the number of data points.
Ranking – Identification of the most adequate classification algorithm for a problem is usually a very difficult task, since many different classification algorithms are available and they originate from different areas such as statistics, machine learning and soft computing. In this respect, additional ranking methods, to be discussed next, are used as performance measures for the benchmark analysis on the classification problem domains. Among many ranking methods, average ranks (AR), success rate ratios ranking (SRR), and significance win ranking (SWR) [Brazdil and Soares, 2000] are used to generate an ordering of the different algorithms based on the experimental results obtained from the different datasets. A new ranking method, namely the percent improvement ratio (PIR), is presented to rank the performance improvements of each methodology based on the AUC values. The best algorithm is determined based on the average results obtained from these ranking methods.
AR –: Average Rank uses individual rankings to derive an overall ranking. This is a simple ranking method that orders the values of the performance measures, where each value refers to the average of the measure over all folds of the cross-validation procedure. The best algorithm is assigned rank 1, the runner-up rank 2, and so on. Let $r_j^i$ be the rank of algorithm j on dataset i. The average rank is calculated for each algorithm by $\bar{r}_j = \left(\sum_i r_j^i\right)/n_d$, where $n_d$ is the total number of datasets. The final
ranking is obtained by ordering the average ranks and assigning ranks to the algorithms accordingly.
SRR –: Success Rate Ratios Ranking measures the ratios of success rates between pairs of algorithms. The performances of the methodologies are compared based on the ratio of their success rates. Let $PM_j^i$ be the measured performance of method j on dataset i. The SRR between two methodologies on one dataset is $SRR_{j,k}^i = PM_j^i / PM_k^i$. Thus, the higher the $SRR_{j,k}^i$, the better methodology j compared to methodology k. We then calculate the pairwise mean success rate ratio, $\overline{SRR}_{j,k} = \left(\sum_i SRR_{j,k}^i\right)/n_d$, for each pair of methodologies j and k over the datasets, where $n_d$ is the number of datasets. This measure is an estimation of the general advantage/disadvantage of methodology j over methodology k. Finally, the overall mean success rate ratio for each methodology is measured by $SRR_j = \left(\sum_k \overline{SRR}_{j,k}\right)/(m-1)$, where m is the
number of methods. Then, the ranking is derived from this measure.
SWR –: Significance Win Ratio Ranking measures the significance of the differences in performance between the algorithms. In this work, the paired Student's t-test is used, because the numbers of datasets and algorithms are small. Firstly, the significance of the difference in performance between each pair of algorithms is measured individually for all datasets. We say that an algorithm j is significant over algorithm k on dataset i when the probability of the t-test is less than p<0.05, based on the hypothesis that the two algorithms possess about the same performance. Each of these values is kept in win tables for each dataset. Next, the pairwise estimate of the probability of winning for each pair of algorithms, $pw_{j,k}$, is calculated by dividing the number of datasets where algorithm j is significantly better than algorithm k by the number of datasets, $n_d$. This value denotes how often algorithm j is significantly better than algorithm k. Finally, we calculate the overall estimate of the probability of winning for each algorithm by $pw_j = \left(\sum_k pw_{j,k}\right)/(m-1)$. These values are used
as a basis for constructing the overall ranking.
PIR –: Percent Improvement Ratio measures the percentage improvement between the algorithms over all classification datasets. Firstly, the percentage improvement of each methodology in comparison to the rest of the methodologies is measured individually for all datasets. Let $PM_j^i$ and $PM_k^i$ be the measured performances of methodologies j and k on dataset i, respectively. We denote the performance improvement of an algorithm j compared to algorithm k by $PIR_{j,k}^i = \left(PM_j^i - PM_k^i\right)/PM_k^i$. Thus, values larger than zero indicate performance improvement, where higher values of PM are preferable. We then calculate the pairwise percent improvement ratio, $\overline{PIR}_{j,k} = \left(\sum_i PIR_{j,k}^i\right)/n_d$, for each pair of methodologies j and k over the datasets, where $n_d$ is the number of datasets. This measure is an estimation of the general performance improvement of methodology j over methodology k. Finally, the overall mean percent improvement ratio for each methodology is measured by $PIR_j = \left(\sum_k \overline{PIR}_{j,k}\right)/(m-1)$, where m is the number of methods. Then, the ranking is derived from this measure.
In the last section of the experiments, the elapsed times of the system model identification and inference methods of each benchmark system modeling tool, as well as of the proposed modeling approaches, will be compared, and their observed performance will be discussed at the end.
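Before turning to the benchmark parameters, the AR, SRR, and PIR computations defined above can be made concrete with the following sketch (our illustration), which derives all three from an (m methods × nd datasets) matrix of performance values such as AUCs, where higher values are better; the SWR step is omitted since it additionally requires the per-dataset t-tests:

import numpy as np

def rank_methods(pm):
    """pm: (m methods x nd datasets) array of performance values.
    Returns average ranks (AR), overall mean success-rate ratios
    (SRR), and overall mean percent improvement ratios (PIR)."""
    m, nd = pm.shape
    # AR: rank 1 = best on each dataset, then average over datasets
    ranks = (-pm).argsort(axis=0).argsort(axis=0) + 1
    ar = ranks.mean(axis=1)
    # SRR_{j,k} and PIR_{j,k}, averaged over datasets then over k != j
    ratio = pm[:, None, :] / pm[None, :, :]              # PM_j / PM_k
    impr = (pm[:, None, :] - pm[None, :, :]) / pm[None, :, :]
    off = ~np.eye(m, dtype=bool)                         # exclude k = j
    srr = (ratio.mean(axis=2) * off).sum(axis=1) / (m - 1)
    pir = (impr.mean(axis=2) * off).sum(axis=1) / (m - 1)
    return ar, srr, pir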
6.2 Parameters of Benchmark Algorithms In this section, we present the list of learning parameters of benchmark algorithms for function estimation and classification methods of soft computing and machine
learning. These methods are applied to the benchmark datasets in order to compare their results with the results obtained from the proposed methodologies of this work on the same datasets.
6.2.1 Support Vector Machines (SVM) Support Vector Machines (SVM) is a learning algorithm which can be applied to both classification and regression problem domains, via support vector classification (SVC) and support vector regression (SVR) methods. There are a number of learning parameters that can be used in constructing SVMs for regression and classification [Cherkassky and Ma, 2003]. For SVR, the two most relevant parameters are the insensitivity zone indicator, epsilon, and the penalty/regularization parameter, Creg, which determines the trade-off between the training error and the flatness of the function. Creg is the common parameter of the SVR and SVC methods. The way these parameters are determined, either by the user or based on the characteristics of the dataset under study, may be critical for model performance. For instance, if Creg is chosen too large, the penalty for non-separable points is high and there is less stress on the weight parameter, so many support vectors may be stored; this increases the possibility of over-fitting. If it is chosen too small, we may face the problem of under-fitting. In practice, the parameter Creg is varied through a wide range of values and the optimal performance is assessed using a separate validation set. For SVR, the value of epsilon in the ε-insensitive loss function should also be selected. Epsilon affects the smoothness of the SVM's response as well as the number of support vectors. In SVM methods, many kernel-mapping functions can be used; however, a few kernel functions have been found to work well in a wide variety of applications. A default and recommended kernel function for non-linear mappings is the Radial Basis Function (RBF), K(u,v)=exp{−δ||u−v||²}, δ=1/σ², where σ is the width of the Gaussian, as explained in App. 39 in Appendix C.2. We also used linear kernel functions, e.g., K(u,v)=uᵀv. The learning parameters of the SVC and SVR methods that are used in the benchmark models are given in Table 6.4. The LIBSVM [Chang and Lin, 2001] SVM toolbox written for MATLAB is used to implement the SVR and SVC methodologies. Table 6.4 Learning parameters of Support Vector Machines for classification and regression methods
Learning Parameters | SVM for Classification (SVC) | SVM for Regression (SVR)
Structure           | Classification               | Regression
C-regularization    | {2^-1, 2^0, …, 2^7, 2^8}     | {2^-1, 2^0, …, 2^7, 2^8}
Epsilon             | N/A                          | {0.01, 0.03, …, 0.5}
Gamma               | 1                            | 1
Kernel Type         | Linear, Radial Basis Function | Linear, Radial Basis Function
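The Table 6.4 grid can be reproduced outside MATLAB as well; the sketch below uses scikit-learn's SVR purely as an illustrative stand-in for the LIBSVM MATLAB toolbox used in the book (the helper name and the validation-RMSE selection criterion are our assumptions):

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Parameter grid from Table 6.4 (Creg = 2^-1 ... 2^8)
C_GRID = [2.0 ** p for p in range(-1, 9)]
EPS_GRID = list(np.arange(0.01, 0.51, 0.02))

def tune_svr(x_tr, y_tr, x_va, y_va, kernel="rbf"):
    """Pick (C, epsilon) with the lowest validation RMSE."""
    best, best_rmse = None, float("inf")
    for C in C_GRID:
        for eps in EPS_GRID:
            model = SVR(kernel=kernel, C=C, epsilon=eps, gamma=1.0)
            model.fit(x_tr, y_tr)
            rmse = mean_squared_error(y_va, model.predict(x_va)) ** 0.5
            if rmse < best_rmse:
                best, best_rmse = model, rmse
    return best, best_rmse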
6.2.2 Artificial Neural Networks (NN) In this work, the Neural Networks (NN) toolbox of MATLAB is used to build the NN system models. In NN methods, the system expert identifies certain parameters, and the system optimizes the weights between the neurons along with the parameters associated with the optimization method determined by the user. The learning parameters that should be identified to execute NN in MATLAB are as follows:
• The neural network structure, e.g., feed-forward, etc.
• The number of hidden layers
• The number of neurons in each layer
• The optimization method used to identify the weights linking the neurons (back-propagation, competitive, etc.)
• The parameters of the optimization method
• The transfer functions of each neuron.
One of the well-known and widely used NN approaches is the 1-layer NN, which uses non-linear or linear transfer functions in the hidden and output layers for function approximation problems. As the number of layers increases, the structure identification gets stronger, since the network's capability to learn system behavior with higher order non-linearity increases; however, with additional layers, over-fitting of the network becomes inevitable. Therefore, in this work we used only a 1-layer NN structure with the following learning parameters: Table 6.5 Learning parameters of 1-Layer Neural Networks
Learning Parameters | 1-Layer Neural Network
Structure | Multi-Layer Feed-forward
Number of hidden layers | 1 – a three-layer NN (input + hidden + output)
Number of neurons in the hidden layer | 50
Transfer function of the hidden layer | Hyperbolic tangent sigmoid
Transfer function used in the output layer | Linear
Optimization Method | Back-propagation
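An approximate counterpart of the Table 6.5 configuration, sketched with scikit-learn's MLPRegressor as a stand-in for the MATLAB NN toolbox (an assumption on our part; MLPRegressor's output layer is linear by default, matching the table):

from sklearn.neural_network import MLPRegressor

# One hidden layer of 50 tanh neurons, trained by (stochastic)
# gradient-descent back-propagation, as in Table 6.5.
nn = MLPRegressor(hidden_layer_sizes=(50,),
                  activation="tanh",
                  solver="sgd",
                  max_iter=1000,
                  random_state=0)
# nn.fit(x_train, y_train); y_hat = nn.predict(x_test)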
6.2.3 Adaptive-Network-Based Fuzzy Inference System (ANFIS) The ANFIS method [Jang, 1993], one of the most popular hybrid fuzzy system models, represents fuzzy rule bases as a neural network. The inference parameters of the network are learnt by using back-propagation or hybrid optimization methods. The ANFIS system identification implemented in MATLAB has two alternatives for identifying the number of rules in the rule base. The first one is a subjective method, in which the number of clusters used to segment each input variable is provided by the user prior to model execution. The
number of rules is then determined by taking the product of the numbers of clusters; hence, the fuzzy rule base structure is formed by taking the Cartesian product of the clusters of each variable. This method can identify the parameters of Takagi-Sugeno or Mizumoto type fuzzy rule bases. The challenge with this method is that, when the number of input variables is large, the algorithm suffers from long execution times or may not converge at all. The second system identification method of ANFIS, which is used in this work, is the subtractive clustering (SC) method of ANFIS. Its parameters are dynamically determined by the algorithm. It only generates Takagi-Sugeno type fuzzy rule base structures, where the consequents are first order regression functions. Even though the SC method of ANFIS dynamically specifies the number of rules, there are still other parameters that the user needs to specify prior to model execution. This version of ANFIS is the closest one to the Type-1 Fuzzy Functions approaches; therefore, we used only the SC method of ANFIS in the analysis. The learning parameters of the subtractive clustering ANFIS method, as defined in [MathWorks, 2002], are as follows:
• Range of Influence: specifies a cluster center's range of influence in each of the data dimensions.
• Quash Factor: the factor used to multiply the radii values that determine the neighborhood of cluster centers, in order to quash the potential of outlying points to be considered as part of the corresponding cluster.
• Accept Ratio: sets the potential, as a fraction of the potential of the first cluster, above which another data point will be accepted as a cluster center.
• Reject Ratio: sets the potential, as a fraction of the potential of the first cluster center, below which a data point will be rejected as a cluster center.
• Shape of the input membership functions
• Method of optimization, i.e., hybrid or back-propagation optimization algorithm
• Number of epochs
The following learning parameters are used to implement the ANFIS SC method in the experiments: Table 6.6 Learning parameters of the Adaptive Network Fuzzy Inference System – ANFIS (Takagi-Sugeno) Subtractive Clustering Method
Learning Parameters | ANFIS SC method
Range of Influence | 0.5
# of epochs | 50
Shape of input membership functions | Gaussian
Method of optimization | Hybrid
Reject ratio | 0.15
Accept ratio | 0.50
Quash factor | 1.25
6.2.4 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS) DENFIS [Kasabov, 2002] is a neuro-fuzzy system modeling approach which has shown good performance in time-series analysis applications. It has two different models, an online and an offline application; both types are explained in Chapter 2. In this work the offline model is used for the benchmark analysis, because empirical datasets are used. DENFIS uses the Takagi-Sugeno type fuzzy inference method, which uses the same function structure for each cluster. DENFIS is one of the closest methods to the proposed evolutionary type-1 fuzzy functions approaches. In the benchmark analysis, the DENFIS offline model with batch mode training is applied using the MATLAB ECOS toolbox [Kasabov, 2002]. The DENFIS MATLAB version employs the first order Takagi-Sugeno type inference engine using linear functions, as well as higher order Takagi-Sugeno models with polynomial functions or multi-layer perceptrons. In this work, we used the higher-order Takagi-Sugeno fuzzy inference engine, which employs several small-size multi-layer perceptrons (the hidden layer consists of two or three neurons) to realize the function f in the consequent part of each fuzzy rule, instead of using a predefined function. This mode of DENFIS is the closest one to the proposed fuzzy functions structures, which implement non-linear higher order functions to capture the learning parameters of the inference. The following learning parameters are used to evaluate the DENFIS method. Table 6.7 Learning parameters of the Dynamically Evolving Neuro-Fuzzy Inference System DENFIS, Offline Learning with Higher order Takagi-Sugeno (TS) inference
Learning Parameters | DENFIS method
Training mode | Higher order TS
Learning style | Offline
Distance threshold | {0.02, 0.05, 0.08, 0.1}
Number of epochs for creating a Higher order TS fuzzy rule | 50
The distance threshold determines the structure of the clusters during the evolving clustering method of DENFIS, which is explained in Chapter 2. This parameter is iterated during the structure identification of DENFIS to obtain the optimum distance threshold.
6.2.5 Discrete Interval Valued Type-2 Fuzzy Rule Base (DIT2FRB) One of the closest uncertainty modeling approaches to the proposed type-2 fuzzy sets based approach is the method of discrete interval type-2 fuzzy rule
bases [Uncu and Turksen, 2007], DIT2FRB for short. Applications of this system modeling tool on regression problem domains have demonstrated superior results compared to other well-known soft computing and machine learning methods in terms of the reduction of the prediction error. We used this methodology in the benchmark analysis to compare its performance with that of the proposed approaches. DIT2FRB implements Takagi-Sugeno methods [Takagi and Sugeno, 1985], where the consequent parts of the rules are represented with first or higher order (polynomial) functions and interval valued type-2 fuzzy sets, and it uses a search based type reduction method. Due to the similarity of the structure identification method of DIT2FRB to the proposed type-2 fuzzy systems, we used this method in our analysis. The structure identification of this method, as explained in Chapter 5, utilizes the FCM clustering algorithm [Bezdek, 1981a] in order to determine the basic model structure of a given system. The learning parameters of the DIT2FRB method that are given by the user are as follows:
• The minimum number of rules, cmin, of the fuzzy rule base,
• The maximum number of rules, cmax, of the fuzzy rule base,
• The set of fuzziness values, m, that control the degree of overlap between clusters,
• The seed of the random number generator used to assign random membership values in order to initialize the FCM clustering algorithm.
Additionally, DIT2FRB adjusts the inference parameters using the following learning parameters specified by the user:
• The maximum number of iterations, iter, that the tuning algorithm will execute,
• The maximum number of iterations, iterFT, that the fine-tuning module of the tuning algorithm will execute.
The following values of learning parameters are used in the DIT2FRB method: Table 6.8 Learning parameters of Type-2 Fuzzy Rule Base Approach - DIT2FRB
Learning Parameters | DIT2FRB method
Inference Type | Takagi Sugeno
FCM cmin | 2
FCM cmax | 10
m-values | {1.01, 1.2, …, 2.6}
seed | 0
iter | 10
iterFT | 10
6.2.6 Genetic Fuzzy System (GFS) Genetic algorithms have been one of the useful tools in global optimization, given their ability to efficiently use past information to produce new, improved solutions.
Genetic fuzzy systems (GFS), which are briefly presented in Chapter 2, concentrate on a fundamental problem of fuzzy system modeling, that is, the development of the number and shapes of the fuzzy sets, the number and structure of the rules, etc. There has been a diversity of approaches to fuzzy modeling. Since in this work we introduce the fuzzy functions methodologies, the general class of conventional fuzzy system models in which the rule conclusions form local regression models is used as the benchmark methodology. The genetic fuzzy system that is used in this work is based on such rule base systems, namely Takagi-Sugeno rule base systems. The GFS modeling algorithm used in this work can be broken into five steps:
1) Generate a basic Takagi-Sugeno model.
2) Set the bounds for all the input membership functions and rule consequents. Initialize the population of parameters.
3) Execute the genetic algorithm and set the new parameters of the model.
4) Evaluate the model after each generation by measuring the absolute modeling error.
5) Return the model with the best fitness value.
In this study, given several sets of input and output data, the subtractive clustering (SC) method of MATLAB is initially executed to extract the initial Takagi-Sugeno type rules, instead of using random assignments. The rule extraction method first uses the clustering method to determine the number of rules and the antecedent membership functions, and then uses linear least squares estimation to determine each rule's consequent equations. Subsequently, the genetic algorithm refines the membership function parameters to obtain the optimum Takagi-Sugeno type rule base model. Table 6.9 Initial Parameters of the Genetic Fuzzy System
Learning Parameters | GFS method
Inference Type | Takagi Sugeno
Membership function type | Triangular
Number of membership functions | 3-5
GA maximum iteration number | 150
Population Size | 100
Crossover Rate | 2%
Mutation Rate | 1%
The learning parameters of the new structure identification methods, which implement evolutionary methods, are optimized with a GA using the Genetic Optimization Toolbox (GAOT) in MATLAB [Houck et al., 1995]. At each GA optimization, mutation and crossover are applied with a population size of 100, and the algorithm is executed for 150 iterations. For parameter genes, we used arithmetic and simple crossover and non-uniform and uniform mutation operators. For control genes,
simple crossover and shift mutation operators are utilized. Tournament selection based on an elitist strategy, in which the individual with the best fitness is directly passed to the next generation, is employed. For the stopping criterion, the maximum number of iterations is used (100 iterations by default). The crossover and mutation rates indicate the fraction of the population that undergoes the crossover and mutation operators. These rates are generally kept low, since one wants to observe gradual changes in the chromosomes; thus, in this work, the crossover and mutation operators are applied to 2% and 1% of the population, respectively, in each experiment. The GFS of this work optimizes the membership function parameters of triangular form, i.e., the center and spread parameters. The membership function encoding for GFSs is described in Chapter 2. Each chromosome encodes a complete fuzzy rule set. The parameters of the triangular membership functions are optimized during the GA process, and first-order TSK models are constructed.
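A generic sketch of a GA loop with the ingredients configured above (tournament selection, elitism, low crossover and mutation rates); this is our simplified Python illustration, not the GAOT toolbox itself, and the fitness function is a hypothetical placeholder:

import numpy as np

def run_ga(fitness, n_genes, pop_size=100, iters=150,
           cx_rate=0.02, mut_rate=0.01, rng=None):
    """Minimal real-coded GA: tournament selection, elitism,
    arithmetic crossover, uniform mutation."""
    rng = np.random.default_rng(rng)
    pop = rng.random((pop_size, n_genes))
    for _ in range(iters):
        fit = np.array([fitness(ind) for ind in pop])
        elite = pop[fit.argmax()].copy()          # elitist strategy
        # tournament selection (size 2)
        a, b = rng.integers(pop_size, size=(2, pop_size))
        pop = np.where((fit[a] > fit[b])[:, None], pop[a], pop[b])
        # arithmetic crossover on a small fraction of adjacent pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < cx_rate:
                w = rng.random()
                pop[i], pop[i + 1] = (w * pop[i] + (1 - w) * pop[i + 1],
                                      w * pop[i + 1] + (1 - w) * pop[i])
        # uniform mutation on a small fraction of genes
        mask = rng.random(pop.shape) < mut_rate
        pop[mask] = rng.random(mask.sum())
        pop[0] = elite                            # keep the best individual
    fit = np.array([fitness(ind) for ind in pop])
    return pop[fit.argmax()]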
6.2.7 Logistic Regression (LR), Fuzzy K-Nearest Neighbor (FKNN) The logistic regression and fuzzy K-nearest neighbor approaches, as explained in Appendices C.5 and C.6 respectively, do not require any prior learning parameters in order to identify the system model structure.
6.3 Parameters of Proposed Fuzzy Functions Algorithms In this section, the parameters of the proposed algorithms are listed. The proposed algorithms that implement the standard Fuzzy C-Means (FCM) clustering method are referred to as “Fuzzy Functions Methods”, whereas the proposed methods that implement the Improved Fuzzy Clustering (IFC) method are referred to as “Improved Fuzzy Functions Methods”. The Genetic Optimization Toolbox (GAOT) in MATLAB [Houck et al. 1995] is used to apply the genetic algorithms. The GAOT is based on three basic steps: setting the bounds of each parameter, proceeding through the search space and validating the model for each parameter set, and evaluating the fitness function of that set of parameters.
6.3.1 Fuzzy Functions Methods There are four different extensions of the proposed Fuzzy Functions Methods, as presented in Chapters 4 and 5. These are:
• T1FF – Type-1 Fuzzy Functions (Chapter 4)
• DIT2FF – Discrete Interval Type-2 Fuzzy Functions (Chapter 5)
• ET1FF – Evolutionary Type-1 Fuzzy Functions (Chapter 4)
• EDIT2FF – Evolutionary Discrete Interval Type-2 Fuzzy Functions (Chapter 5).
The initialization parameters of these methodologies are summarized in Table 6.10. We did not set the number of clusters to more than 10, since a large number of clusters may result in over-fitting.
Table 6.10 The Parameters of the Type-1 and Type-2 Fuzzy Functions Methods for Regression Problems

Parameters | T1FF | DIT2FF | ET1FF | EDIT2FF
FCM number of clusters* | [2,…,10] | [2,…,10] | min c = 2, max c = 10 | min c = 2, max c = 10
FCM m-values | {1.2, 1.3, …, 2.6} | {1.2, 1.3, …, 2.6} | m-lower = 1.2, m-upper = 2.6 | m-lower = 1.2, m-upper = 2.6
α-cut for clusters | {0, 0.1} | {0, 0.1} | min-α-cut = 0, max-α-cut = 0.1 | min-α-cut = 0, max-α-cut = 0.1
Fuzzy Function Type for Regression | LSE and SVM | LSE and SVM | LSE and SVM | LSE and SVM
C-reg | {2^-1, 2^0, …, 2^7, 2^8} | {2^-1, 2^0, …, 2^7, 2^8} | min Creg = 2^-1, max Creg = 2^8 | min Creg = 2^-1, max Creg = 2^8
epsilon | {0.01, …, 0.5} | {0.01, …, 0.5} | min-ε = 0.01, max-ε = 0.5 | min-ε = 0.01, max-ε = 0.5
Gamma | 1 | 1 | 1 | 1
Kernel Type | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function
For classification type datasets, the classification problem-domain extensions of the Fuzzy Functions Methods are applied. These are:
• T1FF-C – Type-1 Fuzzy Functions method for classification
• DIT2FF-C – Discrete Interval Type-2 Fuzzy Functions for classification
• ET1FF-C – Evolutionary Type-1 Fuzzy Functions for classification
• EDIT2FF-C – Evolutionary Discrete Interval Type-2 Fuzzy Functions for classification.
The initialization parameters of these methodologies are summarized in Table 6.11.
Table 6.11 The Parameters of the Type-1 and Type-2 Fuzzy Functions Methods for Classification Problems

Parameters | T1FF-C | DIT2FF-C | ET1FF-C | EDIT2FF-C
FCM number of clusters* | [2,…,10] | [2,…,10] | min c = 2, max c = 10 | min c = 2, max c = 10
FCM m-values | {1.2, 1.3, …, 2.6} | {1.2, 1.3, …, 2.6} | m-lower = 1.2, m-upper = 2.6 | m-lower = 1.2, m-upper = 2.6
α-cut for clusters | {0, 0.1} | {0, 0.1} | min-α-cut = 0, max-α-cut = 0.1 | min-α-cut = 0, max-α-cut = 0.1
Fuzzy Function Type for Classification | LR, SVC | LR, SVC | LR, SVC | LR, SVC
C-reg | {2^-1, 2^0, …, 2^7, 2^8} | {2^-1, 2^0, …, 2^7, 2^8} | min Creg = 2^-1, max Creg = 2^8 | min Creg = 2^-1, max Creg = 2^8
Gamma | 1 | 1 | 1 | 1
Kernel Type | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function
6.3.2 Improved Fuzzy Functions Methods There are four different extensions of the proposed Improved Fuzzy Functions Methods, which apply the proposed Improved Fuzzy Clustering (IFC) algorithm, as presented in Chapters 4 and 5:
• T1IFF – Type-1 Improved Fuzzy Functions (Chapter 4)
• DIT2IFF – Discrete Interval Type-2 Improved Fuzzy Functions (Chapter 5)
• ET1IFF – Evolutionary Type-1 Improved Fuzzy Functions (Chapter 4)
• EDIT2IFF – Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (Chapter 5).
The initialization parameters of these methodologies are summarized in Table 6.12.
Table 6.12 The Parameters of the Type-1 and Type-2 Improved Fuzzy Functions Methods for Regression

Parameters | T1IFF | DIT2IFF | ET1IFF | EDIT2IFF
IFC number of clusters* | [2,…,10] | [2,…,10] | min c = 2, max c = 10 | min c = 2, max c = 10
IFC m-values | {1.2, 1.3, …, 2.6} | {1.2, 1.3, …, 2.6} | m-lower = 1.2, m-upper = 2.6 | m-lower = 1.2, m-upper = 2.6
α-cut for clusters | {0, 0.1} | {0, 0.1} | min-α-cut = 0, max-α-cut = 0.1 | min-α-cut = 0, max-α-cut = 0.1
IFC Type for Regression | LSE and SVM | LSE and SVM | LSE and SVM | LSE and SVM
C-reg | {2^-1, 2^0, …, 2^7, 2^8} | {2^-1, 2^0, …, 2^7, 2^8} | min Creg = 2^-1, max Creg = 2^8 | min Creg = 2^-1, max Creg = 2^8
epsilon | {0.01, …, 0.5} | {0.01, …, 0.5} | min-ε = 0.01, max-ε = 0.5 | min-ε = 0.01, max-ε = 0.5
Gamma | 1 | 1 | 1 | 1
Kernel Type | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function

*For small datasets, small numbers of clusters are used, e.g., c*∈[2,…,10].
For classification type datasets, as shown in Table 6.1, the classification problem-domain extensions of the Improved Fuzzy Functions Methods are applied. These are:
• T1IFF-C – Type-1 Improved Fuzzy Functions method for classification (Chapter 4)
• DIT2IFF-C – Discrete Interval Type-2 Improved Fuzzy Functions for classification (Chapter 5)
• ET1IFF-C – Evolutionary Type-1 Improved Fuzzy Functions for classification (Chapter 4)
• EDIT2IFF-C – Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions for classification (Chapter 5).
The initialization parameters of these methodologies are summarized in Table 6.13.
Table 6.13 The Parameters of the Type-1 and Type-2 Improved Fuzzy Functions Methods for Classification Problems

Parameters | T1IFF-C | DIT2IFF-C | ET1IFF-C | EDIT2IFF-C
IFC-C number of clusters | [2,…,15] | [2,…,15] | min c = 2, max c = 15 | min c = 2, max c = 15
IFC-C m-values | {1.2, 1.3, …, 2.6} | {1.2, 1.3, …, 2.6} | m-lower = 1.2, m-upper = 2.6 | m-lower = 1.2, m-upper = 2.6
α-cut for clusters | {0, 0.1} | {0, 0.1} | min-α-cut = 0, max-α-cut = 0.1 | min-α-cut = 0, max-α-cut = 0.1
IFC Type for Classification | LR and SVC | LR and SVC | LR and SVC | LR and SVC
C-reg | {2^-1, 2^0, …, 2^7, 2^8} | {2^-1, 2^0, …, 2^7, 2^8} | min Creg = 2^-1, max Creg = 2^8 | min Creg = 2^-1, max Creg = 2^8
Gamma | 1 | 1 | 1 | 1
Kernel Type | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function | Linear, Radial Basis Function

*For small datasets, small numbers of clusters are used, e.g., c*∈[2,…,10].
The Improved Fuzzy Functions methods for classification problem domains apply the new Improved Fuzzy Clustering for Classification (IFC-C) method. The evolutionary extensions of the fuzzy function strategies, i.e., ET1FF, EDIT2FF, ET1IFF, EDIT2IFF, use genetic algorithm methods to optimize the learning parameters during structure identification. The genetic algorithm methods encode the learning parameters as genetic codes, i.e., chromosomes. Hence, there are specific parameters that the domain expert should specify before executing the genetic fuzzy systems. The proposed structure identification methods, which implement evolutionary methods, are optimized with a GA using the GAOT MATLAB program [Houck et al., 1995]. At each GA optimization, mutation and crossover are applied with a population size of 100, and the algorithm is executed for 100 iterations. For parameter genes, we used arithmetic and simple crossover and non-uniform and uniform mutation operators. For control genes, simple crossover and shift mutation operators are utilized. Tournament selection based on an elitist strategy, in which the individual with the best fitness is directly passed to the next generation, is employed. The maximum number of iterations is used as the stopping criterion (100 iterations by default). The next section presents the results of the experiments. Datasets from real life applications, from the UCI machine learning repository [Newman et al. 1998], and from StatLib [Meyer and Vlachos] are chosen to demonstrate the effectiveness of the proposed approaches against other approaches based on the different performance criteria listed in section 6.1.3. Each of these datasets will be analyzed separately, and a summary of the results is given at the end of this chapter.
6.4 Analysis of Experiments – Regression Domains 6.4.1 Friedman’s Artificial Domain To compare the performance of the proposed methods with the benchmark methods, an artificial modeling problem proposed by Friedman [1991] is used. The model is the following five-input function:
$f(\mathbf{x}) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 10x_5 + N(0,1) \qquad (6.3)$
N(0,1) is zero mean, unit variance additive Gaussian noise, corresponding to approximately 20% noise, and the inputs were generated independently from a uniform distribution on the interval [0,1]. The non-linear relations between the inputs and the output are shown in Figure 6.6. There are 9,791 vectors in the dataset, and the dataset is randomly divided into two parts containing 791 and 9,000 observations. From the first part, 500 and 250 vectors are selected randomly to construct the training and validation datasets, respectively. For each methodology, we used the same training, validation, and testing datasets. The experiments are repeated 5 times with different subsets of the above sizes, and the regression performances are measured using the RMSE, MAPE, and R² measures. In the following experiments, the best results are indicated with bold numbers.
Fig. 6.6 Scatter diagram of selected variables from Friedman Artificial dataset
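Equation (6.3) is easy to reproduce; the numpy sketch below (our illustration, keeping the coefficients exactly as printed in Eq. (6.3), although the coefficient of x5 is often given as 5 elsewhere in the literature) generates the data and the 791/9,000 split described above:

import numpy as np

def friedman_dataset(n=9791, rng=0):
    """Generate Friedman's five-input benchmark as in Eq. (6.3)."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=(n, 5))
    y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
         + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3] + 10 * x[:, 4]
         + rng.normal(0.0, 1.0, size=n))   # N(0,1) noise
    return x, y

# x, y = friedman_dataset()
# pool, test = (x[:791], y[:791]), (x[791:], y[791:])  # 791 + 9,000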
6.4.1.1 Results of Friedman’s Artificial Dataset The results obtained from the application of the benchmark and proposed methodologies on Friedman’s artificial dataset are compared based on the R², MAPE and RMSE values of the testing datasets. The results of each cross validation experiment are displayed in Appendix Tables D.1 to D.9. In this section, we will only present the R² values of the models.
Type-1 Fuzzy Functions. The proposed Type-1 Fuzzy Functions (FF) and Improved Fuzzy Functions (IFF) models include T1FF, T1IFF, ET1FF and ET1IFF. Sub-sampling cross validation is iterated five times, and the performance measures are averaged over these five models. The R² values of the experiments are shown in Table 6.14.
Table 6.14 R² values obtained from the application of the Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman’s Artificial Dataset

R² (Stdev)    | T1FF          | T1IFF         | ET1FF          | ET1IFF
Train R²      | 0.97 (0.008)  | 0.97 (0.009)  | 0.97 (0.003)   | 0.97 (0.003)
Validation R² | 0.94 (0.007)  | 0.94 (0.006)  | 0.94 (0.006)   | 0.94 (0.009)
Test R²       | 0.939 (0.001) | 0.939 (0.001) | 0.942 (0.001)* | 0.942 (0.004)

*The optimum models are indicated in bold.
The numbers in each cell represent the average R² values on the training, validation and testing datasets from five different cross validation experiments. The values in parentheses indicate the standard deviation of the R² values over the five iterations; this measure indicates the robustness of the models. The higher the R² and the lower the standard deviation on the testing datasets, the more accurate and robust the prediction models would be. The optimum parameters of the Type-1 FF and IFF models obtained from the cross validation experiments are shown in Table 6.15.
Table 6.15 Optimum Parameters of the Variations of the Type-1 Fuzzy Functions Approach

Parameter        | T1FF           | T1IFF          | ET1FF          | ET1IFF
Regression Type  | Non-linear SVM | Non-linear SVM | Non-linear SVM | Non-linear SVM
# of clusters    | {8,10}         | {8}            | {7,8}          | {6,7,8}
Alpha-cut        | {0, 0.1}       | {0, 0.1}       | [0,0.1]        | [0,0.1]
Fuzziness degree | {1.5,1.8}      | {1.8}          | {1.6,1.8}      | {1.3,1.6,1.85}
Creg             | {2, 2^4}       | {2}            | {6.2,8.3}      | [3.5,55]
Epsilon          | {0.05}         | {0.05}         | {0.02,0.08}    | [0.014,0.1]
Other Well-Known Type-1 Fuzzy System Models and Statistical Learning Methods. Type-1 fuzzy systems based on rule bases, namely ANFIS, DENFIS, and GFS, which are well-known hybrid fuzzy systems, are applied to the same training, validation and testing datasets. An Artificial Neural Network, a soft computing method, and non-linear SVR are also applied to the same datasets. The R² values of these methods are as follows:
Table 6.16 R² values obtained from the application of the Benchmark Approaches on Training-Validation-Testing Datasets of Friedman’s Artificial Dataset and their optimum model parameters

R² (Stdev)    | ANFIS         | DENFIS        | NN            | GFS           | SVM
Train R²      | 0.999 (0.001) | 0.917 (0.010) | 0.887 (0.04)  | 0.985 (0.002) | 0.966 (0.004)
Validation R² | 0.486 (0.211) | 0.863 (0.006) | 0.870 (0.053) | 0.735 (0.156) | 0.939 (0.007)
Test R²       | 0.444 (0.23)  | 0.855 (0.007) | 0.873 (0.004) | 0.728 (0.155) | 0.938 (0.001)
Average values of parameters from cross validation | Number of rules {51} | Threshold = 0.1, Number of rules {39} | Number of neurons = 50 | Number of rules = {49, 50} | RBF, Creg: {2}, Epsilon = {0.05}, # of support vectors: 432

*The optimum model is indicated in bold.
Type-2 Fuzzy Functions. The proposed uncertainty modeling methods are applied to Friedman’s Artificial dataset to build system models which can identify the parameter and fuzzy function uncertainties.
Table 6.17 R² values obtained from the application of the Type-2 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman’s Artificial Dataset

R² (Stdev)    | DIT2FF        | DIT2IFF       | EDIT2FF       | EDIT2IFF
Train R²      | 0.988 (0.005) | 0.990 (0.005) | 0.979 (0.006) | 0.980 (0.007)
Validation R² | 0.985 (0.006) | 0.968 (0.006) | 0.975 (0.004) | 0.956 (0.013)
Test R²       | 0.936 (0.004) | 0.924 (0.004) | 0.946 (0.001) | 0.940 (0.003)

*The optimum models are indicated in bold.
The optimum parameters of the Type-2 FF and Type-2 IFF models obtained from the cross validation experiments are shown as follows:
Table 6.18 Optimum Parameters of the Variations of the Type-2 Fuzzy Functions Approach

Parameter        | DIT2FF         | DIT2IFF        | EDIT2FF        | EDIT2IFF
Regression Type  | Non-linear SVM | Non-linear SVM | Non-linear SVM | Non-linear SVM
# of clusters    | {8,9,10}       | {8,10}         | {6,7}          | {6,7,8}
Fuzziness degree | [1.3,2.3]      | [1.2,2.0]      | [1.2,1.6]      | [1.62,1.76]
Alpha-cut        | {0, 0.1}       | {0, 0.1}       | [0,0.1]        | [0,0.1]
Creg             | {2,64}         | {2,32}         | ~[1.37,1.74]   | {4.34,11.77}
Epsilon          | {0.05}         | {0.05}         | {0.08}         | {0.02,0.095}
Other Type-2 Fuzzy System Models – DIT2FRB. The same cross validation sub-samples are used to train, validate and test the recent Type-2 Fuzzy Rule Base method (DIT2FRB) models. The R² results are as follows:
Table 6.19 R² values obtained from the application of the earlier Type-2 Fuzzy Rule Base (DIT2FRB) Approach on Training-Validation-Testing Datasets of Friedman’s Artificial Dataset

R² (Stdev)         | DIT2FRB
Train R²           | 0.972 (0.003)
Validation R²      | 0.908 (0.011)
Test R²            | 0.883 (0.009)
Optimum Parameters | c* = {5,6}, m = [1.4,1.9]
The average R² values on the testing datasets of the optimum models of the Type-1 Fuzzy Functions, Type-2 Fuzzy Functions and the benchmark methods are depicted in one graph in Figure 6.7.
[Figure 6.7 is a chart of the average cross validation test R² values on Friedman’s Artificial Dataset: GFS 0.728, DENFIS 0.855, NN 0.873, SVM 0.94, DIT2FRB 0.88, ET1IFF 0.94, EDIT2FF 0.95.]
Fig. 6.7 Average Cross Validation Test R² values of the Optimum Models – Friedman’s Artificial Dataset. Standard errors of the five repetitions are shown on the curve.
Based on the analysis of the results, the optimum models for Friedman’s dataset are the proposed EDIT2FF and ET1IFF models and the SVM regression method. EDIT2FF applies Fuzzy C-Means (FCM) clustering and ET1IFF applies the proposed Improved Fuzzy Clustering (IFC) during structure identification. Both methods find the optimum type-2 fuzzy function intervals based on evolutionary search methods. It should be noted that the SVM performances are very close to those of the optimum proposed models (the R² differences are about 1%). The ANFIS model’s average R² values were considerably lower than those of the rest of the system modeling techniques; this is because the ANFIS training error was very low, indicating over-fitting.
The optimum proposed methodology EDIT2FF applies the discrete interval valued type-2 fuzzy functions method, which utilizes the standard FCM clustering method. On the other hand, the optimum proposed methodology ET1IFF utilizes the Improved Fuzzy Clustering (IFC) method but uses type-1 improved fuzzy function structures. The two models therefore have different optimum parameter sets.

The parameters of the fuzzy function structures of the optimum models of the best proposed methodology, EDIT2FF, are retained in collection tables, i.e., m-Col* and Φ-Col*. Each row of the m-Col* collection table holds the optimum degree of fuzziness parameter, m*, of the optimum embedded type-1 fuzzy function model identified for the corresponding training vector. Hence the m-Col* table is an (n×1) matrix, where n indicates the number of training vectors in the particular cross validation training dataset of Friedman’s Artificial dataset. The Φ-Col* collection table holds the local fuzzy function structures of each cluster i, i.e., the optimum transformations of the fuzzy functions and their parameters, Ŵi,k, k=1,…,n, i=1,…,c*, of the optimum embedded type-1 fuzzy function model identified for the corresponding training data vector k in each cluster i=1,…,c*. There is one collection table for each cross validation iteration, i.e., there are five different collection tables, in order to do inference on the five different testing datasets. In Appendix D.2, samples of the collection tables of Friedman’s Artificial Dataset are shown.

Unlike for the EDIT2FF models, the parameters of the fuzzy function structures of the optimum ET1IFF models are just the list of fuzzy functions for each cluster, which are also presented in Appendix D.2. Collection tables are not used in ET1IFF models, since the fundamental structure of its learning method is based on the type-1 fuzzy functions strategy, whereas EDIT2FF utilizes the type-2 fuzzy functions strategy. In particular, EDIT2FF identifies interval valued membership values and fuzzy functions, i.e., the footprint-of-uncertainty of the membership value scatter diagram for each cluster and a discretized list of optimum fuzzy function structures. The optimum embedded models from within these uncertainty intervals are identified during the structure identification of EDIT2FF, so that one embedded model for each training vector can be selected, and the parameters of these models are stored in the collection tables, as demonstrated in Appendix D.2.

In this work, the two-sample left-tailed t-test with 95 percent confidence level is used to indicate the significance of the optimum models of each methodology. This test of statistical significance is used in the experiments to show the strengths of the algorithms. The t-test results of the Friedman’s Artificial Dataset models are shown in Table 6.20. We assume that increasing the R² value by at least 0.025 points (2.5 percent) indicates an improvement in performance, and we build our hypothesis on this assumption. The null hypothesis of the t-test experiments states that the performances of the two paired algorithms are significantly different. The null hypothesis used for the system modeling of Friedman’s artificial dataset is as follows:
$$H_0:\ \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{j,cv}\right) - \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{k,cv}\right) > 2.5\%$$
The H0 indicates that the difference between the average R2 values (obtained from the five cross-validation models) of methodology j (row) and methodology k (column) is greater than 2.5% (0.025 points in R2 value). Failing to reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95% confidence level, i.e., that methodology j is significantly better than methodology k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different.

Table 6.20 Two-sample left-tailed t-test results (p<0.05) for Friedman's Artificial Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R).
          | ANFIS   | DENFIS | NN      | GFS     | SVM    | DIT2FRB
T1FF      | FR 1    | FR 1   | FR 0.97 | FR 0.98 | R 0    | FR 1
T1IFF     | FR 1    | FR 1   | FR 0.97 | FR 0.98 | R 0    | FR 1
DIT2FF    | FR 1    | FR 1   | FR 0.96 | FR 0.97 | R 0    | FR 1
DIT2IFF   | FR 1    | FR 1   | FR 0.9  | FR 0.97 | R 0    | FR 0.96
ET1FF     | FR 1    | FR 1   | FR 0.97 | FR 0.98 | R 0    | FR 1
ET1IFF*   | FR 1    | FR 1   | FR 0.98 | FR 0.98 | R** 0  | FR 1
EDIT2FF*  | FR 1    | FR 1   | FR 0.98 | FR 0.98 | R** 0  | FR 1
EDIT2IFF  | FR 1    | FR 1   | FR 0.97 | FR 0.98 | R 0    | FR 1

* Indicate the best methods based on the performance measure. Shaded cells indicate the best methods in the rows that are significantly better than the benchmark methods on the intersecting column. ** Indicate the best methods in the rows that are not significantly better than the benchmark methods on the intersecting column.
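The mechanics of one cell of this table can be sketched in a few lines. This is an illustration only, not the authors' code: the two arrays of five cross-validation R2 values below are hypothetical, and scipy's two-sample t-test is used with the proposed sample shifted down by the 2.5-point margin so that a left-tailed test addresses the stated null hypothesis.

```python
import numpy as np
from scipy import stats

# Hypothetical R^2 values over five cross-validation folds (not the book's data).
r2_row = np.array([0.86, 0.85, 0.88, 0.87, 0.86])   # proposed methodology j (row)
r2_col = np.array([0.80, 0.79, 0.82, 0.81, 0.80])   # benchmark methodology k (column)

MARGIN = 0.025  # the 2.5-point R^2 improvement threshold used in the text

# Left-tailed two-sample t-test on (R^2_j - margin) versus R^2_k.
# A large p-value fails to reject H0, i.e., the row method exceeds the
# column method by at least the margin (the book's FR decision).
t_stat, p_val = stats.ttest_ind(r2_row - MARGIN, r2_col,
                                equal_var=True, alternative='less')
decision = 'FR' if p_val > 0.05 else 'R'
print(f't = {t_stat:.2f}, p = {p_val:.2f} -> {decision}')
```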
The t-test table in Table 6.20 is interpreted as follows. Each cell indicates the rejection of, or the failure to reject, the null hypothesis; the number below each decision is the probability of observing that decision (FR or R). Each cell entry should be read as: "The methodology in the row is (not) significantly better than the methodology in the column." Small probabilities cast doubt on the validity of the null hypothesis. In the experiments of this work, if the probability is more than 5 percent, then the null hypothesis is not rejected (fail to reject); in other words, we fail to reject the null hypothesis with 95 percent confidence. For instance, in Table 6.20, for the t-test between the row method ET1IFF and the column method GFS, "FR (0.98)" indicates that the R2 value of the proposed ET1IFF method is at least 2.5 points higher than that of the GFS method on average over the five different cross-validation datasets. The optimum models of the proposed ET1IFF and EDIT2FF methods in rows 6 and 7 are significantly different from the benchmark methods in columns 1, 2, 3, 4, and 6 with high probabilities, except for the SVM models in column 5. The optimum models of the proposed methodologies are not significantly better than the SVM models in this experiment. The main reason for this is the given structure of the dataset. When the dataset is
constructed with defined functions, the proposed models are only as good as the rest of the benchmark methods. There are various regression performance measures in the literature that one could use to compare the performances of different models. Some of these measures are highly correlated; hence one should use only the one that best represents the performance of the given models [Mild and Natter, 2001]. In this experiment, we used three different performance measures: RMSE, R2, and MAPE. It should be pointed out from the RMSE, R2, and MAPE graphs in Figure 6.8 that there is correlation between the values of each individual optimum model (values are listed in Appendix Tables D.1 to D.9). Thus, we further investigated whether there is a relationship between these performance measures and, if there is, we wanted to measure the degree of such a relationship. Therefore, the coefficient of correlation between the values of the three performance measures is calculated for each training, validation and testing dataset in each cross-validation experiment. The results of the analysis are shown in Figure 6.9.
Fig. 6.8 Cross Validation Test R2 values of Optimum Models – Friedman's Artificial Dataset. Standard errors of five repetitions are shown on the curve.
(Plotted series in Figure 6.9: RMSE versus MAPE, RMSE versus R2, and R2 versus MAPE.)
Fig. 6.9 Correlation between Performance Measures of each Friedman's Training, Validation and Testing Dataset for five cross validation models. cviXXX represents the cross validation models, TR: training, VR: validation, TE: testing datasets.
It can be observed from the above figures that, for each cross-validation model, all performance measures are highly correlated. Before we decide on using only one of the error measures to evaluate the methodologies, the correlation between these performance measures will once more be analyzed in the next experiment.
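The three error measures and the correlation analysis just described are straightforward to reproduce. The following is a minimal sketch with synthetic data (not the Friedman results); the formulas for RMSE, R2, and MAPE follow their standard definitions.

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)           # sum of squared errors
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - sse / sst

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

# Synthetic predictions from several hypothetical models on one testing set.
rng = np.random.default_rng(0)
y = rng.uniform(5, 25, size=200)
models = [y + rng.normal(0, s, size=y.size) for s in (0.5, 1.0, 1.5, 2.0)]

scores = np.array([[rmse(y, m), r_squared(y, m), mape(y, m)] for m in models])

# Pairwise correlation of the three measures across models,
# mirroring the analysis reported in Figure 6.9.
corr = np.corrcoef(scores.T)
print(np.round(corr, 3))
```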
6.4.2 Auto-mileage Dataset

6.4.2.1 Dataset

In this experiment, the proposed approaches are employed as a decision support system for the determination of automobile miles per gallon (MPG) consumption. The dataset is available from the UCI Repository of Machine Learning Databases [Newman et al., 1998]. In this nonlinear relational problem, eight different input variables are used to predict an automobile's fuel consumption, as shown in Table 6.21.

Table 6.21 Auto-Mileage Dataset Variables

Variable | Name
output   | mpg
Input 1  | cylinders
Input 2  | displacement
Input 3  | horsepower
Input 4  | weight
Input 5  | acceleration
Input 6  | model year
Input 7  | origin
Input 8  | name
After removing the instances with missing values, i.e., six instances, the dataset is reduced to 392 data vectors of eight input variables. Stepwise regression using various non-linear transformations of the variables is applied to the entire dataset to reduce the number of input variables. Hence, among the eight input variables, only four of them, namely weight, acceleration, origin and model year, are selected. In order to test the performance of the proposed model, we used a three-way sub-sampling cross-validation approach. The dataset is randomly divided into three parts: 125 training vectors to apply the training algorithm, 45 samples for the validation dataset to validate the model and to find the optimum model parameters, and 100 testing vectors to test the optimum model performance. Experiments are repeated with five random subsets of the above sizes. The R2, MAPE and RMSE values are used to measure the performance of the models by averaging the error rates of the testing samples across the five repetitions for each model.

6.4.2.2 Results from Auto-mileage Dataset

The results of the application of the benchmark and proposed methodologies on the Auto-Mileage dataset are analyzed based on R2, MAPE, and RMSE values obtained from the testing datasets. These results are displayed in Appendix Tables D.10 – D.18. Here, we only display the R2 values of the methods.

Type-1 Fuzzy Functions

The proposed Type-1 Fuzzy Functions models include T1FF, T1IFF, ET1FF and ET1IFF. Three-way sub-sampling cross validation is iterated five times. The R2 results of these methods are shown in Table 6.22.

Table 6.22 R2 values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Auto-Mileage Dataset
R2 (Stdev) | T1FF          | T1IFF         | ET1FF         | ET1IFF
Train-R2   | 0.926 (0.008) | 0.928 (0.009) | 0.918 (0.021) | 0.91 (0.05)
Valid-R2   | 0.857 (0.02)  | 0.861 (0.024) | 0.862 (0.03)  | 0.866 (0.042)
Test-R2    | 0.851 (0.047) | 0.863 (0.044) | 0.84 (0.035)  | 0.83 (0.055)

*The optimum models are indicated in bold.
The numbers in each cell represent the average R2 values on the training, validation and testing datasets from the cross-validation experiments. The values in parentheses indicate the standard deviation of the R2 values over five iterations; this measure indicates the robustness of the models. The higher the R2 and the lower the standard deviation, the more accurate and robust the prediction models are. The optimum parameters of the Type-1 Fuzzy Functions models obtained from the cross-validation experiments are shown in Table 6.23.

Table 6.23 Optimum Parameters of the variations of Type-1 Fuzzy Functions Approach
Parameter        | T1FF            | T1IFF          | ET1FF          | ET1IFF
Regression Type  | Non-linear SVM  | Non-linear SVM | Non-linear SVM | Non-linear SVM
# of clusters    | {3,5,7}         | {3,5,7}        | {4,7,8,10}     | {4,6,7,8}
Fuzziness degree | {1.3,1.5,2.0}   | {1.3,1.5}      | [1.16, 2]      | [1.23, 2.13]
Alpha-cut        | {0.1}           | {0, 0.1}       | [0, 0.1]       | [0, 0.1]
Creg             | 2               | 2              | [6.02, 68.39]  | [11.03, 113.5]
Epsilon          | {0.01, 0.1}     | 0.1            | [0.2, 0.35]    | [0.18, 0.30]
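To make the parameters in Table 6.23 concrete, the following is a schematic sketch of the general type-1 fuzzy functions idea: cluster the inputs with FCM, fit one support vector regression per cluster on inputs augmented with the membership values, and combine the per-cluster predictions by membership weighting. It assumes the scikit-fuzzy and scikit-learn libraries and synthetic data, and it omits the alpha-cut filtering and membership transformations of the actual algorithms; it is an illustration, not the authors' implementation.

```python
import numpy as np
import skfuzzy as fuzz
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(150, 4))              # synthetic inputs
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=150)

c, m = 3, 1.5                                      # cluster count and fuzziness (cf. Table 6.23)
cntr, u, *_ = fuzz.cluster.cmeans(X.T, c, m, error=1e-4, maxiter=300)

models = []
for i in range(c):
    mu = u[i]                                      # membership of every sample in cluster i
    Z = np.column_stack([X, mu])                   # augment inputs with memberships
    f = SVR(kernel='rbf', C=2.0, epsilon=0.1)      # Creg and epsilon as in Table 6.23
    f.fit(Z, y, sample_weight=mu)                  # emphasize vectors close to the cluster
    models.append(f)

# Crisp output: membership-weighted combination of the per-cluster functions.
y_hat = sum(u[i] * models[i].predict(np.column_stack([X, u[i]]))
            for i in range(c)) / u.sum(axis=0)
```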
Other Type-1 Fuzzy System Models and Statistical Learning Methods

Type-1 Fuzzy System models based on rule bases, namely ANFIS, DENFIS and GFS, which are well-known hybrid fuzzy systems, are applied to the same training, validation and testing datasets. An Artificial Neural Network, a soft computing method, and non-linear SVR are also applied to the same datasets. The R2 results of these methods are summarized as follows:
Table 6.24 R2 values obtained from the application of Benchmark Approaches on Training-Validation-Testing Datasets of Auto-Mileage Dataset and their optimum model parameters

R2 (Stdev)      | ANFIS                | DENFIS                              | NN                       | GFS                   | SVM
Train-R2        | 0.956 (0.01)         | 0.903 (0.014)                       | 0.898 (0.01)             | 0.898 (0.02)          | 0.937 (0.01)
Validation-R2   | 0.619 (0.26)         | 0.853 (0.032)                       | 0.843 (0.04)             | 0.846 (0.04)          | 0.842 (0.05)
Test-R2         | 0.767 (0.07)         | 0.845 (0.04)                        | 0.841 (0.04)             | 0.84 (0.04)           | 0.848 (0.05)
Avg. parameters | Number of rules {25} | Threshold = 0.05, Number of rules {32} | Number of neurons = 50 | Number of rules = {8} | Creg = {2}, epsilon = {0.01, 0.1}

*The optimum model is indicated in bold.
Type-2 Fuzzy Functions

The proposed uncertainty modeling methods are applied to the Auto-Mileage dataset to identify the parameter and fuzzy function uncertainties.

Table 6.25 R2 values obtained from the application of variations of Type-2 Fuzzy Functions Approaches on Training-Validation-Testing Datasets of Auto-Mileage Dataset
R2 (Stdev) | DIT2FF       | DIT2IFF      | EDIT2FF      | EDIT2IFF
Train-R2   | 0.945 (0.01) | 0.951 (0.01) | 0.94 (0.01)  | 0.935 (0.03)
Valid-R2   | 0.891 (0.02) | 0.896 (0.01) | 0.908 (0.03) | 0.915 (0.02)
Test-R2    | 0.853 (0.04) | 0.858 (0.04) | 0.847 (0.04) | 0.863 (0.03)

*The optimum model is indicated in bold.
The optimum parameters of the Type-2 FF and Type-2 IFF models obtained from cross-validation experiments are shown in Table 6.26.

Table 6.26 Optimum Parameters of variations of Type-2 Fuzzy Functions Approach

Parameter        | DIT2FF         | DIT2IFF        | EDIT2FF        | EDIT2IFF
Regression Type  | Non-linear SVM | Non-linear SVM | Non-linear SVM | Non-linear SVM
# of clusters    | {3,5,7}        | {3,5,7}        | {6,7,8}        | {6,7,8}
Fuzziness degree | [1.3, 2.3]     | [1.2, 2.0]     | [1.27, 2.01]   | [1.23, 1.89]
Alpha-cut        | {0.1}          | {0, 0.1}       | [0, 0.1]       | [0, 0.1]
Creg             | {2}            | {2, 16}        | [20, 60]       | {1.26, 5.25, 10.8, 103}
Epsilon          | {0.05}         | {0.01, 0.2}    | [0.18, 0.47]   | {0.03, 0.09, 0.23}
Other Interval Type-2 Fuzzy Systems Based on Fuzzy Rule Bases – DIT2FRB

The same cross-validation sub-samples are used to train and validate the DIT2FRB models. The very same testing datasets of the cross validation are used to obtain the R2 values as follows:

Table 6.27 R2 values obtained from the application of the Earlier Type-2 Fuzzy Rule Base – DIT2FRB Approach on Training-Validation-Testing Datasets of Auto-Mileage Dataset
R2 (Stdev) | DIT2FRB
Train-R2   | 0.927 (0.01)
Valid-R2   | 0.851 (0.02)
Test-R2    | 0.846 (0.04)
Parameters | Opt c* = {5,6,7}, Fuzziness degree ≈ [1.1, 1.7]

*The optimum model is indicated in bold.
The optimum models from each group are combined to compare the best models in one graph, as shown in Figure 6.10.
(Bar values in Figure 6.10: ANFIS 0.767, GFS 0.84, DENFIS 0.845, NN 0.841, SVM 0.848, DIT2FRB 0.846, T1IFF 0.863, EDIT2IFF 0.863.)
Fig. 6.10 Average Cross Validation Test R2 values of Optimum Models – Auto Mileage Dataset. Standard errors of five repetitions are shown on the curve.
The parameters of the best models of the optimum methodology, EDIT2IFF, are retained in collection tables, i.e., m-Col*, τ-Col*, Φ-Col*, which include the parameters shown in Appendix D.3. There is one collection table set, i.e., 〈m-Col*, τ-Col*, Φ-Col*〉, for each cross-validation iteration, i.e., five different collection table sets are obtained from the five different training and validation datasets in order to do inference on the five different testing datasets. m-Col* indicates the collection table constructed from one cross-validation model; it holds the optimum level of fuzziness, m*, of each optimum embedded model for each training vector. Hence the m-Col* table is an (n×1) matrix. The τ-Col* collection table holds the interim fuzzy function structures, i.e., the optimum transformations of the fuzzy functions and their parameters, ŵk,i, k=1,…,n, of the optimum embedded IFC model
identified individually for each training data vector; each cluster uses the same structure in the IFC algorithm. Thus, if n indicates the number of training vectors, then τ-Col* is an (n×1) matrix of function parameters. The Φ-Col* collection table holds the local fuzzy function structures, i.e., the optimum transformations of the fuzzy functions and their parameters, Ŵi,k, k=1,…,n, of the optimum embedded model identified individually for each training data vector in each cluster. Thus, if n indicates the number of training vectors, then Φ-Col* is an (n×c*) matrix. On the other hand, the parameters of the optimum models of the second best methodology, T1IFF, are much more compact than those of EDIT2IFF, since T1IFF does not require collection tables: optimum model parameters are identified for each cluster, and every data point in each cluster uses these parameters for the inference. The optimum parameters are listed in Appendix D.3. Based on the results, the two best models are Type-1 Improved Fuzzy Functions (T1IFF) and Evolutionary Type-2 Improved Fuzzy Functions (EDIT2IFF). The R2 differences between these two best models and the best optimum benchmark models, DIT2FRB and SVM, are ~1.8%. Even though these two best methods have higher R2 values than the rest of the models, it is not clear how significant this improvement is. In this work, the two-sample left-tailed t-test with a 95 percent confidence level is used to indicate the significance of the results. After identifying the best models, we wanted to measure how much improvement these optimum models can provide compared to the rest of the benchmark models.

Table 6.28 Two-sample left-tailed t-test results (p<0.05) for Auto-Mileage Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R).
          | ANFIS   | DENFIS  | NN      | GFS     | SVM      | DIT2FRB
T1FF      | FR 1    | R 0.04  | FR 0.06 | FR 0.24 | R 0      | FR 0.06
T1IFF*    | FR 1    | FR 0.17 | FR 0.28 | FR 0.46 | R** 0.04 | FR 0.16
DIT2FF    | FR 1    | R 0.02  | R 0.02  | FR 0.21 | R 0      | FR 0.05
DIT2IFF   | FR 1    | FR 0.05 | FR 0.05 | FR 0.31 | R 0.01   | FR 0.09
ET1FF     | FR 0.99 | R 0.01  | R 0.01  | FR 0.1  | R 0.01   | R 0.01
ET1IFF    | FR 0.98 | R 0.02  | R 0.02  | R 0.03  | R 0.01   | R 0.03
EDIT2FF   | FR 1    | FR 0.06 | FR 0.05 | FR 0.21 | R 0.02   | R 0.05
EDIT2IFF* | FR 0.99 | FR 0.29 | FR 0.42 | FR 0.43 | FR 0.28  | FR 0.33

* Indicate the best methods based on the performance measure. Shaded cells indicate the best methods in the rows that are significantly better than the benchmark methods on the intersecting column. ** Indicate the best methods in the rows that are not significantly better than the benchmark methods on the intersecting column.
250
6 Experiments
Thus, the t-test, which is usually called the test of statistical significance, is used in the experiments to show the strengths of the algorithms. The t-test results for the Auto-Mileage dataset are shown in Table 6.28. We defined that increasing the R2 value by at least 0.025 points (2.5 percent) indicates an improvement in performance, and we built our hypothesis based on this assumption. The null hypothesis for the Auto-Mileage dataset experiments is given as follows:
$$H_0:\ \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{j,cv}\right) - \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{k,cv}\right) > 2.5\%$$
The H0 indicates that the difference between the average R2 values, obtained from the five cross-validation models, of the two methodologies j (row) and k (column) is greater than 2.5% (0.025 points in R2 value). Each cell entry should be read as: "The methodology in the row is (not) significantly better than the methodology in the column." Failing to reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level, i.e., that methodology j is significantly better than methodology k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different. For instance, as can be seen from Table 6.28, the optimum models of the proposed EDIT2IFF method are significantly different from the benchmark methods. However, T1IFF is not significantly better than the SVM models, although it is better than every other benchmark method, i.e., the methods in columns 1, 2, 3, 4 and 6. It should be pointed out from the RMSE, R2, and MAPE graphs in Figure 6.11 that there is a correlation between them for the individual optimum models (values are listed in Appendix Tables D.10 – D.18). Thus, we further analyzed whether there is a relationship between these performance measures and, if there is, we wanted to measure the degree of such a relationship. Therefore, the correlation coefficient between the values of each performance measure is calculated once more for each training, validation and testing dataset of each sample dataset for this experiment.
Fig. 6.11 Auto-Mileage Dataset Cross Validation Test Performance measures of Optimum Models using R2, MAPE, and RMSE measures. Standard errors of five repetitions are shown on the curve.
(Plotted series in Figure 6.12: RMSE versus MAPE, RMSE versus R2, and R2 versus MAPE.)
Fig. 6.12 Correlation between Performance Measures of Auto-Mileage Dataset Training, Validation and Testing Datasets for five cross validation models. cviXXX represent the cross validation models, TR: training, VR: validation, TE: testing datasets.
It can be observed that, for every model, almost all performance measures for this dataset are highly correlated. In particular, the average correlation between 'RMSE and MAPE' is 84.9%, between 'RMSE and R2' is 99%, and between 'R2 and MAPE' is 84%. Therefore, we will only use one of these performance measures for the rest of the regression experiments.
6.4.3 Desulphurization Process Dataset

6.4.3.1 Hot Metal Pretreatment and Desulphurization Process

Desulphurization is a sub-process of the Hot Metal Pre-treatment process, which takes place in the primary metal production industry. The desulphurization dataset of this work was supplied by DOFASCO, Canada. Hot metal produced in blast furnaces contains impurities like phosphorus, silicon, sulphur, carbon, and so on. The process of removing these impurities is called the refining process. The refining process consists of hot metal pre-treatment conducted in a torpedo car, a decarburization process in a converter, and various kinds of secondary refining processes corresponding to the requirements of the final product. The hot metal pre-treatment sequential process is shown in Figure 6.13. The major part of the pre-treatment is assigned to the desulphurization process.
Fig. 6.13 Process flow of Hot Metal Pretreatment
In Figure 6.13, CaO represents the calcium oxide (soda ash) added to the hot metal for dephosphorization, and LD-OTB is an LD converter with inert gas blowing from the bottom for cooling. Dofasco uses a Torpedo Car Desulphurization facility to remove sulphur from the batches of molten metal, as shown in Figure 6.14.

Fig. 6.14 A Blast Furnace at a Hot Metal Pre-treatment plant
The aim is to produce a final steel product with a sulphur content less than or equal to the maximum sulphur specification for the desired grade of product. A schematic of the facility is illustrated in Figure 6.15. A torpedo car desulphurization facility removes sulphur from the hot metal leaving the blast furnaces before it is sent to the next process. Generally, desulphurization is carried out by injecting two different powdered reagents directly into the hot metal via a lance. The reagents react with the sulphur in the hot metal and cause a residue,
Fig. 6.15 Torpedo Car Desulphurization Process
which is rich in sulphur, to separate from the iron. Examples of typical reagents include calcium carbide (CaC2), magnesium and lime. The addition of the reactive material creates a sulphur-rich slag layer that can be physically separated from the molten metal, which then contains less sulphur. The aim of the data-mining project is to build a decision support system to determine the right amounts of the reagents to be added into the hot metal. In reality, the target amount of sulphur (i.e., the aim sulphur) is often set much lower than the true sulphur value in the desulphurization process. The argument underlying the modeling exercise is that a reduction in reagent consumption would be possible if a more precise and reliable model could be developed to estimate the right amount of reagents used in the desulphurization process. The aim sulphur is the required quantity demanded by the customers of the steel plant. An empirical modeling strategy is required to understand the mechanism of the chemical and mechanical effects. One of the key issues of this modeling case study is that, when a model with poor predictive capability is used, many batches of hot metal have to enter the desulphurization process again. It should be noted that desulphurization is a highly expensive process; therefore, the main objective in this modeling approach is to minimize the number of desulphurization passes by increasing the prediction ability of the model.

6.4.3.2 The Dataset

The desulphurization reactions in the torpedo car, as illustrated in Figure 6.15, are largely affected not only by the amounts of the reagents, but also by operating conditions such as the process time, the temperature of the hot metal, the initial condition of the impurities, slag conditions, injection conditions of the reagents, and so on. The supplied empirical data includes these parameters, shown in Table 6.29, as independent variables in the determination of two different kinds of reagents. The desulphurization dataset consisted of approximately 13,000 data vectors with 27 attributes composed of binary, ordinal, scalar, and a few categorical variables. Each vector represents the measurements taken from one batch of hot metal. Several variable selection methods are applied to choose the optimum parameters. Around 3000 data vectors in the dataset contained negative Reagent1 and Reagent2 values. Based on the information obtained from the domain experts at Dofasco, these vectors are considered faulty inputs and should be discarded from the dataset. Therefore, the number of vectors in the core dataset is reduced to 10,000 data vectors for this study. We also applied statistical methods to clean the dataset from possible noise. In order to apply noise cleansing based on statistical methods, the probability distributions are drawn and the vectors that are outside a certain confidence value, e.g., xk > (9 × standard deviation), are discarded. These values are determined by the experts. The number of input vectors, after all of the above outlier treatments, dropped to 9675 observations. There are two different types of variables in the dataset: the scalars, which take on continuous values, and the discrete variables, which take on binary or ordinal values. The proposed methods use regression estimation functions; therefore, only the scalar continuous variables are used.
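The noise-cleansing steps described above (dropping negative reagent vectors and 9-standard-deviation outliers) can be sketched as follows. The DataFrame and its column names are hypothetical; this is an illustration of the described procedure, not the project's actual code.

```python
import pandas as pd

def clean_desulph(df: pd.DataFrame, k: float = 9.0) -> pd.DataFrame:
    """Drop faulty and extreme-outlier batches as described in the text."""
    # Discard batches with negative reagent measurements (faulty inputs).
    df = df[(df['Reagent1'] >= 0) & (df['Reagent2'] >= 0)]
    # Discard vectors lying outside k standard deviations on any scalar column.
    scalars = df.select_dtypes('number')
    within = (scalars - scalars.mean()).abs() <= k * scalars.std()
    return df[within.all(axis=1)]
```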
Table 6.29 Desulphurization Dataset variables

Scalar Variables | Description
Start-Sulphur    | Starting level of sulphur before desulphurization
KGS              | Weight of the batch that consists of iron (tons)
TEMP             | Temperature of the hot metal as it leaves the Blast Furnace
FB               | Measure of fullness of the furnace
Aim-Sulphur      | The amount of sulphur that is targeted to remain after desulphurization
End-Sulphur      | The amount of sulphur remaining within the metal after desulphurization
Compound 1 to 5  | The chemicals measured in the hot metal as they arrive at the desulphurization process

Binary and Ordinal Variables | Description
Car-Type         | Specific style of vessel that is used to hold the hot metal
POS              | Specific station at which the desulphurization takes place
Practice 1 to 6  | Indicate that a certain type of modification to the normal operating practice has been applied
Injection Number | Number of times the hot metal goes through the desulphurization process
Equipment Type   | Equipment style used for the corresponding batch
In addition, the proposed method implements clustering methods, which depend on distance measures between points represented in scalar dimensions. Using discrete values during function estimation tasks has always been a challenge. One way of incorporating discrete-valued input variables into system models is to use these types of variables as partitioning indicators, as in decision trees. We can also use these variables as scalar values if they have a sufficient number of discrete values to be considered continuous variables.
Fig. 6.16 Distribution of Reagent1 versus Reagent2 variables of Desulphurization Dataset
For the discrete-valued variables, kernel-density estimation analysis is applied. The aim was to analyze whether the behavior of the reagents can be explained by these discrete-valued variables. Since the two reagents are highly correlated, as shown in Figure 6.16, we only used one of the reagents for the kernel density analysis. We separated the reagent values based on the discrete values of each categorical variable and drew the kernel density graphs to observe whether the distribution of the reagent differs across the discrete values of the same categorical variable. Statistical density estimation involves approximating a hypothesized probability density function from observed data. The kernel density function of the SAS statistical software is used to draw the distributions of the reagent values for each discrete value of the parameters. For instance, 'aim-sulphur' is a categorical parameter that has only four different discrete values: {0.0038, 0.006, 0.0095, 0.014}. The kernel density graphs of the four discrete values are shown in Figure 6.17. It should be noted from the density graphs of the different aim-sulphur levels that different values of aim sulphur explain different parts of the reagent variable; therefore, this variable should be included in the analysis. The kernel density graphs of the rest of the discrete variables are given in Appendix D.4. Based on these graphs, we conclude that none of the categorical variables, except aim-sulphur, can successfully discriminate the reagent values; therefore, they are not used in the analysis. The rest of the scalar variables are used in building the system models.
Fig. 6.17 Kernel density estimation of Reagent 1 for four grades of Aim-sulphur (discrete variable)
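The kernel-density comparison performed with SAS can be mimicked with scipy's Gaussian kernel density estimator. The reagent samples below are synthetic stand-ins grouped by the four aim-sulphur grades; only the analysis pattern follows the text.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
grades = [0.0038, 0.006, 0.0095, 0.014]
# Hypothetical reagent amounts for each aim-sulphur grade (not the plant data).
samples = {g: rng.normal(loc=300 + 5000 * g, scale=40, size=500) for g in grades}

xs = np.linspace(200, 500, 300)
for g, reagent in samples.items():
    density = gaussian_kde(reagent)      # Gaussian kernel density estimate
    plt.plot(xs, density(xs), label=f'aim-sulphur = {g}')
plt.xlabel('Reagent1 amount')
plt.ylabel('Estimated density')
plt.legend()
plt.show()
```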
The dataset has two output variables, Reagent1 and Reagent2; therefore, we build separate models for each output. The dataset is randomly divided into 1675 observations for training and validation purposes, and 8000 observations are used for testing purposes. Furthermore, the first part, the 1675-observation data, is randomly divided into two parts: 250 training vectors to apply the training algorithm and 750 samples for the validation dataset to validate the model and find the optimum model parameters. The 8000-observation testing vectors are used to test the optimum model performance. Experiments were repeated with five random subsets of the training and validation datasets of the above sizes. The R2 values are used to measure the performance of the models by averaging the error rates of the testing samples across the five repetitions for each model.
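The repeated random sub-sampling protocol described above can be sketched in a few lines; the sizes follow the text (a fixed 8000-vector held-out testing part, with 250 training and 750 validation vectors redrawn from the remaining pool on each of the five repetitions).

```python
import numpy as np

def desulph_splits(n_total=9675, n_test=8000, n_train=250, n_valid=750,
                   repeats=5, seed=0):
    """Fixed held-out test part; training/validation resampled each repeat."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    test_idx, pool = idx[:n_test], idx[n_test:]      # 8000 test, 1675 pool
    for _ in range(repeats):
        sub = rng.permutation(pool)
        yield sub[:n_train], sub[n_train:n_train + n_valid], test_idx

for train_idx, valid_idx, test_idx in desulph_splits():
    pass  # fit on train, tune on valid, average the test R^2 over five repeats
```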
6.4.3.3 Results of Desulphurization Dataset

The results from applying the benchmark and proposed methodologies on the Desulphurization dataset for the two different outputs are analyzed based on R2 values obtained from the testing datasets. The results are displayed in Appendix Tables D.19 to D.21 for the Reagent1 dataset and Appendix Tables D.22 to D.24 for the Reagent2 dataset.

Type-1 Fuzzy Functions

The proposed Type-1 Fuzzy Functions models include T1FF, T1IFF, ET1FF and ET1IFF. Sub-sampling cross validation is iterated five times and the R2 results are averaged over the five models. The R2 values for the Reagent1 and Reagent2 outputs are shown in Table 6.30.

Table 6.30 R2 values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Reagent1 and Reagent2 Datasets of Desulphurization Process
R2 (Stdev) | T1FF          | T1IFF         | ET1FF         | ET1IFF

Reagent 1
Train-R2   | 0.698 (0.031) | 0.705 (0.034) | 0.698 (0.031) | 0.699 (0.030)
Valid-R2   | 0.691 (0.035) | 0.690 (0.036) | 0.691 (0.035) | 0.691 (0.035)
Test-R2    | 0.789 (0.011) | 0.790 (0.011) | 0.792 (0.010) | 0.794 (0.011)

Reagent 2
Train-R2   | 0.726 (0.029) | 0.734 (0.024) | 0.727 (0.031) | 0.73 (0.024)
Valid-R2   | 0.71 (0.017)  | 0.712 (0.016) | 0.712 (0.019) | 0.704 (0.02)
Test-R2    | 0.762 (0.079) | 0.769 (0.039) | 0.776 (0.048) | 0.805 (0.007)

*The optimum models are indicated in bold.
The numbers in each cell represent the average R2 values on the training, validation and testing datasets from the cross-validation experiments. The values in parentheses indicate the standard deviation of R2 over five iterations; this measure indicates the robustness of the models. The optimum parameters of the Type-1 Fuzzy Functions and Improved Fuzzy Functions models obtained from the cross-validation experiments are shown in Table 6.31.

Other Type-1 Fuzzy System Models and Statistical Learning Methods

Type-1 Fuzzy System models based on rule bases, namely ANFIS and DENFIS, which are two well-known hybrid fuzzy systems, are applied to the same training, validation, and testing datasets.
Table 6.31 Optimum Parameters of variations of Type-1 Fuzzy Functions Approaches

Reagent1
Parameter        | T1FF            | T1IFF           | ET1FF         | ET1IFF
Regression Type  | Linear SVM      | Linear SVM      | Linear SVM    | Linear SVM
# of clusters    | 9               | 9               | {5,6,8}       | {5,6,8}
Fuzziness degree | {1.3, 1.8, 2.0} | {1.5, 1.8, 2.0} | [1.27, 2.16]  | [1.29, 2.12]
Creg             | {2,16,64,128}   | {2,16,64}       | [8.68, 57.91] | [8.59, 62.6]
Epsilon          | {0.05}          | {0.05}          | [0.04, 0.1]   | [0.04, 0.09]

Reagent2
Parameter        | T1FF            | T1IFF           | ET1FF         | ET1IFF
Regression Type  | Linear SVM      | Linear SVM      | Linear SVM    | Linear SVM
# of clusters    | 9               | 9               | 9             | 9
Fuzziness degree | {1.5, 1.8, 2.3} | {1.5, 1.8, 2.3} | [1.37, 2.03]  | [1.3, 1.9]
Creg             | {16,64,128}     | {64,128}        | [9, 91]       | [2.5, 25]
Epsilon          | {0.4}           | {0.4}           | [0.3, 0.5]    | [0.04, 0.08]
An Artificial Neural Network, a soft computing method, is applied to the same datasets. Additionally, SVM models are also presented to indicate non-linear and linear regression models. Sub-sampling cross validation is iterated five times and the R2 results are averaged over these five models. The results of these methods for the Dofasco dataset are summarized as follows:

Table 6.32 R2 values obtained from the application of Benchmark Methods on Training-Validation-Testing Datasets for Reagent1 and Reagent2 of Desulphurization Process and their optimum model parameters
R2 (Stdev)      | ANFIS         | DENFIS            | NN                        | GFS               | SVM

Reagent1
Train-R2        | 0.872 (0.034) | 0.810 (0.039)     | 0.749 (0.035)             | 0.816 (0.04)      | 0.697 (0.030)
Valid-R2        | 0.502 (0.087) | 0.594 (0.058)     | 0.666 (0.031)             | 0.596 (0.04)      | 0.690 (0.036)
Test-R2         | 0.591 (0.051) | 0.686 (0.029)     | 0.767 (0.010)             | 0.678 (0.07)      | 0.789 (0.011)
Avg. parameters | c*={4,5,6,7}  | DThr*={0.05,0.08} | Total hidden neurons = 50 | Rules = {5,6}     | Creg={2,16,128}, Epsilon={0.05,0.1}, Linear SVM

Reagent2
Train-R2        | 0.89 (0.034)  | 0.823 (0.043)     | 0.774 (0.012)             | 0.579 (0.07)      | 0.724 (0.029)
Valid-R2        | 0.543 (0.107) | 0.634 (0.024)     | 0.699 (0.021)             | 0.544 (0.08)      | 0.708 (0.018)
Test-R2         | 0.624 (0.07)  | 0.686 (0.005)     | 0.774 (0.014)             | 0.728 (0.06)      | 0.776 (0.024)
Avg. parameters | c*={4,5,6}    | DThr*={0.05,0.1}  | Total hidden neurons = 50 | Rules = {5,6}     | Creg={128}, Epsilon={0.4}, Linear SVM
Type-2 Fuzzy Functions

The proposed uncertainty modeling methods are applied to the Desulphurization dataset to identify the parameter and fuzzy function uncertainties.

Table 6.33 R2 values obtained from the application of variations of Type-2 Fuzzy Functions Approach on Training-Validation-Testing Datasets for Reagent1 and Reagent2 Datasets of Desulphurization Process
R2 (Stdev) | DIT2FF        | DIT2IFF       | EDIT2FF       | EDIT2IFF

Reagent1
Train-R2   | 0.755 (0.025) | 0.785 (0.038) | 0.732 (0.054) | 0.710 (0.044)
Valid-R2   | 0.745 (0.022) | 0.744 (0.027) | 0.730 (0.033) | 0.705 (0.025)
Test-R2    | 0.785 (0.015) | 0.785 (0.015) | 0.773 (0.035) | 0.805 (0.005)

Reagent2
Train-R2   | 0.777 (0.024) | 0.795 (0.02)  | 0.75 (0.039)  | 0.762 (0.05)
Valid-R2   | 0.759 (0.028) | 0.764 (0.021) | 0.741 (0.031) | 0.735 (0.029)
Test-R2    | 0.739 (0.079) | 0.765 (0.033) | 0.733 (0.122) | 0.807 (0.005)
The optimum parameters of the Type-2 FF and Type-2 IFF models obtained from cross-validation experiments are shown in Table 6.34.

Table 6.34 Optimum Parameters of variations of Type-2 Fuzzy Functions Approach

Reagent1
Parameter        | DIT2FF          | DIT2IFF         | EDIT2FF      | EDIT2IFF
Regression Type  | Linear SVM      | Linear SVM      | Linear SVM   | Linear SVM
# of clusters    | 9               | 9               | {6,8}        | {6,8}
Fuzziness degree | {1.3, 1.8, 2.0} | {1.5, 1.8, 2.0} | [1.2, 2.6]   | [1.2, 2.6]
Creg             | {2,16,100}      | {2,64}          | [2.6, 7.9]   | [2.9, 7.6]
Epsilon          | {0.05, 0.1}     | {0.05}          | [0.06, 0.41] | [0.06, 0.42]

Reagent2
Parameter        | DIT2FF          | DIT2IFF         | EDIT2FF      | EDIT2IFF
Regression Type  | Linear SVM      | Linear SVM      | Linear SVM   | Linear SVM
# of clusters    | 9               | 9               | {7,8,9}      | {7,8}
Fuzziness degree | {1.5, 2.3}      | {1.5, 1.8, 2.0} | [1.2, 2.6]   | [1.2, 2.6]
Creg             | {16,64,128}     | {64,128}        | [3, 126]     | [12, 89]
Epsilon          | {0.4}           | {0.2, 0.4}      | [0.4]        | [0.02, 0.04]
Other Interval Type-2 Fuzzy Systems Based on Fuzzy Rule Bases – DIT2FRB

The same cross-validation sub-samples are used to train and validate the DIT2FRB models. The very same testing datasets of the cross validation are used to obtain the R2 values as follows:
Table 6.35 R2 values obtained from the application of the Earlier Type-2 Fuzzy Rule Base – DIT2FRB Approach on Training-Validation-Testing Datasets of Desulphurization Process

R2 (Stdev)         | Reagent1 (DIT2FRB)      | Reagent2 (DIT2FRB)
Train-R2           | 0.800 (0.024)           | 0.827 (0.04)
Valid-R2           | 0.645 (0.026)           | 0.674 (0.032)
Test-R2            | 0.745 (0.008)           | 0.751 (0.01)
Optimum Parameters | c*={3}, m=[1.2,2.6]     | c*={6}, m=[1.2,2.6]
The R2 values of the optimum proposed and benchmark models are summarized in Figure 6.18 and Figure 6.19.

(Bar values in Figure 6.18: ANFIS 0.59, GFS 0.68, DENFIS 0.69, NN 0.77, DIT2FRB 0.75, SVM 0.79, ET1IFF* 0.80, EDIT2IFF* 0.81.)

Fig. 6.18 Average cross validation R2 values of optimum models of Desulphurization Reagent 1 Dataset
(Bar values in Figure 6.19: ANFIS 0.62, GFS 0.73, DENFIS 0.69, NN 0.77, DIT2FRB 0.75, SVM 0.78, ET1IFF* 0.81, EDIT2IFF* 0.81.)

Fig. 6.19 Average cross validation R2 values of optimum models of Desulphurization Reagent 2 Dataset
Based on the results, the best two models are Evolutionary Type-1 Improved Fuzzy Functions (ET1IFF) and Evolutionary Type-2 Improved Fuzzy Functions (EDIT2IFF). For the Reagent1 dataset, the R2 difference between the best optimum model, EDIT2IFF, and the best benchmark model, SVM, is 2%, and for Reagent2 it is 3%. It is notable that the testing performance is higher than the training and validation performances for most of the proposed methodologies. This sounds promising, but such behavior is rare in the regression literature, which made us investigate it further. It appears that in real datasets the normality assumption may not apply; in fact, the dataset usually shows a skewed behavior. Thus, the total sum of squares, SST, usually gets larger as we increase the number of data points in the testing dataset. In addition, this rare behavior is a result of the interpolation ability of the proposed methodologies; the same behavior can also be observed for SVM. When we analyze the behavior of the SST in relation to the sum of squared errors, SSE, we observe that, when the number of data points in the testing vectors is increased, the increase in the SSE is not proportional to the increase in the SST. In fact, the SST grows much more than the SSE, which indicates a larger R2 value. Even though the R2 values of the proposed approaches are the highest among the methods, to strengthen our findings we further analyzed the significance of the improvements. The t-test results of the Reagent1 and Reagent2 models are as follows:

Table 6.36 Two-sample left-tailed t-test results (p<0.05) for Reagent 1 Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R).
          | ANFIS | DENFIS  | NN      | GFS     | SVM     | DIT2FRB
T1FF      | FR 1  | FR 1    | FR 0.36 | FR 1    | R 0     | FR 0.99
T1IFF     | FR 1  | FR 1    | FR 0.39 | FR 1    | R 0     | FR 1
DIT2FF    | FR 1  | FR 1    | FR 0.24 | FR 0.99 | R 0     | FR 0.97
DIT2IFF   | FR 1  | FR 1    | FR 0.23 | FR 0.99 | R 0     | FR 0.97
ET1FF     | FR 1  | FR 1    | FR 0.57 | FR 1    | R 0     | FR 1
ET1IFF*   | FR 1  | FR 1    | FR 0.64 | FR 1    | R** 0   | FR 1
EDIT2FF   | FR 1  | FR 0.99 | FR 0.16 | FR 0.99 | R 0.02  | FR 0.61
EDIT2IFF* | FR 1  | FR 1    | FR 0.99 | FR 1    | FR 0.07 | FR 1
* Indicate the best methods based on the performance measure. Shaded cells indicate the best methods in the rows that are significantly better than the benchmark methods on the intersecting column. ** Indicate the best methods in the rows that are not significantly better than the benchmark methods on the intersecting column.
The null hypothesis of t-tests for Reagent 1 experiments indicates that the performances of two paired algorithms are significantly different. The null hypothesis for Reagent 1 models is as follows:
$$H_0:\ \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{j,cv}\right) - \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{k,cv}\right) > 2.5\%$$
The H0 indicates that the difference between the average R2 values, obtained from the five cross-validation models, of methodology j (row) and methodology k (column) is greater than 2.5% (0.025 points in R2 value). Each cell entry should be read as: "The methodology in the row is (not) significantly better than the methodology in the column." Failing to reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level, i.e., that methodology j is significantly better than algorithm k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different. For instance, as can be seen from the last row of Table 6.36, the optimum models of the proposed EDIT2IFF method are significantly different from the benchmark methods, since the null hypothesis is not rejected.

Table 6.37 Two-sample left-tailed t-test results (p<0.05) for Reagent 2 Dataset. FR: Fail to Reject the Null Hypothesis, R: Reject the Null Hypothesis. The numbers below each decision indicate the probability of observing the decision (FR/R).
          | ANFIS   | DENFIS  | NN      | GFS     | SVM     | DIT2FRB
T1FF      | FR 1    | FR 0.8  | R 0.05  | FR 0.97 | R 0.04  | FR 0.16
T1IFF     | FR 1    | FR 0.97 | R 0     | FR 0.99 | R 0     | R 0.04
DIT2FF    | FR 0.98 | FR 0.6  | R 0.02  | FR 0.97 | R 0.02  | FR 0.06
DIT2IFF   | FR 1    | FR 0.97 | R 0     | FR 0.99 | R 0     | R 0.02
ET1FF     | FR 1    | FR 0.97 | R 0.01  | FR 0.99 | R 0.01  | FR 0.14
ET1IFF*   | FR 1    | FR 1    | FR 0.06 | FR 1    | FR 0.08 | FR 0.94
EDIT2FF   | FR 0.96 | FR 0.52 | FR 0.06 | FR 0.92 | FR 0.06 | FR 0.13
EDIT2IFF* | FR 1    | FR 1    | FR 0.07 | FR 1    | FR 0.11 | FR 0.97
* Indicate the best methods based on the performance measure. Shaded cells indicate the best methods in the rows that are significantly better than the benchmark methods on the intersecting column.
The null hypothesis of t-tests for Reagent 2 experiments indicates that the performances of two paired algorithms are significantly different. The null hypothesis for Reagent 2 models is as follows:
$$H_0:\ \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{j,cv}\right) - \left(\frac{1}{5}\sum_{cv=1}^{5} R^2_{k,cv}\right) > 5\%$$
The H0 indicates that the difference between the average R2 values, obtained from the five cross-validation models, of methodology j (row) and methodology k (column) is greater than 5% (0.05 points in R2 value). Failing to reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level, i.e., that methodology j is significantly better than algorithm k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different. For instance, as can be seen from Table 6.37, the optimum models of the proposed ET1IFF and EDIT2IFF methods are significantly different from the benchmark methods, since the null hypothesis is not rejected in comparison to each benchmark method in columns 1-6. The parameters of the best models of the optimum methodology, EDIT2IFF, are retained in collection tables, i.e., m-Col*, τ-Col*, Φ-Col*. There is one collection table set, i.e., 〈m-Col*, τ-Col*, Φ-Col*〉, for each cross-validation iteration, i.e., five different collection table sets are obtained from each of the five different training and validation datasets of the Reagent1 and Reagent2 models, in order to do inference on the five different testing datasets. The structures of the collection tables, which hold the optimum parameters and fuzzy function structures of the optimum models of the Desulphurization Dataset identified by the EDIT2IFF and ET1IFF methodologies for the Reagent1 and Reagent2 models, are similar to those of the Friedman's Artificial Dataset and the Auto-Mileage Dataset. In Appendix D.2 and D.3 we show samples of the optimum parameters and the interim and local fuzzy function structures and parameters for these models.
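Structurally, the collection tables amount to per-training-vector parameter stores with the shapes described above. The following is a minimal sketch of those shapes, with illustrative sizes and object placeholders for the function parameters (not the authors' data structures):

```python
import numpy as np

n, c_star = 250, 6                 # training vectors and clusters (illustrative)

# m-Col*: optimum degree of fuzziness of the embedded model chosen per vector.
m_col = np.empty((n, 1))
# tau-Col*: interim (IFC) fuzzy function parameters, one row per training vector.
tau_col = np.empty((n, 1), dtype=object)
# Phi-Col*: local fuzzy function parameters per training vector and per cluster.
phi_col = np.empty((n, c_star), dtype=object)

# During inference, row k supplies one embedded type-1 model: fuzziness
# m_col[k], interim function tau_col[k, 0], and cluster functions phi_col[k, :].
```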
6.4.4 Stock Price Analysis

Stock price estimations have a number of specific properties that require special consideration. In [Hellstrom and Holmstrom, 1998] stock price analysis is characterized as follows: "…Prediction of stocks is generally believed to be a difficult task. The most common view point, especially among academics, is that the task of predicting stocks is comparable to that of inventing a perpetual machine or solving problems like the quadrature of the circle." A profitable prediction is a better prediction even if it has less overall accuracy. A very good example of this theory is presented at the beginning of this work: in [Deboeck, 1992], a neural network that correctly predicted the next-day direction 85% of the time consistently lost money. Therefore, in stock price estimation problems, when selecting the best algorithms or ranking the algorithms, one should not determine the optimality of the model based only on error measures such as MAPE or RMSE, but also analyze stock-price-domain-related measures such as the profitability of the algorithm within the given time period. Therefore, on top of the well-known performance measures, a new criterion is introduced here to measure and compare the performances of models for stock price prediction, namely the Robust Simulated Trading Benchmark, to be discussed below. This new measure combines three different properties in one measure, namely the direction, accuracy and robustness of the predicted outcome.
6.4.4.1 Stock Price Prediction Datasets

Predicting stock prices is one of the most challenging application areas in the financial sector. One of the reasons for this is that many of the factors that affect financial markets, such as general economic conditions, a trader's expectations, and political events, are independent. Nevertheless, several studies [Ince and Trafalis, 2007; Cao et al., 2005; Leigh et al., 2005] examine the relationships between stock price movements and technical indicators. These indicators, such as moving averages and volume spikes, are concerned with the dynamics of market price and volume behaviors and are used to estimate future stock prices. The results of these studies show that financial analysts today are using more than 100 different technical indicators [Murphy, 1999] to get an insight into stock price trends. Some of these estimate price fluctuations using MAPE or RMSE for performance evaluation, and some of them deal with price increases or decreases and use the Hit Rate (HR) as the evaluation criterion. In [Ince and Trafalis, 2007] some of the new financial indicators are used to show that stock prices can be approximated with non-linear machine learning models, viz. support vector machine methods. In this work, some of these well-known technical indicators, like the moving average and exponential moving average, as well as some of the newer financial indicators of [Murphy, 1999], are used to build models for stock prices using the proposed Fuzzy Functions approaches on five different historical stock price series extracted from Yahoo Finance. The datasets are converted into a multi-input single-output data mining problem, where the input variables are summary values of the stock prices. Among the 100 different financial indicators [Murphy, 1999; www.stockcharts.com], some model market fluctuations and some focus on when to make buy or sell decisions. The difficulty with the technical indicators is deciding which indicators are crucial in determining market movements. An example of the dependency of two financial indicators is shown in Figure 6.20.
(Panels in Figure 6.20: Closing Stock Price versus Exponential Moving Average; Closing Stock Price versus Bollinger Band.)
Fig. 6.20 Two financial indicators of a selected Stock’s Price versus Closing Stock Price
In the literature, usually three or four indicators are used to measure market trends. The author is not a financial domain expert; hence, the aim of this study is to measure the performance of the proposed methodologies. Therefore, in this work financial indicators similar to those of [Ince and Trafalis, 2007] are used as input variables, specifically the exponential moving average, the relative strength index, the Bollinger band widths, the moving average convergence/divergence and the Chaikin money flow. These indicators [Murphy, 1999] are summarized in Appendix D.5.
For each of the financial indicators explained in the Appendix D.5., we used the short, mid and long-term measures corresponding to N={10,20,50} days. Hence, the list of parameters to build a decision support system using the proposed approaches is shown in Table 6.38. In the first stage of stock price analysis, the most important indicators that can explain the next day closing values of stock prices are identified. It should be pointed out that the independent variables are constructed based on the closing day prices of earlier periods. It is expected that there would be a correlation between the input variables and the closing values. In order to reduce the number of input variables, initially a simple stepwise regression is applied. Then we applied correlation analysis to select the variables that can reduce RMSE the most and that are highly correlated with the output. From the results of the simple variable selection analysis, it is assumed that the current stock price depends on:
$$S_t = f\big(EMA10_{t-1},\, EMA20_{t-1},\, EMA50_{t-1},\, BB20_{t-1},\, MACD\_12\_26_{t-1},\, SMA10_{t-1},\, SMA50_{t-1},\, PCMA20_{t-1},\, PCMA50_{t-1},\, SR50_{t-1}\big)$$

Table 6.38 List of variables used in Stock Price Estimation Models

Variable Code | Description
EMA10         | Exponential Moving Average Short Term – 10 days
EMA20         | Exponential Moving Average Middle Term – 20 days
EMA50         | Exponential Moving Average Long Term – 50 days
BB20          | Bollinger Band – Middle Term
RSI           | Relative Strength Index
MACD_12_26    | Moving Average Convergence Divergence between 12 and 26 periods
CMF           | Chaikin Money Flow
SMA10         | Simple Moving Average – Short Term – 10 days
SMA20         | Simple Moving Average – Middle Term – 20 days
SMA50         | Simple Moving Average – Long Term – 50 days
PCMA10        | Present Change of Moving Average Short Term = SMA10(t) − SMA10(t−1)
PCMA20        | Present Change of Moving Average Middle Term = SMA20(t) − SMA20(t−1)
PCMA50        | Present Change of Moving Average Long Term = SMA50(t) − SMA50(t−1)
SR10          | Separation Ratio Short Term = (SMA10 − Close Value)
SR20          | Separation Ratio Middle Term = (SMA20 − Close Value)
SR50          | Separation Ratio Long Term = (SMA50 − Close Value)
Close Value   | Daily closing stock price
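Most of the listed indicators are simple rolling computations over the closing price series. The following is a minimal pandas sketch; the Bollinger band uses the conventional 20-day, 2-standard-deviation definition and the MACD the usual 12/26-period EMAs, since the text does not spell out the exact formulas.

```python
import pandas as pd

def indicators(close: pd.Series) -> pd.DataFrame:
    out = pd.DataFrame({'Close': close})
    for n in (10, 20, 50):
        out[f'SMA{n}'] = close.rolling(n).mean()       # simple moving average
        out[f'EMA{n}'] = close.ewm(span=n).mean()      # exponential moving average
        out[f'PCMA{n}'] = out[f'SMA{n}'].diff()        # present change of SMA
        out[f'SR{n}'] = out[f'SMA{n}'] - close         # separation ratio
    # Middle-term Bollinger band: SMA20 +/- 2 rolling standard deviations
    # (the conventional definition; the book does not give the exact formula).
    std20 = close.rolling(20).std()
    out['BB20_upper'] = out['SMA20'] + 2 * std20
    out['BB20_lower'] = out['SMA20'] - 2 * std20
    # MACD between the 12- and 26-period EMAs.
    out['MACD_12_26'] = close.ewm(span=12).mean() - close.ewm(span=26).mean()
    return out
```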
In order to measure the performance of the proposed algorithms on stock price analysis, we used five different stocks from Canadian stock markets: TD CANADA TRUST (TD), BANK OF MONTREAL (BMO), SUNLIFE (SLF), ENBRIDGE (ENB), and LOBLAWS (LB). The proposed fuzzy functions approaches as well as the benchmark models are used to build models for each of these stocks. In these estimation models, approximately 1.5 years of prior stock price data are used to build the models and capture the optimum one. The last 100 days' (almost 5 months) stock prices are used to test the optimum model. It should be emphasized that, for an unbiased performance measure (especially to be able to use the proposed RSTB measure), these 100 days of stock prices are not used for training or validation purposes; they are allocated for testing the optimum model performances for each cross-validation model. The distribution of each stock price dataset is illustrated in Figures 6.21 – 6.25.
Fig. 6.21 TD Canada Trust – 400 days historical price (23 Nov 2005 – 8 June 2007). First 300 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model.
Fig. 6.22 Loblaws – 444 days historical price (Nov. 14, 2005 – Aug. 21, 2007). First 344 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model.
Fig. 6.23 Bank of Montreal – 400 days historical price (Nov 11, 2005 – Aug 20, 2007). First 200 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model.
Fig. 6.24 Enbridge – 400 days historical price (Nov 14, 2005 – Aug 21, 2007). First 200 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model.
Fig. 6.25 Sun Life – 400 days historical price (Nov. 11, 2005 – Aug 20, 2007). First 345 days are used for training and validation to optimize the model parameters, and the last 100 days are used for testing the performance of the optimum model.
Stock prices collected over around 20-22 months are divided into two parts. Data from approximately the first 15-17 months are used to train the models and to optimize the model parameters. The last 5 months are held out for testing the model performances. We randomly separated 200 samples for training from the first part, 140 samples for validation of the optimum model parameters, again from the first part, and 100 samples to test the performance of the models from the held-out part, which has not been used for training or validation purposes. Experiments were repeated with 5 random subsets of the above sizes. Model performances using MAPE and the proposed RSTB, to be presented below, are measured on the hold-out dataset of the last five months and averaged over the five repetitions. The information on the periods and the number of training, validation and testing samples of each of the five different stock datasets is given in Table 6.39.

Table 6.39 Descriptions of Stock Datasets that are used in the experiments

Properties       | TD                       | BMO                        | SUNLIFE                    | ENBRIDGE                   | LOBLAWS
# of data points | 389                      | 445                        | 445                        | 445                        | 445
Periods          | 23 Nov 2005 – 8 June 2007 | Nov. 11, 2005 – Aug. 20, 2007 | Nov. 11, 2005 – Aug. 20, 2007 | Nov. 14, 2005 – Aug. 21, 2007 | Nov. 14, 2005 – Aug. 21, 2007
Training Size    | 120                      | 200                        | 200                        | 200                        | 200
Validation Size  | 90                       | 144                        | 144                        | 144                        | 145
Testing Size     | 100                      | 100                        | 100                        | 100                        | 100
6.4.4.2 Evaluation Criteria for Stock Price Prediction Models

In this work, on top of the well-known error measures, a new performance criterion, the Robust Simulated Trading Benchmark (RSTB), is introduced to measure the profitability of the predicted models on the testing data. In stock price models, the profit measure is usually compared to the well-known profit measure of the Buy-and-Hold strategy [Hellstrom and Holmstrom, 1998]. Therefore, in this work, we used the Buy-and-Hold model as an additional benchmark for stock price estimation problems.

Buy-and-Hold Strategy

A passive investment strategy in which an investor buys stocks and holds them for a period, regardless of market fluctuations. An investor who employs a buy-and-hold strategy actively selects stocks but, once in a position, is not concerned with short-term price movements and technical indicators. Let the period under study be represented by t to t+n. The optimum model should make more profit than the Buy-and-Hold strategy at the end of day t+n for those stocks having an upward trend. In reality, sometimes the market trend can be downward, which means that the stock price has decreased from day t to day t+n. In this case, the optimum model should be profitable or at least lose less money than the Buy-and-Hold strategy. Therefore, in the benchmark analysis, we compare the RSTB of each model with the Buy-and-Hold return.
Let $y_t$ represent the closing price at time t and $y_{t+n}$ the closing price at the end of the period (t+n). The Buy-and-Hold return, $r_{BH}$, for a time period {t, …, t+n} of a stock is defined as:

$$r_{BH} = \frac{y_{t+n} - y_t}{y_t} \qquad (6.4)$$
The $r_{BH}$ indicates the profit/loss when buying at the start and selling at the end of the period.

Simulated Trading Benchmark (STB)

The main principle of measuring the success of a prediction algorithm is its ability to produce profit when applied to real trading. The most commonly used measure is the mean profit [Hellstrom and Holmstrom, 1998]. The mean profit for a trading-rule-oriented prediction algorithm is computed by simply looping over the time interval in question and applying the trading rule to the computed prediction function, e.g., the proposed fuzzy systems, support vector regression, etc. The mean profit per trade for time-series prediction models is computed by:
$$\text{Mean Profit} = \frac{1}{n}\sum_{t=1}^{n} \operatorname{sign}\big(\hat{y}(t) - y(t-1)\big)\big(y(t) - y(t-1)\big) \qquad (6.5)$$
In (6.5), a trade is assumed at each step, in the direction of the predicted change. The mean profit per trade of [Hellstrom and Holmstrom, 1998] is not very realistic because, based on this formula, it is assumed that the trader has an unlimited amount of money. For instance, on day 1, if the predicted output is greater than the previous day's closing price, then the mean profit suggests buying the stock. If the model predicts the day-2 closing price to be higher than day 1's actual closing price, then the model still suggests buying. The algorithm does not verify whether the investor has any money left to invest. In reality, the investor usually has a limited amount of money to invest. Thus, we propose a benchmark measure by modifying the mean profit per trade to simulate actual investor behavior. The new profit measure, STB, is a simulated transformation of the mean profit per trade benchmark presented in [Hellstrom and Holmstrom, 1998]. The new benchmark, STB, is calculated as follows. In order to investigate the relationship between the accuracy and the profit of a system, a simple strategy is studied first. This strategy simply suggests that traders buy whenever the predicted price climbs above the previous day's closing price, and sell when it drops below. The buying transaction occurs only if the investor has some cash, and the selling transaction occurs only if the investor has stocks. The 'decision' to sell, buy or hold (do nothing) for the calculation of the profit is made according to the closing price and the model's estimated stock price. Thus, the decision on any day t is made according to the following rules:

If Actual(t-1) < Predicted Closing Price(t) AND Cash(t-1) > 0 → Buy
If Actual(t-1) > Predicted Closing Price(t) AND Cash(t-1) > 0 → Hold (Do Nothing)
If Actual(t-1) < Predicted Closing Price(t) AND Cash(t-1) = 0 → Hold (Do Nothing)
If Actual(t-1) > Predicted Closing Price(t) AND Cash(t-1) = 0 → Sell
In this strategy, the following assumptions are made (a sketch of this simulated trading loop is given after the list):

• Each trader starts with $100 cash.
• The decision to sell, buy or do nothing (hold) is based on the decision rule base above.
• The trader either has cash or stocks. This means that when the STB suggests selling, the trader/investor sells all the stocks in his/her portfolio, and when the STB suggests buying, the trader buys stocks using all the money in his/her portfolio.
• The estimated profit is calculated by multiplying the number of stocks (#STK) by the closing value when there are stocks in the portfolio; otherwise it is equal to the cash ($) in hand at the end of the day.
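To make these mechanics concrete, the following minimal sketch simulates the STB loop described above. This is illustrative Python rather than the authors' MATLAB code: the function name simulate_stb is ours, and the convention that trades execute at the previous day's closing price is an assumption, chosen because it reproduces the figures of Tables 6.40 and 6.42.

```python
def simulate_stb(closing, predicted, start_cash=100.0):
    """STB loop: closing[0..n] are actual closes; predicted[t] is the
    model's estimate for day t (predicted[0] is unused)."""
    cash, stocks = start_cash, 0.0
    for t in range(1, len(closing)):
        if predicted[t] > closing[t - 1] and cash > 0:      # Buy signal + cash
            stocks, cash = cash / closing[t - 1], 0.0       # assumed trade price
        elif predicted[t] < closing[t - 1] and stocks > 0:  # Sell signal + stocks
            cash, stocks = stocks * closing[t - 1], 0.0
        # otherwise: Hold (do nothing)
    return stocks * closing[-1] + cash - start_cash         # profit at t+n

closing = [100.1, 100, 102, 105, 104, 103.5, 102.75, 101.9, 102, 101.2, 104.5]
model1 = [None, 100.7, 102.5, 104.1, 104.5, 103.75,
          102.89, 102.57, 102.09, 102.5, 103.5]
print(round(simulate_stb(closing, model1), 2))  # ~7.57, cf. the rounded $7.53 below
```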
An example buy-sell-hold strategy is shown in Table 6.40. The profit at the end of the period is calculated as follows:

Number of stocks in the portfolio at time t: #STK_t
The $ amount in the portfolio at time t: Cash$_t
Calculated Profit at t+n = [(#STK_{t+n} × Closing price_{t+n}) + Cash$_{t+n}] − Cash$_t

Table 6.40 STB values obtained from two different hypothetical models of artificial stock data. Both hypothetical models start trading with $100.

                 Hypothetical Model 1                    Hypothetical Model 2
Closing   Predicted  Decision   #STK*  CASH$     Predicted  Decision   #STK   CASH$**
Price
100.1     -          -          -      -         -          -          -      -
100       100.7      Buy        0.99   0.0       102.03     Buy        0.99   0.00
102       102.5      Buy→Hold   0.99   0.0       105.1      Buy→Hold   0.99   0.00
105       104.1      Buy→Hold   0.99   0.0       110.22     Buy→Hold   0.99   0.00
104       104.5      Sell       0.00   104.9     105.87     Buy→Hold   0.99   0.00
103.5     103.75     Sell→Hold  0.00   104.9     105.01     Buy→Hold   0.99   0.00
102.75    102.89     Sell→Hold  0.00   104.9     104.75     Buy→Hold   0.99   0.00
101.9     102.57     Sell→Hold  0.00   104.9     99.45      Sell       0.00   102.64
102       102.09     Buy        1.03   0.0       97.54      Sell→Hold  0.00   102.64
101.2     102.5      Buy→Hold   1.03   0.0       100.2      Sell→Hold  0.00   102.64
104.5     103.5      Buy→Hold   1.03   0.0       100.5      Sell→Hold  0.00   102.64

Model 1 Calculated Profit = [(1.03 × 104.5) + 0] − 100 = $7.53
Model 2 Calculated Profit = [(0 × 104.5) + 102.64] − 100 = $2.64

* #STK: the number of stocks in hand at the end of the day.
** CASH$: the amount of money ($) the trader holds at the end of the day.
The decision to buy, sell or hold at the end of the day depends on the model's calculated price. If the model estimates a higher value than the closing price of the previous day, and if there is some cash in the portfolio, then the model suggests buying; if there is no cash left in the portfolio, then nothing happens. If the prediction is less than the previous day's closing price (the model predicts that the stock price will decrease) and if there are stocks left in the portfolio, then the model suggests selling; otherwise nothing happens. For instance, in the 'Decision' columns of Table 6.40, 'Buy→Hold' means that the algorithm estimates a higher stock price than the previous day's stock price; although the algorithm suggests buying, the decision is kept as 'Hold' because there is no cash in the portfolio to buy additional stocks. The STB measure calculates the profit based on decisions that are more realistic.

Calculating the STB based simply on the direction of the prediction may not be the optimum performance strategy, for the following reasons. If one analyzes the errors of different algorithms estimating the value of the same stock, one may realize that not all functions estimate the stock price equally close to the actual stock price. One should therefore measure the error of each model to evaluate its performance. The error, ê(t), is measured by taking the absolute difference between the estimated stock price, ŷ(t), and the actual closing price of the previous day, y(t−1):

ê(t) = | y(t−1) − ŷ(t) |

The average daily price change of each stock is calculated by

∇y(t) = (1/n) Σ_{t}^{t+n} | y(t−1) − y(t) |

We refer to ∇y(t) as the confidence level for deciding the fitness of the prediction. If the measured error is within the confidence level, ê(t) ≤ ∇y(t), for more than 50% of the estimation period, then the model is considered robust for this stock. If the error is above ∇y(t) 50% or more of the time, we might as well use random guessing (50% probability); in this case, the new benchmark ignores the model entirely. The main reason for measuring model accuracy is that, with this benchmark, we are trying to capture whether the prediction methodology is capable of effectively predicting the actual stock price in the first place. For this reason, a new Robust Simulated Trading Benchmark (RSTB) is presented to evaluate stock prices.

The new RSTB can be explained using the same example presented in Table 6.40. The errors of each model can be seen in Table 6.41 and Figure 6.26. One can observe from the RSTB calculations in Table 6.41 that some of the decisions are different from those in Table 6.40. When the ê(t) of a model on day t is greater than the confidence level ∇y(t), then this model's prediction is not considered an accurate estimation and is ignored in the calculations for that particular day t. In such cases, the RSTB calculation procedure ignores the model's prediction and suggests doing nothing (hold). It is a smart decision to ignore an estimation that is not accurate in the first place, even if it predicts the direction of the stock movement correctly. The confidence level, ∇y(t), of the stock prices in this work is measured from the learning and optimization datasets, and it is found that ∇y(t) = y(t−1)·1%. For instance, the first-day predictions in Table 6.41 (row 1, t=1) indicate that Hypothetical Model 1's prediction is within the confidence level, ê(t) ≤ ∇y(t) = y(t−1)·1%:

|100.10 − 100.70| = 0.60 ≤ 100.10·1% = 1.00

whereas Hypothetical Model 2's first-day error is greater than 1% of the actual stock price:

|100.10 − 102.03| = 1.93 > 100.10·1% = 1.00
Table 6.41 RSTB prediction errors of two hypothetical models of the artificial stock dataset. Both hypothetical models start trading with $100. The closing price of day 0 is 100.1.

Hypothetical Model 1 predictions:

t    Y        Ŷ        Error   Previous Decision  New Decision
1    100      100.7    0.60    Buy                Buy
2    102      102.5    2.50*   Buy→Hold           Do Nothing→Hold
3    105      104.1    2.10*   Buy→Hold           Do Nothing→Hold
4    104      104.5    0.50    Sell               Sell
5    103.5    103.75   0.25    Sell→Hold          Sell→Hold
6    102.75   102.89   0.61    Sell→Hold          Sell→Hold
7    101.9    102.57   0.18    Sell→Hold          Sell→Hold
8    102      102.09   0.19    Buy                Buy
9    101.2    102.5    0.50    Buy→Hold           Buy→Hold
10   104.5    103.5    2.30*   Buy→Hold           Do Nothing→Hold

Hypothetical Model 2 predictions:

t    Y        Ŷ        Error   Previous Decision  New Decision
1    100      102.03   1.93*   Buy                Do Nothing→Hold
2    102      105.1    5.10*   Hold               Do Nothing→Hold
3    105      110.22   8.22*   Hold               Do Nothing→Hold
4    104      105.87   0.87    Hold               Buy
5    103.5    105.01   1.01*   Hold               Do Nothing→Hold
6    102.75   104.75   1.25*   Hold               Do Nothing→Hold
7    101.9    99.45    3.30*   Sell               Do Nothing→Hold
8    102      97.54    4.36*   Hold               Do Nothing→Hold
9    101.2    100.2    1.80*   Hold               Do Nothing→Hold
10   104.5    100.5    0.70    Hold               Sell

Note: These hypothetical models are the same models as shown in Table 6.40, but this time RSTB is used to make the decisions.
* Prediction errors greater than the confidence level (shown as bold-framed cells in the original).
Therefore, the robust STB, RSTB, ignores the prediction of Hypothetical Model 2 for day 1 and chooses to do nothing instead of buying the stock. Again, this is due to the faulty estimation of that day's closing price. This example is depicted in Figure 6.26.

Fig. 6.26 Actual and predicted stock prices of an artificial stock price dataset for a 10-day period. The predictions of Hypothetical Model 1 and Hypothetical Model 2 are shown together with the actual prices. Model 2 is not a good model of the actual stock prices.
The RSTB decision on any day t is made according to the following rules:

If Actual_{t-1} < Predicted Closing Price_t AND Cash_{t-1} > 0 AND ê(t) ≤ 1%·Actual_{t-1} → Buy
If Actual_{t-1} < Predicted Closing Price_t AND Cash_{t-1} > 0 AND ê(t) > 1%·Actual_{t-1} → Hold
If Actual_{t-1} > Predicted Closing Price_t AND Cash_{t-1} > 0 → Hold
If Actual_{t-1} < Predicted Closing Price_t AND Cash_{t-1} = 0 → Hold
If Actual_{t-1} > Predicted Closing Price_t AND Cash_{t-1} = 0 AND ê(t) ≤ 1%·Actual_{t-1} → Sell
If Actual_{t-1} > Predicted Closing Price_t AND Cash_{t-1} = 0 AND ê(t) > 1%·Actual_{t-1} → Hold
It should be noted from Table 6.41 that, within the given 10-day period, Hypothetical Model 2 estimates the stock price within the confidence interval only on days 4 and 10 (see the Error column of Hypothetical Model 2 in Table 6.41). For this particular example, this simply means that Model 2 is accurate for only 20% of the estimation period. Therefore, we introduce one last rule for RSTB:

if (1/n) Σ_{t=1}^{n} 1[ ê(t) > ∇y(t) ] > 0.5, then the model is ignored,

where 1[·] is the indicator function and, e.g., ∇y(t) = y(t−1)·1%.
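A sketch of the full RSTB procedure, combining the decision rules with this reliability rule, is given below. As with the STB sketch earlier, this is illustrative Python; simulate_rstb is our name, and the assumption that trades execute at the previous day's close reproduces the $107.57 value of Table 6.42.

```python
def simulate_rstb(closing, predicted, start_cash=100.0, conf=0.01):
    """RSTB = STB plus the accuracy filter: a day's signal is acted on only
    if the error is within the confidence level; a model that is inaccurate
    on more than half of the days is ignored entirely."""
    cash, stocks, inaccurate = start_cash, 0.0, 0
    n = len(closing) - 1
    for t in range(1, len(closing)):
        error = abs(closing[t - 1] - predicted[t])      # e^(t)
        if error > conf * closing[t - 1]:               # above confidence level
            inaccurate += 1                             # -> Hold (do nothing)
            continue
        if predicted[t] > closing[t - 1] and cash > 0:
            stocks, cash = cash / closing[t - 1], 0.0   # accurate Buy
        elif predicted[t] < closing[t - 1] and stocks > 0:
            cash, stocks = stocks * closing[t - 1], 0.0  # accurate Sell
    if inaccurate / n > 0.5:
        return None                                     # model is not reliable
    return stocks * closing[-1] + cash                  # end-of-period RSTB ($)
```

Applied to the two hypothetical models above, the sketch returns about 107.57 for Model 1 and None (unreliable) for Model 2.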
Since the percentage of days of the given period on which Hypothetical Model 2 estimates the stock prices accurately is 20%, which is less than 50%, RSTB ignores this model entirely; the model is considered unreliable due to the above conditions. The profits based on the RSTB performance measure are shown in Table 6.42. Even though at the end of the 10 days Hypothetical Model 2 estimates a loss of $100 − $96.14 = $3.86 compared to the $100 cash in the portfolio on day 1, the RSTB ignores this unreliable model. On the other hand, the estimation of Hypothetical Model 1 is within the acceptable error range, i.e., the 1% confidence limit, 70% of the time (its error is above the acceptable limit only on days 2, 3, and 10; see Table 6.41). The RSTB for Hypothetical Model 1 is measured as $107.57, which indicates a profit at the end of day 10.

Table 6.42 Predictions obtained from two hypothetical models of the artificial stock data and the RSTB values. Both hypothetical models start trading with $100.

                               Hypothetical Model 1        Hypothetical Model 2
Closing   Model 1   Model 2    New                         New
Price     Pred.     Pred.      Decision  #STK   CASH$      Decision  #STK   CASH$
100.10    -         -          -         0.00   100.00     -         0.00   100.00
100.00    100.70    102.03     Buy       0.99   0.00       Hold      0.00   100.00
102.00    102.50    105.10     Hold      0.99   0.00       Hold      0.00   100.00
105.00    104.10    110.22     Hold      0.99   0.00       Hold      0.00   100.00
104.00    104.50    105.87     Sell      0.00   103.95     Buy       0.95   0.00
103.50    103.75    105.01     Hold      0.00   104.90     Hold      0.95   0.00
102.75    102.89    104.75     Hold      0.00   104.90     Hold      0.95   0.00
101.90    102.57    99.45      Hold      0.00   104.90     Hold      0.95   0.00
102.00    102.09    97.54      Buy       1.03   0.00       Hold      0.95   0.00
101.20    102.50    100.20     Hold      1.03   0.00       Hold      0.95   0.00
104.50    103.50    100.50     Hold      1.03   0.00       Sell      0.00   96.14

Model 1: RSTB = (1.03 × 104.5) + 0 = $107.57; Calculated Profit = $7.57
Model 2: RSTB = (0 × 104.5) + 96.14 = $96.14, NOT RELIABLE; Calculated Profit = −$3.86
6.4.4.3 Results on Stock Price Datasets

The three-way cross validation method is used to find the optimum models of each algorithm. In searching for the optimum model, each algorithm measures the performance of the models based on the error measure; in our analysis, we used MAPE to measure the performance during structure identification. The testing data is used to measure the performance of the optimum model. We used MAPE and the proposed Robust Simulated Trading Benchmark (RSTB) measure to test the optimum model performance. It should be pointed out that the RSTB measure can only be used for time-series data, where continuity of the data is required. In our experimental analysis, the training and validation datasets are random samples, with no continuity in them. The learning algorithms are based on the error at each observation and do not require continuity, so we use these random samples for training and validation purposes. In these stock analyses, the testing datasets represent a continuum of the selected period; therefore, one can only measure the proposed RSTB on testing data to find the overall profit of the optimum model.

The five benchmark datasets, TD Canada Trust (TD), Bank of Montreal (BMO), Sun Life (SLF), Enbridge (ENB), and Loblaws (LB), are used to apply the benchmark methodologies and the proposed fuzzy functions algorithms and to evaluate their performance. Each dataset is randomly divided into three parts as described in Table 6.39. For each different algorithm, we used the same partitions for learning and testing purposes.

Table 6.43 Average performance measures (MAPE on testing datasets) of models of five real stock prices. The standard deviations over the five different testing datasets are shown in parentheses.

Method     TD           BMO          SLF          ENB          LB
NN         1.03 (0.49)  0.92 (0.03)  0.90 (0.05)  1.21 (0.33)  1.21 (0.35)
ANFIS      1.82 (1.69)  2.51 (0.85)  3.59 (0.77)  2.24 (0.64)  3.86 (1.62)
DENFIS     1.42 (0.29)  0.94 (0.15)  0.95 (0.07)  1.19 (0.05)  1.23 (0.27)
SVM        0.24 (0.08)  0.86 (0.01)  0.82 (0.01)  1.07 (0.02)  0.84 (0.03)
GFS        1.30 (0.62)  1.75 (0.52)  2.09 (0.48)  1.41 (0.09)  1.62 (0.21)
DIT2FRB    0.45 (0.26)  0.87 (0.03)  0.86 (0.06)  1.21 (0.08)  1.09 (0.23)
T1FF       0.19 (0.02)  0.88 (0.04)  0.83 (0.03)  1.13 (0.04)  0.88 (0.08)
T1IFF      0.20 (0.03)  0.88 (0.04)  0.81 (0.01)  1.12 (0.06)  0.89 (0.06)
DIT2FF     0.23 (0.04)  0.93 (0.04)  0.86 (0.02)  1.11 (0.04)  1.00 (0.05)
DIT2IFF    0.24 (0.03)  0.85 (0.02)  0.87 (0.05)  1.05 (0.03)  0.89 (0.03)
ET1FF      0.20 (0.02)  0.88 (0.04)  0.82 (0.01)  1.10 (0.05)  0.94 (0.05)
ET1IFF     0.21 (0.04)  0.88 (0.02)  0.87 (0.12)  1.07 (0.03)  0.89 (0.07)
EDIT2FF    0.21 (0.02)  0.86 (0.02)  0.83 (0.03)  1.05 (0.03)  1.00 (0.37)
EDIT2IFF   0.22 (0.03)  0.87 (0.02)  0.82 (0.02)  1.09 (0.03)  0.90 (0.07)

• The MAPE values shown in bold in the original indicate the best method, which yields the optimum model for the dataset in the corresponding column.
For each stock dataset, five different optimum models are obtained based on the sub-sampling cross validation method explained at the beginning of this chapter. The MAPE and RSTB values are used to measure model performance by averaging the error rates and profit measures of the testing samples across the five repetitions. The average daily change, ∇y(t) = (1/n) Σ_{t}^{t+n} | y(t−1) − y(t) |, of each of these five stocks is around 1% of y(t−1), which means the stocks do not have high daily jumps and falls; therefore, instead of the average daily change, we used ∇y(t) = y(t−1)·0.01 in the analysis.

The RSTB measures are based on a $100 investment at the beginning of each testing dataset: it is assumed that the investor has $100 on day 1 of the testing cases. Based on the next-day prediction of each optimum model, the buy, hold, or sell decision is made as in Table 6.42. Each optimum model's profit/loss is calculated using the RSTB performance measure. The MAPE and RSTB values obtained from each algorithm for each of the five stock datasets are displayed in Table 6.43 and Table 6.44. The cross validation MAPE for training, validation and testing, as well as the RSTB values for testing, are displayed in Appendix Tables D.25 to D.45.

Table 6.44 Average performance measures (RSTB on testing datasets) of models of five real stock prices. The standard deviations over the five different testing datasets are shown in parentheses.

Method        TD             BMO            SLF            ENB            LB
Buy_and_Hold  102.28         92.91          95.56          92.37          100.43
NN            100.33 (0.85)  88.17 (3.59)   88.62 (2.09)   88.52 (2.39)   N/A*
ANFIS         106.30 (0.60)  N/A*           N/A*           N/A*           92.28 (2.78)
DENFIS        N/A*           86.82 (1.76)   87.18 (3.42)   87.34 (4.86)   92.09 (4.03)
GFS           100.40 (3.38)  88.22 (N/A)**  N/A*           N/A*           N/A*
SVM           109.54 (2.99)  85.26 (1.76)   90.29 (1.52)   90.06 (3.04)   92.06 (1.97)
DIT2FRB       102.04 (1.20)  86.26 (2.43)   90.28 (3.88)   97.21 (N/A)**  96.29 (2.16)
T1FF          111.33 (1.88)  86.04 (3.17)   89.85 (2.26)   85.82 (2.79)   90.77 (4.98)
T1IFF         110.61 (2.71)  85.72 (3.09)   88.57 (1.47)   84.30 (2.59)   93.80 (4.31)
DIT2FF        113.12 (2.15)  94.97 (2.17)   95.95 (2.00)   86.21 (6.44)   94.53 (2.08)
DIT2IFF       109.62 (4.53)  85.99 (1.59)   87.26 (3.90)   88.47 (4.52)   93.91 (4.57)
ET1FF         112.72 (2.00)  86.38 (2.66)   96.79 (1.70)   83.32 (4.35)   92.34 (3.47)
ET1IFF        111.18 (4.45)  88.07 (2.87)   87.73 (4.00)   89.53 (2.74)   92.84 (3.79)
EDIT2FF       111.93 (2.13)  85.69 (1.17)   88.76 (2.08)   87.33 (4.44)   91.58 (1.94)
EDIT2IFF      115.44 (0.69)  88.75 (1.04)   91.75 (1.67)   96.11 (2.33)   102.35 (1.19)

* All five cross validation models of these methods predicted $0 profit; therefore, these models are excluded.
** Only one of the cross validation models predicted a profit other than zero, so the standard deviation could not be calculated; this method is also excluded due to weak estimation power.
• The values shown in bold in the original indicate the optimum method for the stock price model of the dataset in the corresponding column.
It should be pointed out from the RSTB cross validation results in Appendix Tables D.25, D.30, D.34, D.38, and D.42 that some cross validation models have $0 profit (non-reliable models), which is due to the RSTB calculations explained above. Hence, while calculating the average cross validation RSTB results of a method, we excluded cross validation models with $0 profit and calculated the averages using the remaining cross validation RSTB values, unless the method predicted $0 profit in all of the cross validation experiments; in that case, the result of such a method is excluded entirely and replaced with N/A, as shown in the RSTB results in Table 6.44.

It should be noted from the MAPE values in Table 6.43 that most of the results are less than 1%; it is hard to identify the best methodology from this table based on MAPE values. This may indicate that MAPE, or any other error measure based on the deviation of the estimated output from the actual output, may not be a dependable performance measure most of the time [Hellstrom and Holmstrom, 1998]. On the other hand, the proposed RSTB, which is based on three different properties, namely the direction, accuracy and robustness of the predicted outcome, yields comparable results between the methodologies. It is evident from Table 6.45 that the proposed EDIT2IFF methodology outperforms the rest of the models on 3 out of 5 stock price datasets based on the profitability measure.

One of the noticeable results of the fuzzy functions methods on the stock price datasets is the type of the fuzzy functions used, as shown in the 'Best Proposed Method - Fuzzy Function type used' row of Table 6.45. Each fuzzy functions approach utilized the linear regression (LSE) method; SVM regression methods did not reveal good RSTB performances. The evolutionary extensions of the fuzzy functions approaches also identified the LSE method as the optimum function approximation method for constructing the fuzzy functions.

Table 6.45 Summary of performances of models on five different real stocks based on the profitability values of the best models of the proposed, benchmark and Buy-and-Hold methods.
                                    TD            BMO          SLF         ENB           LB
Best Proposed Method -              EDIT2IFF-LSE  DIT2FF-LSE   ET1FF-LSE   EDIT2IFF-LSE  EDIT2IFF-LSE
Fuzzy Function type used
Best Benchmark Model                SVM           NN           SVM         NN            DIT2FRB
RSTB of Best Proposed Models        $115.44       $94.97       $96.79      $96.11        $102.35
RSTB of Best Benchmark Models       $109.54       $88.17       $90.29      $88.52        $96.29
Buy&Hold Profit                     $102.28       $92.91       $95.56      $92.37        $100.43
Profit improvement of Proposed      $5.90         $6.80        $6.50       $7.59         $6.06
Method against Benchmark
Profit improvement of Proposed      $13.16        $2.06        $1.23       $3.74         $1.92
Method against Buy&Hold

• The profitability is in terms of Canadian $ based on a $100 investment.
The optimum models of the stock price datasets are identified by three different proposed methodologies, i.e., EDIT2IFF, DIT2FF and ET1FF. The common characteristic of these models is that all of them use linear functions to identify the parameters of the interim and local fuzzy functions. Each of them has a different structure, displayed in Appendix D.5. Any of these three methodologies would result in better estimations than the standard benchmark methodologies used in this work.
In this section, we investigate the impact of the broker commission, i.e., brokerage, on the stock trading profit. A broker commission is a fee charged by a broker or agent for his/her service in facilitating a transaction, such as the buying or selling of stocks. Broker commissions used to be so expensive that only a few could afford a broker and have access to the stock market. Internet technology brought an explosion of discount brokers that let one trade for a smaller fee but do not provide personalized advice; because of discount brokers, nearly anybody can now afford to invest in the market. Today, many discount brokers offer a flat-rate charge per transaction; for instance, E*TRADE Canada (www.etrade.ca) has a plan that allows one to trade unlimited shares for a flat fee of $6.99 per transaction. Here we investigate how a flat-rate brokerage would affect the profit. From the results of the RSTB analysis on the five real stock price datasets of this work, we observe that on average there are approximately 25 transactions, i.e., buy or sell activities, in a trading period of 100 days. We used the $6.99 flat rate to calculate the cost and analyze the impact of the trading cost on profit. In order to show the impact of broker commissions, we investigated the details of the RSTB results for TD Canada Trust's stock; Table 6.46 shows the broker commission investigation on the results from the different models of the TD Bank stock dataset.

Table 6.46 The effects of broker commission on the average profit obtained from models of TD Bank stock prices, for different investment amounts.
                                           $25,000          $50,000          $100,000           $500,000
Method    Avg.    Avg. # of   Avg. trans.  Profit  Profit   Profit  Profit   Profit   Profit    Profit   Profit
          profit  trans.      cost ($) =           − cost           − cost            − cost             − cost
          (%)     (5 CV runs) 6.99 × #tr.
ANFIS     6.26    12.20       85.4         1,566   1,480    3,132   3,046    6,263    6,178     31,317   31,231
SVM       9.54    8.50        59.5         2,386   2,327    4,772   4,713    9,544    9,485     47,720   47,661
DIT2FRB   2.38    9.30        65.1         595     529      1,189   1,124    2,378    2,313     11,890   11,825
T1FF      11.3    14.10       98.7         2,832   2,733    5,664   5,565    11,328   11,229    56,640   56,541
T1IFF     10.6    13.50       94.5         2,652   2,557    5,303   5,209    10,606   10,512    53,030   52,936
DIT2FF    13.1    16.70       116.9        3,281   3,164    6,561   6,444    13,122   13,005    65,610   65,493
DIT2IFF   9.62    24.00       168.0        2,405   2,237    4,810   4,642    9,620    9,452     48,100   47,932
ET1FF     12.7    25.00       175.0        3,180   3,005    6,359   6,184    12,718   12,543    63,590   63,415
ET1IFF    11.2    19.10       133.7        2,796   2,662    5,592   5,458    11,184   11,050    55,920   55,786
EDIT2FF   11.9    15.40       107.8        2,982   2,874    5,963   5,855    11,926   11,818    59,630   59,522
EDIT2IFF  15.4    16.80       117.6        3,860   3,742    7,719   7,601    15,438   15,320    77,190   77,072

For each investment amount, the first column is the average profit ($) before transaction costs and the second is the average profit ($) after subtracting the average transaction cost. Each prediction method (row) returns the average profit percentage shown in the second column.
The profits of $25K, $50K, $100K and $500K investments in the TD Canada Trust stock are shown both when trading costs are excluded from and when they are included in the profit. For EDIT2IFF (the best-yielding model), the ratio of the profit when trading costs are included to the profit when trading costs are excluded is 0.96 (3,742/3,860) for the $25K investment, while it is 0.98, 0.99, and 0.998 for the $50K, $100K, and $500K investments, respectively. It can be concluded that, as the amount of money invested increases, the impact of broker commissions on the profit decreases, as shown in Figure 6.27. Based on these analyses, if an investor invests a sizeable amount of money, the trading costs become negligible. Therefore, we ignore trading costs in the RSTB stock performance analysis.

Fig. 6.27 The effect of broker commission on profit for different investment amounts: the ratio of profit with broker commissions to profit without broker commissions, plotted for each method.
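This effect can be reproduced in a few lines. The sketch below uses the rounded Table 6.46 values for EDIT2IFF, so the printed ratios differ slightly from the table's exact figures:

```python
FLAT_FEE = 6.99  # flat commission per transaction, as quoted above

def profit_ratio(profit_pct, n_transactions, investment):
    """Ratio of net profit (after flat-rate commissions) to gross profit."""
    gross = investment * profit_pct / 100.0
    return (gross - FLAT_FEE * n_transactions) / gross

for amount in (25_000, 50_000, 100_000, 500_000):
    print(amount, round(profit_ratio(15.4, 16.8, amount), 3))
# 25000 0.969, 50000 0.985, 100000 0.992, 500000 0.998: the cost becomes negligible
```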
6.4.5 Proposed Fuzzy Cluster Validity Index Analysis for Regression

The improved fuzzy functions approaches implement the proposed Improved Fuzzy Clustering (IFC) algorithm. One of the parameters of the IFC method that should be identified prior to model execution is the number of clusters. Based on the number of clusters, the structure identification procedure determines different fuzzy functions for each cluster. It is crucial to identify the correct number of clusters, because each cluster indicates a different local model that might be hidden in the dataset. When the number of clusters is set below the actual number of clusters, some of the hidden models will be missed, which may reduce the prediction performance. The performance of the models is also affected when the number of clusters is set above the correct value; in such cases, there would be more local models than there should be, which may cause over-fitting, so that the models may not generalize well, i.e., they may not reveal good performances on testing datasets.

In chapter 3, a new cluster validity measure is proposed, and it is shown that this measure can identify the range of the number of clusters by narrowing down the candidate values. Hence, in the experiments, we used an exhaustive search to identify the optimum number of clusters of the improved fuzzy functions methods. Since we now know the optimum number of clusters of the regression datasets when improved fuzzy functions are applied, we can validate the performance of the proposed fuzzy cluster validity measure, cviIFC, based on these results. Here the validity of the proposed cluster validity index will be justified using two of the experimental results above.

cviIFC Analysis for Friedman's Artificial Dataset

To identify the range of the optimum number of clusters of the improved fuzzy functions approaches, the proposed validity measure, cviIFC, is initially iterated for two different values of m, i.e., m=1.3, to represent more crisp models, and m=2.0, to represent more fuzzy models, for changing values of the number of clusters, c=2,…,13. We obtained the cviIFC values and averaged them over 5 repetitions. Figure 6.28 depicts the cviIFC values obtained from each experiment as a function of the number of clusters for the two values of m. The cvi-graph indicates that beyond c=7 there is not much variation in the cviIFC values. For m=2.0, there is a visible decrease in the cviIFC value around c=11; for m=1.3, when c is greater than 8 or 9 clusters, the change in cviIFC is very small. So we can conclude from the cviIFC graph in Figure 6.28 that the optimum c* for the Friedman's Artificial Dataset experiments should be between 7 and 11.

Fig. 6.28 Proposed cviIFC cluster validity values of the proposed IFC results for Friedman's Artificial Dataset for two different fuzziness values, m = {1.3, 2.0}.
From the results of the exhaustive search, the optimum numbers of clusters of the improved fuzzy functions models of the Friedman Artificial Dataset are obtained as c*={6,7,8}. Thus, the cviIFC correctly identifies the range of possible optimum numbers of clusters.
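The selection procedure itself can be sketched as follows. This is illustrative Python: score_fn stands in for one IFC run followed by the cviIFC computation of chapter 3, and the flatness tolerance is our assumption for reading the elbow automatically.

```python
import numpy as np

def cvi_flat_point(score_fn, c_values=range(2, 14), m_values=(1.3, 2.0),
                   repeats=5, flat_tol=0.05):
    """For each fuzziness value m, average cviIFC over repeated IFC runs
    and report the first c after which the curve changes by < flat_tol."""
    for m in m_values:
        avg = np.array([np.mean([score_fn(c, m) for _ in range(repeats)])
                        for c in c_values])
        # relative change between consecutive c values; small = flat region
        change = np.abs(np.diff(avg)) / np.maximum(np.abs(avg[:-1]), 1e-12)
        flat = [c for c, d in zip(list(c_values)[1:], change) if d < flat_tol]
        print(f"m={m}: cviIFC flattens at about c={flat[0] if flat else '?'}")
```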
cviIFC Analysis for Auto-Mileage Dataset

To identify the range of the optimum number of clusters of the improved fuzzy functions approaches, the proposed validity measure, cviIFC, is also executed for two different values of m, i.e., m=1.3, to represent more crisp models, and m=2.0, to represent more fuzzy models, for a changing number of clusters, c=2,…,15. We obtained the cviIFC values and averaged them over five repetitions. Figure 6.29 depicts the cviIFC values obtained from each experiment as a function of the number of clusters for the two values of m. For the auto-mileage consumption dataset, the cvi-graph indicates that the optimum c* should be around 3 to 8; it is clear that there is not much variation in the value of cviIFC when c>8.

Fig. 6.29 Automobile Mileage Dataset cviIFC graphs for two different fuzziness values, m = {1.3, 2.0}.
Using the exhaustive search, we identified that the optimum number of clusters of the improved fuzzy functions models of the Auto-Mileage Dataset can be any of c*={3,5,6,7,8}. Thus, the cviIFC correctly identifies the range of possible optimum numbers of clusters.
6.5 Analysis of Experiments - Classification (Pattern Recognition) Domains

The goal of classification learning algorithms is to build a classifier from a set of training examples such that the classifier can predict the labels of unseen testing examples. In order to evaluate the performance of the introduced fuzzy functions for classification strategies, five of the most frequently employed machine-learning classification datasets from the UCI Repository [Newman, et al., 1998] are used: pima-diabetes, ionosphere, liver-disorders, breast cancer and credit scoring. In addition, a larger real dataset, California Housing, taken from the StatLib library [Meyer, Vlachos], is also employed to analyze the performance of the proposed classification strategies.
6.5.1 Classification Datasets from UCI Repository

To model the classification datasets, the proposed fuzzy functions for classification methods are implemented. These methods use classifier functions to identify the fuzzy function parameters; for instance, we implement linear fuzzy classifier functions using logistic regression, or non-linear fuzzy classifier functions such as support vector machines for classification. During structure identification, the proposed evolutionary algorithms identify the optimum classification method for the given dataset. In this work, classification datasets with binary dependent variables are used, and the independent variables should be scalar values for support vector machines for classification. A limited number of datasets from the UCI repository match these criteria; we selected those datasets that are suitable for the algorithms we used. A short description of the datasets used in the experiments is given in Table 6.47.

Table 6.47 Description of the classification datasets from the UCI Repository that are used in the experiments

                           PIMA      IONO-    LIVER      CREDIT   BREAST
                           DIABETES  SPHERE   DISORDERS  SCORING  CANCER
# of Patterns              768       349      345        690      277
# of Attributes            8         34       6          15       9
# of Classes               2         2        2          2        2
# of Patterns in Class 1*  268       223      145        307      196
# of Patterns in Class 2   500       126      200        383      81
Training Sample Size       125       150      125        150      127
Validation Sample Size     75        120      75         75       70
Testing Sample Size        50        79       50         50       80

* The original class ratios are preserved when constructing the training, validation and testing datasets.
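The footnote's stratified construction can be illustrated as follows. This is a sketch with scikit-learn standing in for the authors' sampling routine; X and y are assumed to hold one dataset's patterns and binary labels, and the sizes follow the pima-diabetes column:

```python
from sklearn.model_selection import train_test_split

# Draw 125 training, 75 validation and 50 testing patterns while
# preserving the original class ratios (stratify=...).
X_rest, X_train, y_rest, y_train = train_test_split(
    X, y, test_size=125, stratify=y, random_state=0)
X_rest, X_val, y_rest, y_val = train_test_split(
    X_rest, y_rest, test_size=75, stratify=y_rest, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X_rest, y_rest, test_size=50, stratify=y_rest, random_state=0)
```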
Ionosphere: This dataset will be referred to as ion in this work. Each instance in this dataset represents a radar return from the ionosphere, the part of the outer atmosphere that contains large amounts of ions and free electrons and that surrounds the Earth and some planets. The targets were free electrons in the ionosphere. 'Good' radar returns are those showing evidence of some type of structure in the ionosphere; 'bad' returns are those that do not, i.e., their signals pass through the ionosphere. There are 34 variables, most of which have little discriminating power on the output variable. Simple single-variable logistic regression is applied to select the variables with the highest accuracy; only six variables are selected by this variable selection method.

Liver Disorders: This dataset will be referred to as liver in this work. This is a real dataset donated by BUPA Medical Research Ltd. Each of the six input variables represents a blood test indicating a property of liver disorders that may increase with excessive alcohol consumption. Each record in this dataset corresponds to a male individual. The variables include:

1. mean corpuscular volume,
2. alkaline phosphatase,
3. aspartate aminotransferase,
4. gamma-glutamyl transpeptidase,
5. alanine aminotransferase,
6. number of half-pint equivalents of alcoholic beverages drunk per day, and
7. the output variable, which indicates liver disease.

Credit Scoring: This dataset, which represents credit card applications, will be referred to as credit in this work. Each instance in the dataset indicates one individual; the output variable indicates granted or rejected credit. The attributes are protected for security reasons.

Breast Cancer: This dataset will be referred to as cancer in this work. This is one of three domains provided by the Oncology Institute. The instances are described by nine attributes, some of which are real-valued and some nominal; we treated each variable as a continuous-valued attribute. These variables are:
1. Output: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
3. menopause: lt40, ge40, premeno
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
5. invasive nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 (determines how much/fast the tumor has broken through tissue cells, bones, etc.)
6. node-capsule (in/out): yes, no
7. degree of malignancy: 1, 2, 3
8. breast: left, right
9. breast-quad: left-up, left-low, right-up, right-low, central
10. irradiated: yes, no

Pima Diabetes: This dataset will be denoted diabetes. The dataset, which has binary classes (tested positive or negative for diabetes), was donated by the National Institute of Diabetes and Digestive and Kidney Diseases. All eight attributes are numeric-valued. The diagnostic question is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination, or if diabetes was found during routine medical care). The population lives near Phoenix, Arizona, USA. The input variables are:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Output variable: class variable (0 or 1)
6.5.2 Classification Dataset from StatLib

This dataset is available from the Carnegie Mellon University StatLib repository (http://lib.stat.cmu.edu). There are 20,640 observations in this dataset, from the 1990 Census. The output variable is the median house value in each neighborhood, measured in units of $100,000; the houses are divided into two classes, with labels determined by whether their value is $50,000 and above, or less. The predictor variables are demographics such as:

1. median income,
2. housing median age,
3. population,
4. households,
5. total rooms,
6. total bedrooms,
7. latitude, and
8. longitude.

There are eight predictors, all numeric. The dataset is randomly divided into 8,000 observations for structure identification and 12,640 observations for testing purposes.
6.5.3 Results from Classification Datasets

Having presented the classification datasets, here we compare the performance of the proposed methods. For benchmark purposes, well-known classification methods are used, including ANFIS, NN, SVM for classification (SVC), logistic regression, and fuzzy K-nearest neighbors (FKNN). The results from the application of the benchmark and proposed methodologies on the UCI and StatLib classification datasets are analyzed based on the Area Under the Receiver Operating Characteristic (AUC) curve and Accuracy; both AUC and Accuracy values are between 0 and 1. On top of the AUC and Accuracy measures, some performance criteria based on ranking methods are applied to the classification datasets to compare model performances, including the Average Ranking method (AR), the Success Rate Ratio Ranking method (SRR), the Significance Win Ranking method (SWR), and the Power test Ranking method (PR).
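For reference, both measures can be computed from a classifier's scores in a few lines. This is a sketch with scikit-learn standing in for the authors' MATLAB evaluation; y_true and y_score are assumed to be the testing labels and the classifier's class-1 scores, and the 0.5 accuracy threshold is our assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

auc = roc_auc_score(y_true, y_score)                       # area under the ROC curve
acc = accuracy_score(y_true, np.asarray(y_score) >= 0.5)   # thresholded accuracy
```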
Table 6.48 Performance measures (AUC values) of the proposed and benchmark classification methods on six different classification datasets

Method      Ion           Credit        Cancer        Liver         Diabetes      California
LR          0.782 (0.04)  0.865 (0.06)  0.622 (0.06)  0.670 (0.04)  0.722 (0.10)  0.848 (0.02)
ANFIS       0.893 (0.05)  0.764 (0.07)  0.621 (0.09)  0.677 (0.07)  0.639 (0.07)  0.746 (0.03)
NN          0.878 (0.04)  0.833 (0.08)  0.635 (0.06)  0.704 (0.07)  0.828 (0.06)  0.871 (0.02)
SVM         0.916 (0.03)  0.886 (0.07)  0.662 (0.07)  0.723 (0.04)  0.850 (0.06)  0.875 (0.01)
FKNN        0.674 (0.12)  0.591 (0.10)  0.352 (0.06)  0.290 (0.08)  0.511 (0.07)  0.698 (0.01)
T1FF-C      0.896 (0.03)  0.884 (0.04)  0.665 (0.07)  0.625 (0.09)  0.808 (0.11)  0.878 (0.01)
T1IFF-C     0.931 (0.03)  0.894 (0.05)  0.675 (0.05)  0.756 (0.03)  0.847 (0.05)  0.878 (0.01)
DIT2FF-C    0.960 (0.01)  0.892 (0.05)  0.711 (0.06)  0.809 (0.02)  0.840 (0.06)  0.898 (0.00)
DIT2IFF-C   0.905 (0.05)  0.911 (0.06)  0.637 (0.06)  0.750 (0.03)  0.882 (0.04)  0.874 (0.01)
ET1FF-C     0.883 (0.03)  0.894 (0.06)  0.658 (0.06)  0.709 (0.06)  0.834 (0.05)  0.885 (0.01)
ET1IFF-C    0.891 (0.03)  0.904 (0.06)  0.676 (0.05)  0.741 (0.05)  0.851 (0.06)  0.890 (0.01)
EDIT2FF-C   0.904 (0.02)  0.887 (0.06)  0.665 (0.07)  0.762 (0.04)  0.842 (0.05)  0.900 (0.00)
EDIT2IFF-C  0.901 (0.04)  0.890 (0.07)  0.663 (0.05)  0.748 (0.03)  0.849 (0.05)  0.883 (0.00)
In Table 6.48, we present the results of the evaluation of the classification methods based on the AUC performance measure. The ROC graphs of each dataset are given in Appendix D.8, i.e., Appendix Figures D.13 to D.17. The Accuracy tables of the cross validation experiments are displayed in Appendix Tables D.47, D.49, D.51, D.53, D.55, and D.57. In Table 6.49, the parameters of the optimum models based on the AUC performance are given.

Table 6.49 Optimum parameters of the best classification models of the six classification datasets
Parameter            Ion           Credit        Cancer        Liver         Diabetes      California
Model Type           DIT2FF-C      DIT2IFF-C     DIT2FF-C      DIT2FF-C      DIT2IFF-C     EDIT2FF-C
Clustering type      FCM           IFC-C         FCM           FCM           IFC-C         FCM
# of Clusters        {4,5}         {4}           {2,3,4}       {3,5}         {2}           {5,7,9}
Degree of Fuzziness  mL=1.1,       mL=1.1,       mL=1.1,       mL=1.1,       mL=1.1,       mL=1.1,
                     mU=2.3        mU=2.3        mU=2.3        mU=2.3        mU=1.8        mU=2.3
Fuzzy Function Type  SVM           LR            SVM           LR & SVM      LR & SVM      SVM
Fuzzy Function       Creg={64,32}  -             Creg={8}      Creg={64}     Creg={64}     Creg={32,64}
Parameters
The common property of the optimum methodologies of the classification datasets, i.e., DIT2FF-C, DIT2IFF-C, and EDIT2FF-C, is that they all apply the discrete interval valued type-2 fuzzy functions method. Hence, each of the six methods for the corresponding six classification datasets identifies two common collection tables: (1) the level of fuzziness collection table, m-Col*, which holds the optimum degree of fuzziness values for each training vector, identified from the embedded fuzzy membership values, and (2) the local fuzzy function parameters collection table, Φ-Col*, which holds the structures and parameters of the fuzzy functions. Thus, each cluster has a different fuzzy function structure, identified with the embedded models of the uncertainty interval of the fuzzy functions. Furthermore, one of the optimum methodologies, DIT2IFF-C, uses an interim matrix to identify the parameters of the interim fuzzy functions when calculating the membership values using the Improved Fuzzy Clustering (IFC) algorithm. The other optimum methodologies, DIT2FF-C and EDIT2FF-C, apply the Fuzzy C-Means clustering algorithm and therefore do not identify interim fuzzy functions. Thus, for the optimum models of the DIT2IFF-C methodology, interim fuzzy function collection tables, τ-Col*, are identified to hold the optimum parameters and structures of the interim fuzzy functions identified for each training data vector. In the earlier regression experiments, we demonstrated examples of the collection tables and of the fuzzy function structures identified with linear and non-linear methods. The structures of the collection tables of the classification models are the same as those of the regression collection tables; the only difference is the way the classification functions are identified. Instead of regression functions, such as LSE or support vector regression, we use logistic regression or support vector classification, which identify similar parameters of the interim and local fuzzy functions as their regression counterparts. We demonstrate the optimum parameters of the optimum models of one of the classification datasets, i.e., the optimum models of the DIT2IFF-C methodology on the Pima Diabetes dataset, in Appendix D.6.
6.5.4 Proposed Fuzzy Cluster Validity Index Analysis for Classification

The improved fuzzy functions for classification problem domains implement the proposed Improved Fuzzy Clustering for Classification (IFC-C) algorithm. In chapter 3, a new cluster validity measure for classification problem domains, cviIFC-C, is proposed, and it is shown that this measure can identify the range of the optimum number of clusters by narrowing down the candidate values. Hence, in the experiments, we used a heuristic search to identify the optimum number of clusters of the improved fuzzy functions for classification problem domains. Since we were able to identify the optimum number of clusters of the classification datasets using improved fuzzy functions, we can validate the performance of the proposed fuzzy cluster validity measure, cviIFC-C, based on these results. Here the validity of the proposed cluster validity index is discussed using the UCI classification experiment results presented in the previous sub-section.

The proposed validity measure, cviIFC-C, used to find the optimum number of clusters of the Improved Fuzzy Clustering for Classification (IFC-C) models during the learning stage, is applied to each UCI dataset. Since the degree of fuzziness (m) parameter of the IFC-C algorithm is unknown, we iterated the IFC-C algorithm for two different values of m, i.e., m=1.3, to represent more crisp models, and m=2.0, to represent more fuzzy models, for changing values of the number of clusters, c=2,…,10. We obtained the cviIFC-C values and averaged them over five repetitions. Each graph in Appendix Figures D.8 to D.12 depicts the cviIFC-C values obtained from each experiment as a function of the number of clusters for the two values of m. For the Diabetes dataset, the cvi-graph does not clearly indicate where the elbow is, hence it is uncertain where the optimum number of clusters, c*, should be; still, since there is not much variation in the value of cviIFC-C after c=5, one could estimate from the cvi-graph of the diabetes dataset that the optimum c* should be around 5 for each fuzziness value. The elbow is clearer in the ion dataset's cvi-graph in Appendix Figure D.9; one can determine with more confidence that c* ∈ {4, 5*, 6}. From the cviIFC-C graph of the liver disorders dataset, one can observe that when c>4 the cviIFC-C index stabilizes at its minimum (elbow point), which suggests that the optimum number of clusters, c*, should be around 4, 5 or 6. As can be observed from the cvi-graphs of the credit and cancer datasets in Appendix Figures D.11 and D.12, the minimum cviIFC-C starts when c>3; hence, the optimum number of clusters should be around 4 for the credit and cancer datasets for the proposed improved fuzzy function methodologies. The estimated numbers of clusters of the optimum improved fuzzy functions models for each of the UCI datasets are listed in Table 6.49.

In order to verify the performance of the proposed cviIFC-C, we compared the optimum numbers of clusters suggested by the cviIFC-C graphs to the optimum numbers of clusters identified by the fuzzy functions models, shown in Table 6.49. Based on the results, it can be concluded that the cviIFC-C function correctly identified the optimum number of clusters, except for the Diabetes dataset.
6.5.5 Performance Comparison Based on Elapsed Times

In this section we investigate the performance of the proposed fuzzy functions methodologies, in comparison to the well-known methodologies, based on the elapsed times of the learning and reasoning methods. We only present the regression-type models here. It is evident from the results of the above regression experiments that, in general, the two proposed methodologies, i.e., evolutionary type-2 improved fuzzy functions (EDIT2IFF) and type-2 fuzzy functions (DIT2FF), demonstrate good performances. The best benchmark methodologies are support vector machines (SVM) and the earlier type-2 system modeling methods based on fuzzy rule bases (DIT2FRB). In this section, we compare these optimum models based on the elapsed times of their learning and inference (reasoning) algorithms. In addition, in the previous chapters, it was shown that the number of iterations of the parameter optimization of the proposed methods is reduced when evolutionary algorithms are used instead of the exhaustive search method; to justify these findings, the elapsed times of the proposed methods based on exhaustive search and on evolutionary algorithms are also discussed in this section.

All the experiments are executed on an AMD Turion 64 2GHz Windows laptop with 2GB RAM. Each benchmark method is coded in MATLAB, version 7.0.1, and we measured the elapsed times in seconds. The desulphurization regression dataset is used to demonstrate the results. The learning algorithms use only one set of training and validation data to optimize the list of parameters, and a testing dataset to run the inference method. The parameters and their values are listed in Table 6.50.

Table 6.50 Parameters of four different models that are built using four different methods to compare their elapsed times on the desulphurization dataset
Parameter            EDIT2IFF*                DIT2FF*                SVM                  DIT2FRB
Name of the          Evolutionary Discrete    Discrete Interval      Support Vector       Interval Type-2
Methodology          Interval Type-2          Type-2 Fuzzy           Machines             Fuzzy Rule Bases
                     Improved Fuzzy           Functions
                     Functions
Model Type           Regression               Regression             Regression           Regression
Clustering type      Proposed IFC             FCM clustering         N/A                  FCM clustering
                                              [Bezdek, 1981]
Number of Clusters   Min=3, Max=15            {2,3,…,15}             N/A                  {2,…,15}
Degree of Fuzziness  m-lower=1.4,             m-lower=1.4,           N/A                  {1.1, 1.3, 1.5, …, 2.6}
                     m-upper=2.6              m-upper=2.6
Alpha-cut            0                        0                      0                    0
Fuzzy Function Type  SVM and LSE              SVM                    N/A                  Takagi-Sugeno
                     separately                                                           first-order polynomial
Fuzzy Function       Six different fuzzy      One type of fuzzy      Creg={2^-1,…,2^7},   N/A
Parameters to be     function structures;     function structure;    epsilon=
optimized            minCreg=2^-1,            Creg={2^-1,…,2^7},     {0.01,…,0.1}
                     maxCreg=2^7,             epsilon=
                     minEpsilon=0.01,         {0.01,…,0.1}
                     maxEpsilon=0.1

* Indicates the best proposed models obtained from the experiments of this work.
An exhaustive search is applied for three of the above methodologies, i.e., all except EDIT2IFF. In the learning stage of these three methodologies, i.e., DIT2FF, SVM and DIT2FRB, one model is built for each set of parameters using the training dataset, and the performance of each model is evaluated using the validation dataset. Of these models, the model with the best performance measure is selected as
the optimum model. At the end of the learning algorithm, the optimum model is identified and the elapsed time is measured. This process is repeated for each of these three algorithms. It should be pointed out that the EDIT2IFF system has an evolutionary learning schema that tries to identify the optimum model by learning the model parameters from the training data and evaluating the model on the validation dataset. The EDIT2IFF model constructs a list of individual models called the population; each individual is encoded as a genetic code, i.e., a chromosome. Each individual model represented by these chromosomes may employ any regression approximation function from the given list of regression types. In our experiments, we used LSE and SVM, and the algorithm determines which one is the most appropriate approximation type. In this particular comparative analysis, we executed the EDIT2IFF method using exclusively LSE and exclusively SVM; the aim was to calculate how long it would take if one used only one of these methods to identify the local function parameters. As a result, two EDIT2IFF models are built, one that uses LSE and one that uses SVM for function approximation. Hence, five different models are built using five different methodologies, i.e., SVM, DIT2FRB, EDIT2IFF*(LSE), EDIT2IFF*(SVM), and DIT2FF*. The corresponding optimum models of each of these five methodologies are evaluated using the testing dataset, and their individual elapsed times are measured. In order to investigate the elapsed times for a changing number of training samples, each of these five methodologies is applied on training datasets of different sizes, i.e., 25, 50, 100, 150, 200, 250, 500, and 750 training observations. After the optimum models are identified for each method, each of the five optimum models is then applied to testing datasets of different sizes, i.e., 25, 50, 100, 250, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, and 8000 testing observations, to compare their elapsed times.

Figure 6.30 Elapsed times (in minutes) of the structure identification (learning) and reasoning algorithms of the optimum methodologies (SVM, Type-2 FRB, DIT2FF*, EDIT2IFF(SVM), EDIT2IFF(LSE)) using different sizes of training and testing datasets: (left) the training (learning) periods, (right) the inference periods.
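The timing protocol can be summarized by a small helper. This is a Python sketch in which time.perf_counter plays the role of MATLAB's tic/toc, and fit/predict are caller-supplied learning and inference routines:

```python
import time

def timed_run(fit, predict, X_train, y_train, X_test):
    """Measure the learning and inference elapsed times separately,
    in minutes, as plotted in Figure 6.30."""
    t0 = time.perf_counter()
    model = fit(X_train, y_train)                  # structure identification
    learn_minutes = (time.perf_counter() - t0) / 60.0
    t0 = time.perf_counter()
    predict(model, X_test)                         # reasoning (inference)
    infer_minutes = (time.perf_counter() - t0) / 60.0
    return learn_minutes, infer_minutes
```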
It is evident from Figure 6.30 that the EDIT2IFF method that uses LSE takes the least amount of time to learn and reason; its learning times start to increase linearly once the number of training samples exceeds 200. The second fastest method is the SVM regression method. It takes twice the time to learn the EDIT2IFF models when SVM is used to approximate the fuzzy functions, compared to the pure SVM methodology (left graph in Figure 6.30). Although their speeds are the same up to 2000 testing data points, SVM is faster in reasoning than the proposed EDIT2IFF methods as the number of testing samples increases; nonetheless, the difference between them is less than 30 seconds for 8000 testing samples. It should be pointed out from the right graph in Figure 6.30 that one would expect a linear relationship between the testing sample size and the time spent. However, for some methods, e.g., DIT2FF and EDIT2IFF, this is not the case. This is due to the case-based inference method, which relies on a search method; the search may differ based on the selected training and testing samples. We used MATLAB's dsearch(.) method, which returns the indices of the closest points in the training dataset for each point in the testing dataset.

In chapters 4 and 5, the evolutionary extensions of the fuzzy functions strategies are introduced, and it was shown that, when genetic fuzzy functions are used, the time it takes to learn the algorithms is much less than for the fuzzy functions that use exhaustive search methods. This was proven under parametric assumptions.

Table 6.51 Parameters of different models that are built to measure their elapsed times. Four different methods are paired based on their optimization methods.
Parameter                                              Values
Fuzzy Functions that use the exhaustive search         T1IFF, DIT2IFF
method for parameter optimization
Fuzzy Functions that use evolutionary algorithms       ET1IFF, EDIT2IFF
for parameter optimization
Fuzzy Function methodologies that are paired in        T1IFF → ET1IFF
order to compare their elapsed times*                  DIT2IFF → EDIT2IFF
Model Type                                             Regression
Clustering type                                        Improved Fuzzy Clustering
Clustering Parameters                                  # of clusters (8 values) = {2, 3, 4, 5, 6, 7, 8, 9};
                                                       m (fuzziness) (10 values) = {1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1};
                                                       number of clustering iterations = 50
Genetic Algorithm Parameters                           # iterations = 10; # populations = 10
Fuzzy Function Type and parameters                     LSE; five different fuzzy function types; alpha-cut = 0

* '→' indicates that the elapsed times of the paired methodologies are compared in the analysis.
Herein, this assumption is analyzed using real datasets. Four different proposed methods, i.e., T1IFF, ET1IFF, DIT2IFF, and EDIT2IFF, are applied on the same desulphurization dataset using the parameters shown in Table 6.51. We individually compared the elapsed times of the type-1 fuzzy functions methodologies based on exhaustive search and on evolutionary methods, i.e., T1IFF and ET1IFF; in addition, their type-2 extensions, i.e., DIT2IFF and EDIT2IFF, are also used to compare elapsed times. An exhaustive search is applied to optimize the parameters of the T1IFF and DIT2IFF methods. The DIT2IFF uses the optimum number of clusters of the T1IFF model; the uncertainty boundaries are then identified based on the given fuzziness values, m, and the fuzzy function structures. The remaining two methods, ET1IFF and EDIT2IFF, use evolutionary algorithms for optimization. In these methods, the evolutionary algorithm is executed using 10 populations and 10 iterations for the evolutionary operations. It should be pointed out that these values are set for demonstration purposes, only for the elapsed-time comparison experiments; they are very small for an actual model. However, to keep the number of different values of the system parameters the same across the type-1 and type-2 models and their evolutionary extensions, we assigned the same number of values to each parameter; for instance, we used approximately 10 different values for the number of clusters and for the degree of fuzziness parameters. The elapsed times of the structure identification of these four models are depicted in Figure 6.31. It is evident from Figure 6.31 that the elapsed times of the evolutionary extensions of the type-1 and type-2 fuzzy functions systems, i.e., ET1IFF and EDIT2IFF, are considerably lower than those of the fuzzy functions based on the exhaustive search method, i.e., T1IFF and DIT2IFF.

Fig. 6.31 Elapsed times (in minutes) of the structure identification of the different fuzzy function systems (left: T1IFF vs. ET1IFF; right: DIT2IFF vs. EDIT2IFF), as a function of the number of training samples.
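The source of the speed-up can also be seen from a rough count of model evaluations under the Table 6.51 grid. This is a back-of-the-envelope estimate; the exact bookkeeping of the implementations may differ:

```python
# Exhaustive search evaluates every grid combination of Table 6.51:
exhaustive = 8 * 10 * 5   # cluster values x fuzziness values x function types = 400

# The genetic search evaluates one model per individual per generation:
evolutionary = 10 * 10    # populations x iterations = 100

print(exhaustive, evolutionary, exhaustive / evolutionary)  # 400 100 4.0
```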
It should be emphasized that the long elapsed times of the learning algorithms presented above are due to the software tools used to implement them and to conduct these experiments. For instance, the LIBSVM MATLAB toolbox functions [Chang and Lin, 2001], which are used to execute the SVM techniques, are developed for scientific research purposes in the C programming language, and the function calls are employed through a script in MATLAB. The elapsed times can be reduced by implementing the proposed approaches in one single language such as C or Java. The MATLAB 2007 edition allows Distributed Computing, which enables an application to run across various machines and brings down the computing time very significantly; the MATLAB version we used does not have this feature, so it could not be employed during the experiments. In addition, an issue with LIBSVM stated in [Chang and Lin, 2001] is that, for larger datasets, the learning methods take more time to converge. This issue can be overcome using better optimization techniques. The aim of this work is to investigate fuzzy function methodologies in terms of increasing prediction performance and building robust models; therefore, optimization method improvements were not investigated. In addition, using evolutionary algorithms, we wanted to reduce the number of iterations of the fuzzy function approaches that are based on the exhaustive search, while keeping the performance intact.

It is generally more important to investigate the elapsed times of the reasoning algorithms, i.e., the inference elapsed times, than the training elapsed times. In our experiments, the inference methods take a considerably shorter time than the training methods. Most importantly, the elapsed times of the inference methods of the proposed approaches are not longer than those of the well-known benchmark approaches. Hence, in terms of elapsed times, the reasoning algorithms of the proposed fuzzy functions approaches take a reasonable amount of time.
6.6 Overall Discussions on Experiments

The experiments in this work comprise two different problem domains, i.e., regression and classification, to evaluate the performance of the proposed fuzzy functions methodologies against benchmark methods. Stock price estimation problems, which belong to the regression domain, are analyzed differently than the rest of the regression problems. Therefore, we used different performance measures for these three different problem domains: regression, classification, and stock price estimation. In the stock price estimation experiments, we used a new Robust Simulated Trading Benchmark (RSTB) measure to evaluate the profitability of each stock price estimation model. To evaluate the performance and the robustness of the proposed methodologies, we averaged the RSTB results over five different stock price testing datasets. In order to evaluate the overall significance of the proposed methodologies, we also used significance t-test measures across the five different testing datasets. The significance measures indicate that the proposed methodology is significantly better on the five stock price testing datasets based on RSTB, compared to the other benchmark methodologies. On the other hand, the performance on the remaining four regression datasets, i.e., Friedman's Artificial, Auto-Mileage, Dofasco Reagent1, and Dofasco Reagent2, is evaluated by the average testing R2 performances. The result of each
methodology (system modeling technique) is ranked by its average R2 performance on the test datasets. In order to evaluate the overall significance of the proposed methodologies, we again used significance t-test measures across the four different testing datasets. Each cell in the t-test tables indicates: (i) the decision on the null hypothesis, which is either Fail to Reject (FR), in which case the null hypothesis is retained as true, or Reject (R), in which case the null hypothesis is rejected as false; and (ii) the probability of observing the null hypothesis. Small probabilities cast doubt on the validity of the null hypothesis. In these experiments, the null hypothesis is not rejected (fail to reject) when the probability is more than 5 percent; in other words, the null hypothesis fails to be rejected at the 95 percent confidence level.
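For illustration, a single cell of the t-test tables in this section can be produced along the following lines. The per-dataset scores below are invented, and scipy's one-sample test on the paired differences is used as a stand-in for the authors' exact test procedure; under the left-tailed convention above, a high probability leads to a Fail-to-Reject decision, i.e., method j is declared significantly better.

```python
import numpy as np
from scipy import stats

# Hypothetical average R2 scores of a proposed method j and a benchmark k
# on the same four regression datasets (values invented for illustration).
r2_j = np.array([0.94, 0.86, 0.80, 0.81])
r2_k = np.array([0.90, 0.82, 0.77, 0.75])

# Left-tailed one-sample test of the paired differences against the
# improvement threshold: the null hypothesis is that j outperforms k
# by more than 0.025 in R2.
t_stat, p_left = stats.ttest_1samp(r2_j - r2_k, popmean=0.025,
                                   alternative="less")

# Table convention: Fail to Reject (FR) when the probability exceeds
# 0.05, meaning method j is significantly better than method k.
print("FR" if p_left > 0.05 else "R", f"({p_left:.2f})")
```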
6.6.1 Overall Comparison of System Modeling Methods on Regression Datasets

In this section, we discuss the overall performances of the proposed system modeling methods in comparison to the benchmark methodologies on the regression datasets. Even though stock price estimation is a regression problem domain, it is analyzed separately from the rest of the regression models in this work; thus, in presenting the overall comparisons, we separate it from the rest of the regression models.

6.6.1.1 Overall Comparison of Stock Price Datasets

The overall robustness of the models is evaluated based on the average testing RSTB values across five different stock price datasets. The average rank is calculated with respect to the average RSTB values, as shown in Table 6.52. The table is sorted by average rank in ascending order. It can be observed from the results in Table 6.52 that the proposed algorithms, the EDIT2IFF and DIT2FF methodologies in the first two rows, outperformed the rest of the methodologies, as shown in the overall ranking column. The profitability of these two methods is even higher than that of the Buy and Hold strategy. The common property of the two most profitable models is that they both implement the proposed Discrete Interval Type-2 Fuzzy Functions methods. The most profitable one, EDIT2IFF, implements the proposed Improved Fuzzy Clustering algorithm and optimizes the system parameters with evolutionary algorithms. The least profitable methodologies are NN, DENFIS, ANFIS and GFS, the last four rows of Table 6.52. Their poor performance is mostly due to the fact that on some datasets the models constructed by these methodologies are discarded by the RSTB calculations, which assign '0' profitability; their average profitability therefore plunges, and in turn they cannot be considered robust methods for stock price analysis. Based on the average results obtained from the stock price prediction experiments, the EDIT2IFF methodology improves the system's prediction performance (RSTB=$98.88) by 2.17% based on a $100 investment, compared to the Buy and Hold strategy (RSTB=$96.71). Similarly, EDIT2IFF
Table 6.52 Average profit based on the RSTB performance measure and ranking of system modeling techniques on five real stock price datasets
Method     | TD           | BMO         | SLF         | ENB         | LB          | Overall Average RSTB | AR  | Overall Ranking
EDIT2IFF*  | $115.44 (1)  | $88.75 (3)  | $91.75 (4)  | $96.11 (2)  | $102.35 (1) | $98.88               | 2.2 | 1
DIT2FF*    | $113.12 (2)  | $94.97 (1)  | $95.95 (2)  | $86.21 (10) | $94.53 (4)  | $96.96               | 3.8 | 2
Buy&Hold   | $102.28 (11) | $92.91 (2)  | $95.56 (3)  | $92.73 (3)  | $100.43 (2) | $96.71               | 4.5 | 3
ET1FF      | $112.72 (3)  | $86.38 (8)  | $96.79 (1)  | $83.32 (13) | $92.34 (8)  | $94.31               | 4.8 | 4
DIT2FRB    | $102.04 (12) | $86.26 (9)  | $90.28 (6)  | $97.21 (1)  | $96.29 (3)  | $94.42               | 7.3 | 5
ET1IFF     | $111.18 (6)  | $88.07 (6)  | $87.73 (11) | $89.53 (5)  | $92.84 (7)  | $93.87               | 7.3 | 5
T1FF       | $111.33 (5)  | $86.04 (10) | $89.85 (7)  | $85.82 (11) | $90.77 (13) | $92.76               | 8.5 | 6
T1IFF      | $110.61 (7)  | $85.72 (12) | $88.57 (10) | $84.3 (12)  | $93.8 (6)   | $92.60               | 8.5 | 6
DIT2IFF    | $109.62 (8)  | $85.99 (11) | $87.26 (12) | $88.47 (7)  | $93.91 (5)  | $93.05               | 8.8 | 7
EDIT2FF    | $111.93 (4)  | $85.69 (13) | $88.76 (8)  | $87.33 (9)  | $91.58 (12) | $93.06               | 9   | 8
SVM        | $109.54 (9)  | $85.26 (14) | $90.29 (5)  | $90.06 (4)  | $92.06 (11) | $93.44               | 9.5 | 9
NN         | $100.33 (14) | $88.17 (5)  | $88.62 (9)  | $88.52 (6)  | $0 (14)     | $73.13               | 10  | 10
DENFIS     | $0 (15)      | $86.82 (7)  | $87.18 (13) | $87.34 (8)  | $92.09 (10) | $70.69               | 11  | 11
ANFIS      | $106.3 (10)  | $0 (15)     | $0 (14)     | $0 (14)     | $92.28 (9)  | $39.72               | 12  | 12
GFS        | $100.4 (13)  | $88.22 (4)  | $0 (14)     | $0 (14)     | $0 (14)     | $37.72               | 12  | 12
* Indicates the best models. The values in bold indicate the best result of the corresponding row's method, on average over the five models of the five real stock datasets.
improves the prediction performance by 4.46% compared to the closest benchmark model with the best RSTB value, the DIT2FRB model (RSTB=$94.42). Since broker commissions, i.e., trading costs, are negligible for a sizeable investment, we did not include these costs in the RSTB calculations. We measure the significance of a system modeling technique in comparison to the other techniques based on the results on the five different stock price datasets. The t-test results are shown in Table 6.53. The null hypothesis of the stock price t-test experiments states that the performance of one algorithm is significantly different from that of another by more than a fixed profit margin. The null hypothesis of the stock price experiments is as follows:
H_0:\ \left(\frac{1}{5}\sum_{dataset=1}^{5} RSTB_{j,dataset}\right) - \left(\frac{1}{5}\sum_{dataset=1}^{5} RSTB_{k,dataset}\right) > \$5\ \text{profit}
In these experiments the null hypothesis states that the difference between the average RSTB values obtained from the application of two different strategies, denoted j (row) and k (column), is greater than $5 in profit value (based on the Canadian $100 investment) across these five stock datasets. Failing to Reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level and that methodology j is significantly better than methodology k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different. For
instance, in Table 6.53, the t-test between row DIT2FF and column ANFIS, "FR (0.98)", indicates that the profitability of the proposed DIT2FF method is at least CAD$5 better than that of the ANFIS method on average over the five stock price datasets.

Table 6.53 Overall two-sample left-tailed t-test results (p<0.05) for the stock price datasets. FR: Fail to Reject the null hypothesis; R: Reject the null hypothesis. The number next to each decision indicates the probability of observing the decision (FR/R).
           | Buy&Hold | NN      | ANFIS   | DENFIS  | GFS     | SVM     | DIT2FRB
T1FF       | R 0.01   | FR 0.78 | FR 0.97 | FR 0.77 | FR 0.97 | R 0     | R 0.044
T1IFF      | R 0.01   | FR 0.77 | FR 0.98 | FR 0.77 | FR 0.97 | R 0.00  | R 0.04
DIT2FF*    | FR 0.08  | FR 0.84 | FR 0.98 | FR 0.82 | FR 0.98 | FR 0.26 | FR 0.279
DIT2IFF    | R 0.01   | FR 0.78 | FR 0.97 | FR 0.77 | FR 0.97 | R 0     | R 0.021
ET1FF      | R 0.04   | FR 0.80 | FR 0.97 | FR 0.78 | FR 0.98 | R 0.04  | FR 0.133
ET1IFF     | R 0.02   | FR 0.80 | FR 0.97 | FR 0.78 | FR 0.98 | R 0.00  | R 0.044
EDIT2FF    | R 0.02   | FR 0.78 | FR 0.97 | FR 0.77 | FR 0.97 | R 0     | R 0.043
EDIT2IFF*  | FR 0.19  | FR 0.84 | FR 0.98 | FR 0.84 | FR 0.98 | FR 0.61 | FR 0.418
* Indicates the best methods based on the performance measure. Shaded cells indicate the best methods in the rows that are significantly better than the benchmark method in the intersecting column.
It is evident from Table 6.53 that the optimum models of the proposed DIT2FF and EDIT2IFF methods are significantly different from all the benchmark methods (since none of the null hypotheses is rejected).

Discussions on the New Robust Simulated Trading Benchmark Performance Measure

The Mean Absolute Percentage Error (MAPE) values of each stock dataset obtained from the benchmark and proposed models, shown in Table 6.43 in Section 6.4.4.3, are very small in magnitude, viz., around the 1% level. This is due to the nature of the stock price estimation models. Therefore, it is
generally difficult to determine the optimum algorithm for predicting stock prices based on MAPE or any other error measure [Hellstrom and Holmstrom, 1998]. It is more practical and more easily comparable to measure the profitability of the models instead of their error reduction abilities [Hellstrom and Holmstrom, 1998]. Thus, we developed the Robust Simulated Trading Benchmark measure, RSTB, to calculate the profit of every model based on an initial investment amount. The objective is to find an algorithm that can build models that increase the profit. It can be observed from Table 6.52 that the average RSTB values are quite different from each other based on the initial $100 investment. We now look closely at the reasons why the MAPEs are almost the same for each model, yet the profits differ. For this, we examined the testing dataset of one of the stock datasets in the experiments, the Toronto Dominion stock prices, and compared two different algorithms, as shown in Figure 6.32.
Fig. 6.32 TD Canada Trust stock prices of the testing dataset (100-day trading period) – (top) actual and predicted prices using two different algorithms, SVM and the proposed EDIT2IFF; (bottom) current market values at each trading day t obtained from the estimated models
Overall MAPE values of the SVM and EDIT2IFF models are 0.24% and 0.22%, respectively. The estimated stock prices from these two models and the actual stock prices are illustrated in the top graph of Figure 6.32. The lower graph compares the current portfolio values obtained from the SVM and the
EDIT2IFF models at any day t: the portfolio is either the cash held, or the current value of the stocks held, i.e., [#stocks_t * stock_price_t], viz., the total amount of money or the current value of the stocks in the portfolio at time t, calculated from the SVM and EDIT2IFF predictions according to the RSTB calculation. The trader is assumed to hold either cash or a number of stocks in his/her portfolio at any day t during the trading process, as explained in "Evaluation Criteria for Stock Price Prediction Models" in Section 6.4.4.2. Up until day 79, the current portfolio values of the two models are very close to each other. After this day, the current portfolio values of the models diverge, as shown in the bottom graph of Figure 6.32. The changes in estimations and buy-hold-sell strategies are analyzed with the help of the magnified view of the stock prices shown in the lower graph of Figure 6.33 and in Table 6.54. In the upper graph of Figure 6.33, the actual and predicted stock price values are shown, obtained from a benchmark method and the best optimum proposed methodology, the SVM and EDIT2IFF methods, respectively. The lower graph compares the current portfolio values of the optimum SVM model and the proposed EDIT2IFF model.
Fig. 6.33 Magnified view of the TD Canada Trust stock prices in Figure 6.32 – actual versus predicted prices and current market values
It should be noted from Table 6.54 that for day 77 the proposed model predicts that the closing price will be less than the previous day's closing price, i.e., (68.22 < 68.25_{t=76}), whereas SVM predicts that it will be higher, i.e., (68.32 > 68.25_{t=76}). At this point, they make different decisions. While the proposed EDIT2IFF model decides to sell the stocks, the SVM model predicts to buy more stocks (since there is no cash in the SVM model's portfolio, the actual decision is changed to hold). Thus, the cash values of the two models are different even though their errors in estimating this particular closing price are very close to each other (error_SVM(t=77) = |67.60 - 68.32| = 0.72, error_EDIT2IFF(t=77) = |67.60 - 68.22| = 0.62), where the observed closing price is $67.60, and the predicted closing prices for day 77 are $68.32 with SVM and $68.22 with EDIT2IFF. The complete list of estimations of the SVM and EDIT2IFF models for this particular dataset is given in Table D.29 in the Appendix.

Table 6.54 TD Canada Trust buy-sell-hold decisions and current market values between days 73 and 82, obtained from the SVM model and the proposed Evolutionary Type-2 Improved Fuzzy Functions model (EDIT2IFF)
      Actual  |            SVM Model                      |      Proposed EDIT2IFF Model
t Day $       | Pred.$  Decis.  Cash$   #Stock  Market$   | Pred.$  Decis.  Cash$   #Stock  Market$
73    69.20   | 68.97   Sell    110.37  0       110.37    | 68.98   Sell    110.02  0       110.02
74    69.77   | 69.18   Sell    110.37  0       110.37    | 69.19   Sell    110.02  0       110.02
75    69.45   | 69.86   Buy     0       1.6     111.28    | 69.93   Buy     0       1.6     110.92
76    68.25   | 69.53   Buy     0       1.6     110.77    | 69.58   Buy     0       1.6     110.41
77    67.60   | 68.32   Buy     0       1.6     108.86    | 68.22   Sell    110.41  0       110.41
78    67.95   | 67.73   Buy     0       1.6     107.82    | 67.58   Sell    110.41  0       110.41
79    68.01   | 68.14   Buy     0       1.6     108.38    | 68.05   Buy     0       1.63    110.98
80    69.22   | 68.14   Buy     0       1.6     108.48    | 68.05   Buy     0       1.63    111.08
81    69.06   | 69.35   Buy     0       1.6     110.41    | 69.37   Buy     0       1.63    113.06
82    70.12   | 69.12   Buy     0       1.6     110.15    | 69.12   Buy     0       1.63    112.80
..    ..      | ..      ..      ..      ..      ..        | ..      ..      ..      ..      ..
99    69.17   | 69.17   -1.00   110.69  0.00    110.69    | 69.20   1.00    0.00    1.67    115.82
100   68.95   | 69.14   -1.00   110.69  0.00    110.69    | 69.16   -1.00   115.82  0.00    115.82
              |                         RSTB =  110.69    |                         RSTB =  115.82
The rows at days 77 and 78 are highlighted to indicate the point of change of estimation values of two different prediction methods, traditional SVM and proposed EDIT2IFF.
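To make the trading mechanics concrete, the sketch below implements one plain reading of the simulated trading rule described above: hold either cash or stock, buy when the model predicts the next close above the current one, sell when it predicts a drop, and report the final portfolio value. The price and prediction series are invented, and trading costs are ignored, as stated in the text; this is an illustration of the rule, not the authors' exact RSTB implementation.

```python
def rstb(actual, predicted, initial_cash=100.0):
    """Simulated trading profit: start with cash, trade on the direction of
    the predicted next-day price change, and return the final portfolio
    value after liquidating at the last observed close."""
    cash, shares = initial_cash, 0.0
    for today, pred_next in zip(actual, predicted[1:]):
        if pred_next > today and cash > 0:        # expect a rise: buy
            shares, cash = cash / today, 0.0
        elif pred_next < today and shares > 0:    # expect a drop: sell
            cash, shares = shares * today, 0.0
    return cash + shares * actual[-1]

# Invented closing prices and model predictions for a short trading window.
actual = [69.20, 69.77, 69.45, 68.25, 67.60, 67.95, 68.01, 69.22]
pred_a = [69.0, 69.2, 69.9, 69.5, 68.3, 67.7, 68.1, 68.1]  # "SVM-like"
pred_b = [69.0, 69.2, 69.9, 69.6, 68.2, 67.6, 68.1, 68.1]  # "proposed-like"
print(rstb(actual, pred_a), rstb(actual, pred_b))
```

Although the two prediction series above differ only by a few cents, the one that calls the direction of the day-77 drop correctly ends with a higher portfolio value, which is precisely the effect Table 6.54 documents.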
6.6.1.2 Overall Comparison of Regression Datasets Other Than Stock Prices

As mentioned in the previous sub-section, the overall robustness is evaluated based on the average testing R2 values across four different regression datasets, i.e., Friedman's Artificial, Auto-Mileage from the UCI Repository, and the desulphurization dataset for Reagent1 and Reagent2. The average rank is calculated with respect to the average R2 values, as shown in Table 6.55. It is evident from Table 6.55 that the optimum model among the regression experiments on the four different datasets is the proposed EDIT2IFF, which applies the
new Improved Fuzzy Clustering method. Based on the average R2 results obtained from the regression experiments, the EDIT2IFF methodology (average R2=0.854) improves the system's prediction performance by about 2% compared to the closest benchmark model with the best R2 value, the SVM model (average R2=0.838). The null hypothesis of the regression t-test experiments states that the performances of two paired algorithms are significantly different. The null hypothesis of the regression experiments is as follows:
H_0:\ \left(\frac{1}{4}\sum_{dataset=1}^{4} R^2_{j,dataset}\right) - \left(\frac{1}{4}\sum_{dataset=1}^{4} R^2_{k,dataset}\right) > 2.5\%
Table 6.55 Average R2 and rankings of system modeling techniques on the regression problems. The values in parentheses indicate the rank of each methodology for the corresponding dataset in each column. The values in bold indicate the optimum method for the corresponding dataset of each column.
Method     | Friedman   | Auto Mileage | Reagent1   | Reagent2   | Average R2 | AR    | Overall Rank
EDIT2IFF*  | 0.94 (4)   | 0.863 (1)    | 0.805 (1)  | 0.807 (1)  | 0.854      | 1.75  | 1
T1IFF      | 0.939 (6)  | 0.863 (1)    | 0.79 (4)   | 0.769 (6)  | 0.840      | 4.25  | 2
ET1IFF     | 0.941 (3)  | 0.83 (11)    | 0.794 (2)  | 0.805 (2)  | 0.843      | 4.5   | 3
ET1FF      | 0.942 (2)  | 0.84 (10)    | 0.792 (3)  | 0.776 (4)  | 0.838      | 4.75  | 4
SVM        | 0.938 (7)  | 0.848 (5)    | 0.789 (5)  | 0.776 (3)  | 0.838      | 5     | 5
T1FF       | 0.939 (5)  | 0.851 (4)    | 0.789 (6)  | 0.762 (8)  | 0.835      | 5.75  | 6
DIT2IFF    | 0.924 (9)  | 0.858 (2)    | 0.785 (8)  | 0.765 (7)  | 0.833      | 6.5   | 7
EDIT2FF    | 0.946 (1)  | 0.847 (6)    | 0.773 (9)  | 0.733 (11) | 0.825      | 6.75  | 8
DIT2FF     | 0.936 (8)  | 0.853 (3)    | 0.785 (7)  | 0.739 (10) | 0.828      | 7     | 9
NN         | 0.873 (11) | 0.841 (9)    | 0.767 (10) | 0.774 (5)  | 0.814      | 8.75  | 10
DIT2FRB    | 0.893 (10) | 0.846 (7)    | 0.745 (11) | 0.751 (9)  | 0.809      | 9.25  | 11
DENFIS     | 0.855 (12) | 0.845 (8)    | 0.686 (12) | 0.686 (13) | 0.768      | 11    | 12
GFS        | 0.855 (12) | 0.84 (10)    | 0.68 (13)  | 0.73 (12)  | 0.776      | 11.75 | 13
ANFIS      | 0.444 (13) | 0.767 (12)   | 0.591 (14) | 0.624 (14) | 0.607      | 12.75 | 14
* Indicates the best system modeling tool on average.
The null hypothesis, H0, states that the difference between the average R2 of methodologies j and k is greater than 2.5% (a 0.025 difference in R2 value). Failing to Reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level and that methodology j (row) is significantly better than methodology k (column). Rejecting the null hypothesis indicates that the two methodologies are not significantly different. It is evident from Table 6.56 that the optimum models of the proposed EDIT2IFF method are significantly different from all the benchmark methods at the 95 percent confidence level (since none of the null hypotheses is rejected).
Table 6.56 Overall two-sample left-tailed t-test results (p<0.05) for the regression datasets. FR: Fail to Reject the null hypothesis; R: Reject the null hypothesis. The number next to each decision indicates the probability of observing the decision (FR/R).
           | ANFIS   | DENFIS  | NN      | GFS     | SVM     | DIT2FRB
T1FF       | FR 0.97 | FR 0.95 | FR 0.42 | FR 0.91 | R 0     | FR 0.55
T1IFF      | FR 0.97 | FR 0.98 | FR 0.54 | FR 0.95 | R 0     | FR 0.77
DIT2FF     | FR 0.96 | FR 0.94 | FR 0.31 | FR 0.85 | R 0.01  | FR 0.35
DIT2IFF    | FR 0.97 | FR 0.96 | FR 0.33 | FR 0.93 | R 0     | FR 0.46
ET1FF      | FR 0.97 | FR 0.94 | FR 0.47 | FR 0.91 | R 0     | FR 0.61
ET1IFF     | FR 0.97 | FR 0.92 | FR 0.59 | FR 0.91 | R 0.04  | FR 0.69
EDIT2FF    | FR 0.95 | FR 0.91 | FR 0.29 | FR 0.81 | R 0.01  | FR 0.29
EDIT2IFF*  | FR 0.98 | FR 0.98 | FR 0.92 | FR 0.98 | FR 0.09 | FR 0.96
* Indicates the best method based on the R2 performance measure. Shaded cells indicate the best method in the rows that is significantly better than the benchmark methods, i.e., ANFIS, DENFIS, NN, GFS, SVM, DIT2FRB, in the intersecting column.
6.6.2 Overall Comparison of System Modeling Methods on Classification Datasets

As mentioned in the previous sub-section, the overall robustness is evaluated based on the average testing AUC values across six different classification datasets: Ionosphere, Credit Scoring, Breast Cancer, Liver Disorders, Diabetes, and California Housing. The average rank is calculated with respect to the average AUC values, as shown in Table 6.57. The mean AUC column in Table 6.57 indicates the average of the corresponding methodology's AUC values over the six classification datasets. The values in parentheses indicate the rank of the methodology in each column, i.e., on each classification dataset. The Avrg Rank, i.e., average rank, is the average of the ranks of each methodology in each row, and the Over. Rank, i.e., overall rank, is the truncated rank value. It is evident from Table 6.57 that the optimum model is the proposed DIT2FF-C, which applies the new Interval Valued Type-2 Fuzzy Functions. Based on the mean AUC results obtained from the classification experiments, the DIT2FF-C methodology (mean AUC=85.2%) improves the system's prediction performance by 3.3% compared to the closest benchmark model with the best AUC value, the SVM model (mean AUC=81.9%).
Table 6.57 Average AUC values and rankings of system modeling techniques on the classification problems. The values in parentheses indicate the rank of each methodology for the corresponding dataset in each column.
            | Ion      | Credit   | Cancer   | Liver    | Diabet   | Califor. | Mean AUC | Avrg Rank | Over. Rank
LR          | 0.78(12) | 0.86(10) | 0.62(11) | 0.67(11) | 0.72(11) | 0.84(11) | 0.752    | 11.00     | 11
ANFIS       | 0.89(8)  | 0.76(12) | 0.62(12) | 0.67(10) | 0.63(12) | 0.74(12) | 0.723    | 11.00     | 12
NN          | 0.88(11) | 0.83(11) | 0.63(10) | 0.70(9)  | 0.82(9)  | 0.87(10) | 0.792    | 10.00     | 10
SVM         | 0.92(3)  | 0.88(8)  | 0.66(7)  | 0.72(7)  | 0.85(3)  | 0.87(8)  | 0.819    | 6.00      | 7
FKNN        | 0.67(13) | 0.59(13) | 0.35(13) | 0.29(13) | 0.51(13) | 0.69(13) | 0.519    | 13.00     | 13
T1FF-C      | 0.89(7)  | 0.88(9)  | 0.66(4)  | 0.62(12) | 0.81(10) | 0.87(6)  | 0.793    | 8.00      | 9
T1IFF-C     | 0.93(2)  | 0.89(3)  | 0.67(3)  | 0.75(3)  | 0.84(5)  | 0.87(7)  | 0.830    | 3.83      | 2
DIT2FF-C*   | 0.96(1)  | 0.89(5)  | 0.71(1)  | 0.80(1)  | 0.84(7)  | 0.89(2)  | 0.852    | 2.83      | 1
DIT2IFF-C   | 0.91(4)  | 0.91(1)  | 0.63(9)  | 0.75(4)  | 0.88(1)  | 0.87(9)  | 0.827    | 4.67      | 5
ET1FF-C     | 0.88(10) | 0.89(4)  | 0.65(8)  | 0.70(8)  | 0.83(8)  | 0.88(4)  | 0.811    | 7.00      | 8
ET1IFF-C    | 0.89(9)  | 0.90(2)  | 0.67(2)  | 0.74(6)  | 0.85(2)  | 0.89(3)  | 0.826    | 4.00      | 3
EDIT2FF-C   | 0.90(5)  | 0.88(7)  | 0.66(5)  | 0.76(2)  | 0.84(6)  | 0.9(1)   | 0.827    | 4.33      | 4
EDIT2IFF-C  | 0.90(6)  | 0.89(6)  | 0.66(6)  | 0.74(5)  | 0.84(4)  | 0.88(5)  | 0.822    | 5.33      | 6
• (*) Indicates the best methodology based on the performance measure. • The values in bold indicate the best performance obtained for the corresponding dataset of the column by the corresponding methodology in the row.
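For reference, AUC here is the standard area under the ROC curve; with scikit-learn it can be computed from labels and classifier scores as follows (the labels and scores below are invented for illustration).

```python
from sklearn.metrics import roc_auc_score

# Invented binary class labels and classifier scores for illustration.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]

# AUC is the probability that a randomly chosen positive sample is
# scored above a randomly chosen negative sample.
print(round(roc_auc_score(y_true, y_score), 3))
```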
Next, we present the comparison of the significance of the AUC performances of the system modeling methodologies on the classification problems, based on significance t-test analysis across the six classification datasets. Significance of methodology j over methodology k indicates that methodology j is significantly better based on the results from all datasets. The significance t-tests are used to check whether the AUC values of the optimum models of one methodology are significantly better than those of the rest of the models. The null hypothesis of the t-test experiments states that the performances of two paired algorithms are significantly different. The null hypothesis of the classification experiments is as follows:
H_0:\ \left(\frac{1}{6}\sum_{dataset=1}^{6} AUC_{j,dataset}\right) - \left(\frac{1}{6}\sum_{dataset=1}^{6} AUC_{k,dataset}\right) > 5\%
The H0 states that the difference between the average AUC values of methodologies j and k obtained from the six classification datasets is greater than 5% (a 0.05 difference in AUC values). Failing to Reject (FR) the null hypothesis indicates that the null hypothesis holds at the 95 percent confidence level and that methodology j is significantly better than methodology k. Rejecting the null hypothesis indicates that the two methodologies are not significantly different. As can be seen from Table 6.58, the optimum models of the proposed DIT2FF-C method are significantly different from the benchmark methods, since the null hypothesis is never rejected.
Table 6.58 Overall two-sample left-tailed t-test results (p<0.05) for the classification datasets. FR: Fail to Reject the null hypothesis; R: Reject the null hypothesis. The number next to each decision indicates the probability of observing the decision (FR/R).

            | LR      | ANFIS   | NN      | SVC     | FKNN
T1FF-C      | FR 0.35 | FR 0.7  | R 0.01  | R 0     | FR 1
T1IFF-C     | FR 0.9  | FR 0.97 | FR 0.11 | R 0     | FR 1
DIT2FF-C*   | FR 0.97 | FR 1    | FR 0.75 | FR 0.13 | FR 1
DIT2IFF-C   | FR 0.84 | FR 0.91 | FR 0.12 | R 0     | FR 1
ET1FF-C     | FR 0.72 | FR 0.86 | R 0     | R 0     | FR 1
ET1IFF-C    | FR 0.93 | FR 0.94 | R 0.05  | R 0     | FR 1
EDIT2FF-C   | FR 0.91 | FR 0.95 | R 0.03  | R 0     | FR 1
EDIT2IFF-C  | FR 0.86 | FR 0.93 | R 0.01  | R 0     | FR 1

* Indicates the best method based on the AUC performance measure. Shaded cells indicate the best method in the rows that is significantly better than the benchmark method in each intersecting column.
Table 6.59 Overall ranking results on the classification datasets: average ranks (AR), success rate ratios (SRR), significant wins ratios (SWR), and percent improvement ratios (PIR), each followed by its associated rank
            | AR    | Rank | SRR_j | Rank | SWR_j  | Rank | PIR_j  | Rank | Average Rank | Overall Rank
LR          | 11.00 | 11   | 0.98  | 11   | 0.3611 | 9    | -0.018 | 11   | 10.50        | 8
ANFIS       | 11.00 | 12   | 0.96  | 12   | 0.3750 | 8    | -0.056 | 12   | 11.00        | 9
NN          | 10.00 | 10   | 1.04  | 9    | 0.5833 | 3    | 0.038  | 9    | 7.75         | 6
SVC         | 6.00  | 7    | 1.09  | 7    | 0.5972 | 2    | 0.076  | 7    | 5.75         | 4
FKNN        | 13.00 | 13   | 0.60  | 13   | 0.3194 | 11   | -0.366 | 13   | 12.50        | 10
T1FF-C      | 8.00  | 9    | 1.04  | 10   | 0.5556 | 4    | 0.036  | 10   | 8.25         | 7
T1IFF-C     | 3.83  | 2    | 1.11  | 2    | 0.6250 | 1    | 0.094  | 2    | 1.75         | 2
DIT2FF-C*   | 2.83  | 1    | 1.14  | 1    | 0.6250 | 1    | 0.128  | 1    | 1.00         | 1
DIT2IFF-C   | 4.67  | 5    | 1.10  | 3    | 0.5417 | 5    | 0.087  | 5    | 4.50         | 3
ET1FF-C     | 7.00  | 8    | 1.07  | 8    | 0.4583 | 7    | 0.064  | 8    | 7.75         | 6
ET1IFF-C    | 4.00  | 3    | 1.10  | 4    | 0.5000 | 6    | 0.087  | 4    | 4.25         | 3
EDIT2FF-C   | 4.33  | 4    | 1.09  | 5    | 0.5000 | 6    | 0.089  | 3    | 4.50         | 3
EDIT2IFF-C  | 5.33  | 6    | 1.09  | 6    | 0.3472 | 10   | 0.083  | 6    | 7.00         | 5
* The values in bold indicate the optimum model based on each corresponding ranking method, i.e., AR, SRR, SWR, PIR, and Average Rank.
Having displayed the AUC results and the significance tests for the classification datasets, we use ranking methods to show that the proposed algorithms are better than the benchmark models across different datasets. Since one simple ranking method may not be valid for all datasets, we employed several different ranking methods [Brazdil and Soares, 2000]: average ranks (AR), success rate ratios (SRR), significant wins ratios (SWR), and percent improvement ratios (PIR), which are presented at the beginning of this chapter. The results obtained from each ranking method are summarized below. Based on the ranking results in Table 6.59, the best model, ranked first in all rankings, i.e., AR, SRR, SWR, and PIR, is the proposed DIT2FF-C method. A summary of all the rankings is depicted in Figure 6.34. The best algorithm, DIT2FF-C, has the highest overall AUC and is significantly different from the rest of the benchmark methods based on the 5% AUC difference. The rest of the proposed methodologies that are ranked higher than the benchmark methods, T1IFF-C (rank=2), DIT2IFF-C, ET1IFF-C and EDIT2FF-C (rank=3), are not significantly different from the SVM results. The significant wins ratios of the SVM (SWR rank=2) and NN (SWR rank=3) methods are also high, which indicates that the AUC results of these methods are significantly better than those of most of the methodologies. However, neither SVM nor NN is significantly better than the proposed optimum DIT2FF-C method at the 95 percent confidence level, as shown in Table 6.58.
Fig. 6.34 Comparison of system modeling techniques based on the classification ranking results
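As a concrete sketch, the first two ranking schemes can be computed from a score matrix as follows. The AUC scores are invented, and the SRR formula follows Brazdil and Soares [2000] only in spirit, taking the mean pairwise ratio of one method's scores to every other method's scores across the datasets.

```python
import numpy as np
from scipy.stats import rankdata

# Invented AUC scores: rows are methods, columns are datasets.
methods = ["A", "B", "C"]
auc = np.array([[0.96, 0.89, 0.71],
                [0.92, 0.88, 0.66],
                [0.78, 0.86, 0.62]])
n_methods, n_datasets = auc.shape

# Average rank (AR): rank within each dataset (1 = best AUC),
# then average each method's ranks across the datasets.
per_dataset_ranks = np.column_stack(
    [rankdata(-auc[:, d]) for d in range(n_datasets)])
ar = per_dataset_ranks.mean(axis=1)

# Success rate ratio (SRR): average the pairwise score ratios of
# method j against every other method k over all datasets.
srr = np.zeros(n_methods)
for j in range(n_methods):
    ratios = [auc[j, d] / auc[k, d]
              for k in range(n_methods) if k != j
              for d in range(n_datasets)]
    srr[j] = np.mean(ratios)

for name, a, s in zip(methods, ar, srr):
    print(f"{name}: AR={a:.2f}  SRR={s:.2f}")
```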
6.7 Summary of Results and Discussions

In this chapter, the proposed Fuzzy Functions approaches for the regression case studies, i.e., T1FF, DIT2FF, ET1FF, EDIT2FF, T1IFF, DIT2IFF, ET1IFF, EDIT2IFF, and the Fuzzy Classifier Functions approaches, i.e., T1FF-C, DIT2FF-C, ET1FF-C, EDIT2FF-C, T1IFF-C, DIT2IFF-C, ET1IFF-C, EDIT2IFF-C, as well as well-known system
modeling tools (benchmark methods), i.e., Support Vector Machines (SVM) for classification and regression, Neural Networks (NN), Adaptive Network Based Fuzzy Inference Systems (ANFIS), Dynamic Evolving Neuro-Fuzzy Inference Systems (DENFIS), Genetic Fuzzy Systems (GFS), Discrete Interval Valued Type-2 Fuzzy Rule Bases for regression (DIT2FRB), and the Fuzzy K-Nearest Neighbor approach for classification, are applied to various datasets of regression and classification types. The aim is to build system models for regression and classification problem domains and to analyze the proposed approaches in terms of significant improvements in error reduction and robustness. Stock price datasets have a time-series structure, but in this work they are treated similarly to regression problems, keeping the consecutive order of the data points intact; therefore, these datasets are analyzed differently than the rest of the regression domains. It is a rather difficult task to come to conclusions about the performance of the proposed methodologies when there are mixed structured problem domains. Nevertheless, the following generalized list of results is presented to reflect a general summary of the research experiments.

• The Friedman's Artificial dataset is a 'defined' dataset generated by a fixed function. The strength of benchmark methods such as NN and ANFIS on defined datasets such as this one is that they can simulate the observed system almost perfectly. For this reason, the proposed methods' performance is only comparable to these benchmark methods on this dataset. Nevertheless, due to their nature, the proposed methods are more effective in experiments where the system is not defined by given functions.

• In stock price estimation problems, the main objective is to build a decision support system that can predict the next day's stock prices based on the previous period's stock prices and on direction, accuracy and profitability concepts. The input variable types, the performance measure, and the cross validation method used to build models for stock prediction are slightly different from those of the rest of the regression domains. For this study, earlier published work from the literature was used as a guide in constructing the datasets and selecting variables. A new performance measure, the Robust Simulated Trading Benchmark, is presented to evaluate stock price models. The results indicate that the proposed fuzzy systems are robust in estimating stock prices in comparison to the benchmark methods. We demonstrated that the proposed fuzzy system models, i.e., type-1 and type-2 fuzzy functions and improved fuzzy functions, and their evolutionary extensions, are problem domain independent based on the datasets investigated in this work.

• A very important result that could influence machine-learning practices is that the well-known error based performance measures may not be an effective way of comparing stock prediction methodologies. In trading systems, profitability is the key component of a system model. Hence, machine-learning and soft computing methods should evaluate these problem domains using measures such as profitability, not just an error rate. With five different stock prices, we were able to demonstrate this issue using a new conservative robust profit measure. In addition, the new performance measure indicates that much more profitable system models can be built with the proposed fuzzy functions methodologies.
• It is well known that SVM generalization performance (estimation accuracy) depends on a good setting of the meta-parameters, i.e., C-reg, epsilon and the kernel parameters. The problem of optimal parameter selection is further complicated by the fact that SVM model complexity (and hence its generalization performance) depends on all three parameters, as shown in [Cherkassky and Ma, 2004], where functions are presented to capture values of C-reg and epsilon to be used later in SVM models. The problem with these functions is that they are domain dependent and should not be applied to every dataset. We explored different methods to find the optimum parameters. In this work, exhaustive search methods are used to find the optimum values of these parameters. The exhaustive search method is computation intensive, but with a wide range of values the optimum parameters can be found. We also implemented genetic algorithms in the fuzzy functions methods to build a hybrid fuzzy system that identifies the optimum model parameters based on a stochastic search method. The results indicate that the new hybrid systems are effective in reducing the computation time compared to the exhaustive search.

• In the experiments, we also demonstrated that performance measures based on system model error, such as R2, RMSE, and MAPE, are highly correlated, and one could use only one of them in system modeling experiments. Therefore, the performance comparisons of the proposed and benchmark methodologies for the regression and classification datasets, except for the stock price datasets, are based on the R2 and AUC performance measures, respectively. Since these performance measures lie in the unit interval [0,1], it is much easier to evaluate the improvements and robustness of the proposed methodologies in comparison to the benchmark methods and, furthermore, to do cross comparison analyses between dataset domains.
• In general, the following results are obtained from the experiments.
Table 6.60 Summary of Results

                      | Best Optimum Methodology                                                 | Function Type                           | Percent improvement against best optimum benchmark method
Stock Price Models    | EDIT2IFF using Improved Fuzzy Clustering                                 | Linear Regression (LSE)                 | 4.46%-28.19% based on CAD$100 investment
Regression Models     | EDIT2IFF and DIT2FF, using Improved Fuzzy Clustering and FCM clustering  | LSE and Non-Linear SVM                  | 2%-24.7%
Classification Models | DIT2FF and DIT2IFF, using Improved Fuzzy Clustering and FCM clustering   | Logistic Regression and Non-Linear SVC  | 3.3%-33.3%
In summary, the proposed approaches are superior in performance compared to the benchmark methods. The significance of the results (based on statistical tests) also indicates that the results of the corresponding proposed methods are significant overall compared to the benchmark methods. The common structure of the optimum methods in the three different system modeling experiments, as shown in
Table 6.60, is that these optimum models use Type-2 Fuzzy Functions strategies. The evolutionary extension of the Improved Fuzzy Functions methods is more effective in regression problem domains, whereas Fuzzy Functions methods that apply the well-known Fuzzy C-Means clustering are more effective in classification problems. This is due to the structure of the semi-parametric case-based inference approach, which can be explained as follows. During the proposed inference algorithm of the Improved Fuzzy Functions approaches, e.g., type-1 and type-2, for classification problem domains, we applied case based reasoning to find the nearest training vectors to the testing data samples, whose class labels are not known. The distances are measured with the Euclidean distance measure. The issue in the classification datasets of this work is that the output values of the training vectors are class labels {0,1}, whereas the rest of the variables have continuous domains. We treated the class labels as normalized scalar values as well; therefore, we lost some information, which affected the performance of the proposed Improved Fuzzy Functions methods. Since the output values are scalar in the regression experiments, we did not have to deal with this problem there. A possible solution is a mixed type distance measure, which combines distance measures for scalar values with distance measures for nominal or binary values (a sketch of such a measure is given at the end of this chapter). Applying the proposed improved fuzzy functions inference mechanism to classification problem domains in this way is left for future work. In conclusion, the results indicate that in regression problem domains the fuzzy functions are more effective when the proposed Improved Fuzzy Clustering is used to identify hidden structures instead of the well-known Fuzzy C-Means clustering method.

• Cluster validity indexes. In Chapter 3, a new fuzzy clustering method, improved fuzzy clustering (IFC), is introduced for regression and classification problem domains. Along with IFC, two new cluster validity measures are introduced to validate the results of IFC and to capture the optimum number of clusters. Throughout the experiments, we used a heuristic search method to optimize the number of clusters, except in the proposed approaches that implement evolutionary methods to optimize the number of clusters automatically. We tested the performance of the proposed cluster validity methods by comparing their results with those of the heuristic search. The comparative results of the regression and classification experiments indicate that the proposed validity measures are effective in identifying a range of optimum numbers of clusters to search within.

• Linear versus non-linear fuzzy function approximators. One of the noticeable results of the experiments is that in the stock price estimation models, the best fuzzy functions are optimized with linear regression functions. The optimization methods reveal that non-linear methods such as SVM, which implement linear and non-linear kernel functions, are not the optimum models for this domain. This is due to the structure of the stock price datasets: the input variables are derived from the previous closing values of the stocks, which have a linear relationship with the current closing value, the output variable. The fuzzy function models identified the LSE method as the optimum fuzzy function approximator. The linear SVM methods did not perform as well as the fuzzy functions
methods. The evolutionary extensions of the fuzzy functions approaches also identified the LSE method as the optimum function approximation for shaping the fuzzy functions. For the rest of the experiments, a mixture of fuzzy function types is identified as the optimum function estimation methods. Nonetheless, one does not need to implement SVM to approximate fuzzy functions. Simpler linear regressions such as least squares regression can be applied in the proposed fuzzy functions approaches, and yet better performances can be obtained compared to the rest of the benchmark methodologies of this work. One of the advantages of using a simpler regression method in fuzzy functions is the time it takes to optimize the method: in particular, when the LSE is used instead of SVM regression, the algorithm converges much faster.

• Optimality of genetic algorithms. It is well known that the optimality of genetic algorithms may differ between runs; i.e., when the number of iterations is low, the algorithm may get stuck at local minima. To analyze the optimality, all the methodologies that utilize genetic algorithms to optimize parameters, fuzzy functions or rules were rerun with different numbers of iterations. The corresponding algorithms performed better in reducing the error when the number of iterations was large. When the number of iterations was less than 30, the genetic algorithm could not always find a good solution compared to runs where the number of iterations was set between 50 and 100; more than 100 iterations did not change the results. Hence, the optimum number of iterations was found to be around 50-100. We also changed the population size between runs to observe the change in performance. The default population size is set to 50 in the genetic algorithm toolbox in MATLAB, viz., the GAOT algorithm [Houck et al., 1995]. When the population size was around 50-100, the algorithm always found the optimum model with the best performance; when the population size was below 50, it could not always find the optimum model; and when the population size was increased to more than 100, the performance did not change. The issue with the population size in genetic algorithms is that, as its value increases, the time it takes to find the optimum models also increases. In the experiments, it was found that the population size should not exceed 100.

• Significance tests on the performance improvement of the proposed methodologies. In order to show the significance of the results, we not only presented individual significance tests for every dataset, but also calculated the overall significance of the methods on general dataset structures, such as the regression and classification experiments. The hypothesis in the significance tests concerns the performance improvement of a methodology against other methodologies based on a threshold value. We assumed that in the regression experiments, if the R2 value of an optimum model of the proposed methodology is higher than that of any benchmark methodology by [2.5%-5%] in the cross validation experiments with 95% confidence, we can say that the proposed method is significantly better. We have found that in the regression and classification problem domains, the proposed methods are significantly better than almost all of the benchmark methods.
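As referenced in the discussion of the case-based inference step above, a mixed-type distance measure would combine a range-normalized difference for continuous features with a simple mismatch count for binary ones. A Gower-style sketch of such a measure is given below; the feature vectors and ranges are invented for illustration, and this is one possible formulation rather than a measure used in the experiments.

```python
import numpy as np

def mixed_distance(a, b, binary_mask, ranges):
    """Gower-style distance: range-normalized absolute difference for
    continuous features, simple mismatch (0/1) for binary features."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    per_feature = np.where(binary_mask,
                           a != b,                  # mismatch for binary features
                           np.abs(a - b) / ranges)  # normalized for continuous
    return per_feature.mean()

# Two hypothetical feature vectors: three continuous inputs plus a
# binary class label treated as a nominal feature.
x = [0.40, 12.0, 3.1, 1]
y = [0.55, 10.5, 2.9, 0]
binary_mask = np.array([False, False, False, True])
ranges = np.array([1.0, 20.0, 5.0, 1.0])  # feature ranges (binary slot unused)
print(mixed_distance(x, y, binary_mask, ranges))
```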
Chapter 7
Conclusions and Future Work
This chapter concludes the book by surveying related work and describing future research.
7.1 General Conclusions

The main goal of this book was to formulate alternative methodologies to fuzzy rule bases in order to build fuzzy systems. The new fuzzy system applies Improved Fuzzy Functions strategies, which are more robust and efficient in terms of reducing error and automatically identifying uncertainties and optimum parameters. In particular, the Improved Fuzzy Functions of this book can shape the membership values to improve the performance of local functions using evolutionary algorithms. In doing so, new methodologies are presented, including: two new fuzzy clustering algorithms along with two new clustering validation methods, a new inference methodology, uncertainty modeling of imprecise model parameters, and global search methods based on genetic algorithms to build an algorithm that automatically optimizes these parameters. The current literature suggests that the interval of the fuzziness parameter of the fuzzy clustering algorithm represents some part of the uncertainty interval of fuzzy systems [Uncu et al., 2004; Ozkan and Turksen, 2007]. The proposed algorithm applies a search method to identify the uncertainty interval not only of the fuzziness parameter of the fuzzy clustering algorithm, but also of the structure of the local fuzzy functions. In the experiments section, the performance of the proposed fuzzy systems is established with comparative experiments against benchmark methodologies. The effectiveness of the proposed fuzzy functions methods is demonstrated in comparison to fuzzy rule bases, based on the modeling error and the number of fuzzy operators required. In addition, the effectiveness of the proposed methods is compared with well-known statistical system modeling tools and machine learning methods, based on modeling error and on the elapsed times of inference, to evaluate whether the proposed approaches are comparable or even better. In fuzzy system modeling algorithms, finding hidden structures is one of the most important steps. The relationships between the input variables and the output variable are identified separately within these hidden patterns, which we call the local models. The proposed fuzzy clustering algorithm can identify hidden structures in the given domain. It has a hybrid structure motivated by two different clustering methodologies. The first is the most commonly used clustering algorithm, the Fuzzy C-Means (FCM) clustering method [Bezdek, 1981], which has been used in many different variations of fuzzy system models to identify hidden patterns. The second method, which inspired the proposed fuzzy clustering
algorithm, is the fuzzy c-regression clustering method (FCRM) proposed by Hathaway and Bezdek [1993]. A prominent feature of this clustering algorithm that separates it from other point-wise clustering algorithms, e.g., the FCM clustering method, is that the cluster prototypes are functions instead of geometrical objects. The FCRM clustering algorithm can be used to separate patterns where each pattern can be identified by a linear or non-linear function. The proposed new Improved Fuzzy Clustering (IFC) method combines these two clustering methodologies in a novel way to improve the structure identification process. It optimizes the membership values so that they can be used to identify the fuzzy functions that explain the dependent variable in the local models. With the new IFC, linear or non-linear local relationships can be identified. The objective of the new fuzzy clustering methods is two-fold: (i) to identify the hidden structures in a given system, and (ii) to improve the effectiveness of the membership values by identifying membership functions that can predict the variability of the output variable. Hence, the new IFC improves the effectiveness of system models by producing enhanced parameters to explain the relationships between inputs and outputs in local models. Two different types of the new IFC method are presented for system modeling with the fuzzy functions approach. The first improved fuzzy clustering algorithm is a supervised clustering method, applied to find the local structures in regression type datasets, where the output variable has a continuous domain. While searching for the hidden structures in the given system, IFC identifies the membership values more effectively using a new membership value calculation equation. What separates the new fuzzy clustering method from earlier fuzzy clustering methods for regression problem domains is that the membership value calculation equation produces improved membership values that explain the output variable more effectively. At each step of the algorithm, interim fuzzy functions are identified for each cluster using the membership values and their user-defined transformations as input variables, excluding the original input variables. In this way, the individual effect of the new membership values can be evaluated within the clustering method. These functions, called the "interim fuzzy functions", are linear or non-linear regression functions; mostly, a simple least squares estimation is used. More sophisticated regression methods, such as support vector machines (SVM), are also used to compare the results with those of the simple regression methods, i.e., LSE. The IFC method can be computationally expensive depending on the type of regression function implemented, yet one can achieve better predictions. In conclusion, the new IFC method can produce membership values that are more effective in explaining the output variable than those of the commonly used FCM clustering method. The FCM clustering algorithm is used to compare the effectiveness of the IFC method; however, it was not possible to compare the results of the proposed clustering method with the FCRM clustering method, since reasoning has not been clearly defined for those methods. The second type of improved fuzzy clustering method is for classification type problem domains (IFC-C), and is also a supervised clustering algorithm. This book only deals with binary output classification datasets. The new IFC-C produces improved
membership values that help to obtain improved classification functions to separate the two classes. As in many other fuzzy clustering methods, a few initialization parameters should be set by the domain expert prior to the execution of the new fuzzy clustering method. One of these parameters is the number of clusters, and a common way of predicting the number of clusters is to apply a suitable validity measure that can evaluate the objective of the fuzzy clustering algorithm. The new fuzzy clustering algorithms have the unique feature of shaping membership values into good predictors in regression or classification problems. Thus, two new cluster validity measures are presented to identify the number of clusters for the proposed IFC and IFC-C methods. To demonstrate the effectiveness of the proposed validity measures, three well-known cluster validity indices that have similar objectives are used to evaluate the results of the proposed IFC methods. Later, in the experiments, we demonstrated the effectiveness of the new cluster validity measures by comparing them with exhaustive search results. It is shown that the new cluster validity measures can capture the number of hidden structures easily, without having to execute an exhaustive search. After identifying the structure of the system model and significant membership values, the next step is to identify the inference parameters of the system models. In this book, type-1 and type-2 Improved Fuzzy Functions are proposed to identify the inference parameters. Evolutionary algorithms are then implemented in each of these methods to build hybrid fuzzy systems that improve the efficiency of the models in terms of error reduction. The evolutionary extensions require fewer iterations to converge compared to the exhaustive search. The inference method of the proposed type-2 improved fuzzy functions is based on the proposed type-1 improved fuzzy functions approaches. Similarly, the inference parameters of the proposed evolutionary type-2 improved fuzzy functions approaches are based on the proposed type-2 improved fuzzy functions approaches. Thus, the conclusions for the proposed type-1 improved fuzzy functions and evolutionary type-1 improved fuzzy functions approaches are presented together in the next paragraph.
The proposed improved fuzzy functions approaches are extensions of the ones proposed by Delgado et al. [1997] and those proposed by Turksen [2008] and Turksen and Celikyilmaz [2006]. In these methods, the fuzzy c-means (FCM) clustering algorithm [Bezdek, 1981] is employed to identify the fuzzy system model structure, and an exhaustive search method is used to identify the optimal number of rules (c*), the optimal degree of fuzziness (m*), and the optimal fuzzy function types and structures. In this book, the new Improved Fuzzy Clustering algorithm is applied to obtain improved membership values that help to improve the efficiency of the local fuzzy models. To make the proposed system-modeling tools versatile, an extension of the improved fuzzy functions method that enables building system models for classification datasets is presented as well. In the earlier fuzzy functions approaches, the membership values obtained from the FCM clustering algorithm become the predictors of the local functions. The unique property of the fuzzy functions approaches of this book is that the membership values obtained from the improved fuzzy clustering algorithm and their transformations are used as additional predictors in identifying the local functions, which are called the improved fuzzy functions. In addition to the two parameters, m* and c*, the proposed structure identification also requires the improved fuzzy function types and structures to be defined. Two different strategies are implemented to identify these parameters. The first is the type-1 improved fuzzy functions (T1IFF) method, which uses an exhaustive search to identify the inference model parameters. The second is the evolutionary type-1 improved fuzzy functions (ET1IFF) method, which uses genetic algorithms to optimize the system parameters. ET1IFF is computationally inexpensive, since it requires fewer optimization steps than the T1IFF approach; thus, one can easily reduce the exponentially growing search space to a manageable size with the ET1IFF methods. With ET1IFF, the inference parameters are identified automatically, given the boundaries of the parameters. The proposed type-2 structure identification methods are extensions of the proposed type-1 improved fuzzy functions methods and of the type-2 methods proposed by Uncu and Turksen [2007]. The major difference between the proposed type-1 and type-2 improved fuzzy functions is that, in the latter, the structure of the fuzzy functions and the fuzziness parameter, m*, which is defined by the fuzzy clustering parameters, do not take single crisp values. Similarly, the idea of the case-based semi-parametric type-reduction of this book is similar to Uncu and Turksen's [2007] type-2 methodology; however, they only focus on the fuzziness parameter, m*, to search for the optimum type-1 fuzzy rule base structures. Furthermore, their approach is based on fuzzy rule bases, which are structurally different from the fuzzy functions methods [Turksen, 2008; Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007a,b], and they only use first order linear models to build Takagi-Sugeno type rule bases. In fuzzy functions, one can use simple linear as well as more sophisticated non-linear regression methods in separate local functions within the same model. In the proposed type-2 approach, one defines a range of values for the fuzzy functions and the boundaries of the fuzziness parameter to identify various type-1 fuzzy function models. Hence, the optimum solution does not contain one single type-1 model, but a range of embedded models with different fuzzy function structures and fuzziness values. In other words, the collection of models identified during structure identification is used to characterize the imprecision in the inference parameters of the type-2 models. In order to conduct this, an additional step, the case based type-reduction method, is employed during inference. With this step, the type-2 fuzzy functions are reduced to type-1 fuzzy functions at the beginning of the inference process by selecting the best type-1 improved fuzzy functions. With the new type-reduction, for each new observation the optimum parameters of the nearest training case are obtained from collection tables. The fuzzy function structure and m parameters associated with the selected training vectors are then used to execute the type-1 improved fuzzy function inference method, reducing the order from type-2 to type-1. The rest of the inference method is the same as in the proposed type-1 improved fuzzy functions structures. A unique feature of the type-2 improved fuzzy functions methods compared to the earlier methods is that the new structure identification enables the employment of different function structures in the local
structures, in which linear or non-linear functions can be used. In short, each new observation to be modeled may use a different optimum fuzziness value and fuzzy function for each identified structure in the system. Thus, the proposed type-2 fuzzy system models increase the efficiency of capturing uncertainty in local structures. Compared to the earlier type-2 fuzzy rule base approaches, the type-2 fuzzy system models are found to be more effective and robust in the experimental analysis. Similar to the type-1 improved fuzzy functions methods, the type-2 improved fuzzy function parameters are optimized by implementing two separate methods: (i) exhaustive search, which is inefficient in terms of computation time, yet guarantees convergence to the global optimum given a wide range of values for the parameters, and (ii) evolutionary computation methods using genetic algorithms to capture the uncertainty interval of the inference parameters. The second search method is efficient in terms of processing times; however, the global optimum can be achieved only if the right number of parameters and genetic operations are used. The proposed algorithms were applied to well-known benchmark and real life datasets of regression and classification types. In order to assess the success of the proposed algorithms, the same datasets were used to build models with several benchmark system-modeling tools. The regression datasets include two benchmark datasets and a real life desulphurization dataset with two outputs. A fourth category of regression datasets comprises five different real stock prices of Canadian stocks. Stock price estimation is one of the challenging areas for soft computing and statistical techniques. A new performance measure for stock price datasets is implemented to compare the performances of the different methodologies based on the profitability of the models. A three-way sub-sampling cross validation method is used to evaluate the performance of each modeling tool. To present the success of the proposed Improved Fuzzy Clustering method within the fuzzy functions methods, we compared the results with standard FCM clustering. The proposed evolutionary type-2 improved fuzzy functions methodologies outperformed the benchmark system modeling methods in the regression experiments and the stock price prediction experiments. It should be noted that the proposed type-2 fuzzy functions methods were the best methods in the classification problem domains, closely followed by the type-1 improved fuzzy functions approaches. In conclusion, the improved fuzzy functions methodologies, which apply improved fuzzy clustering methods, are beneficial in terms of increasing prediction performance. Additionally, since the results of the proposed type-1 improved fuzzy functions methods were shown to be better than those obtained from the benchmark tools, the proposed type-1 improved fuzzy functions methods can be used as an alternative method. In the experiments, the fuzzy functions of the local models are identified by applying two different types of regression methods, LSE and SVM. The LSE represents a simpler linear model, which can converge fast but, due to normality assumptions, may not identify outliers well, whereas SVM is a more powerful regression method for explaining the non-linear relations between inputs and output variables that may exist in the datasets under study. When the fuzzy functions are constructed using the SVM method, better results are obtained in some cases. In
other cases, fuzzy functions with LSE captured the best models. The merit of the proposed approaches is that domain experts can implement regression or classification methods other than LSE or SVM to identify the fuzzy function parameters, e.g., weighted least squares or ridge regression for regression problems, naïve Bayes for classification, etc. Consequently, the proposed approaches are versatile, and such plug-ins can be developed with little effort.
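As an illustration of this plug-in idea, the following minimal Python sketch shows how interchangeable local regressors could be wired into a fuzzy-functions-style predictor. The function names and the membership-weighted combination of the local outputs are illustrative assumptions, not the exact implementation used in the experiments; ordinary least squares serves as the default local model.

```python
import numpy as np

def fit_fuzzy_functions(X, y, U, fit_local=None):
    """Fit one local function per cluster (illustrative sketch).

    X: (n, d) inputs; y: (n,) targets; U: (n, c) membership values from a
    fuzzy clustering step. fit_local: any routine taking (X_aug, y) and
    returning a predict(X_aug) callable; defaults to least squares (LSE).
    """
    if fit_local is None:
        def fit_local(X_aug, targets):
            w, *_ = np.linalg.lstsq(X_aug, targets, rcond=None)
            return lambda Z, w=w: Z @ w   # bind this cluster's weights
    models = []
    for i in range(U.shape[1]):
        # Membership values of cluster i augment the inputs, as in the
        # fuzzy functions approach; a bias column is added as well.
        X_aug = np.column_stack([np.ones(len(X)), U[:, i], X])
        models.append(fit_local(X_aug, y))
    return models

def predict_fuzzy_functions(X, U, models):
    # Combine the local model outputs, weighted by the membership values.
    n = len(X)
    out = np.zeros(n)
    for i, model in enumerate(models):
        X_aug = np.column_stack([np.ones(n), U[:, i], X])
        out += U[:, i] * model(X_aug)
    return out / U.sum(axis=1)
```

Swapping in another local learner only requires passing a different `fit_local`, which is the versatility referred to above.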
7.2 Future Work

Some parts of this research have been left for future study. These can be listed as follows:
• Originally, the improved fuzzy clustering algorithms were developed for system domains of the regression type, mostly because the system problem domains we dealt with comprised regression-type datasets. Lately, extensions of the proposed algorithms have been implemented for classification problem domains. Even though the case-based reasoning structure worked effectively in regression problem domains, it did not work as well for the classification domains. The overall results of the classification models demonstrate that, in terms of increasing the accuracy of the prediction models, the fuzzy classification functions that apply the improved fuzzy clustering methods are not as good as the ones that apply the FCM clustering algorithm. This is mostly because the structure of the inference algorithm was specifically designed for regression domains. The experiments indicate that extending the existing methods to classification problem domains has not given outstanding results. To remedy this, an effective inference algorithm that implements a weighting scheme to find the membership values of new observations should be investigated. This issue has been discussed with fellow researchers during one of the conference presentations.
• The fuzzy functions map each input domain of each cluster onto a higher dimension using the membership values and then identify the relationships in this new space. The current algorithms assume that all the initially selected variables are important. A better method, however, would be to identify the most important variables for each local model separately after the membership values are introduced to the dataset. Using an evolutionary method, the importance of each input variable and of the membership values can be determined for each local model. This approach is already in its development stage.
• In this book, in order to capture the uncertainties that might exist in systems, interval-valued type-2 fuzzy system modeling approaches are implemented. In these methods, interval type-2 fuzzy sets are used. These fuzzy sets identify a footprint of uncertainty, which is defined by identifying the upper and lower bounds of the membership functions and blurring the interval. A further step towards higher-order fuzzy sets is using full
type-2 fuzzy sets, in which one can define the fuzziness of each value within the uncertainty interval. The secondary membership values in full type-2 fuzzy sets are defined along a third dimension, expressing the strength of each individual membership value within the interval corresponding to a particular object. With the implementation of full type-2 fuzzy sets into the fuzzy functions approaches, we can model highly uncertain systems with noise, missing values, or measurement inconsistencies. Our initial findings on full type-2 fuzzy sets in clustering algorithms are promising [Celikyilmaz and Turksen, 2008a].
References
1. Abe, S.: Fuzzy Function Approximators with Ellipsoidal Regions. IEEE Trans. on Systems, Man, and Cybernetics – Part B 29(4) (1999)
2. Allison, P.D.: Logistic Regression Using the SAS System: Theory and Application. Wiley-SAS (2001)
3. Babuška, R., Verbruggen, H.B.: Constructing Fuzzy Models by Product Space Clustering. In: Hellendoorn, H., Driankov, D. (eds.) Fuzzy Model Identification: Selected Approaches, pp. 53–90. Springer, Berlin (1997)
4. Bezdek, J.C.: Cluster Validity with Fuzzy Sets. J. Cybernet. 3, 58–72 (1976)
5. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981a)
6. Bezdek, J.C., Coray, C., Gunderson, R., Watson, J.: Detection and characterization of cluster substructure, I. Linear structure: fuzzy c-lines. SIAM J. Appl. Math. 40(2), 339–357 (1981b)
7. Bezdek, J.C., Coray, C., Gunderson, R., Watson, J.: Detection and characterization of cluster substructure, II. Fuzzy c-varieties and convex combinations. SIAM J. Appl. Math. 40(2), 348–372 (1981c)
8. Bi, J.: Support Vector Regression with Applications in Automated Drug Discovery. Ph.D. Thesis, Rensselaer Polytechnic Institute, Troy, New York (2003)
9. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
10. Brazdil, P.B., Soares, C.: A Comparison of Ranking Methods for Classification Algorithm Selection. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS, vol. 1810, pp. 63–74. Springer, Heidelberg (2000)
11. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, Belmont (1984)
12. Bodur, M., Acan, A., Unveren, A.: Reduction of Generalization Error in Fuzzy System Models. In: IEEE Int. Conf. on Fuzzy Systems, pp. 2184–2189 (2006)
13. Bouguessa, M., Wang, S., Sun, H.: An objective approach to cluster validation. Pattern Recognition Letters 27, 1419–1430 (2006)
14. Camaro, H.A., Pires, M.G., Castro, P.A.D.: Genetic Design of Fuzzy Knowledge Bases – A Study of Different Approaches. In: Proceedings of NAFIPS, Alberta, Canada, vol. 2, pp. 954–959 (2004)
15. Cao, Q., Leggio, K.B., Schniederjans, M.J.: A comparison between Fama and French's model and artificial neural networks in predicting the Chinese stock market. Computers and Operations Research 32, 2499–2512 (2005)
16. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. Software (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
17. Cherkassky, V., Ma, Y.: Practical Selection of SVM Parameters and Noise Estimation for SVM Regression. Neural Networks 17, 113–126 (2004)
18. Chiu, S.: Fuzzy Model Identification Based on Cluster Estimation. Jrn. of Intelligent and Fuzzy Systems 2(3) (1994)
19. Chen, J.-Q., Xi, Y.-G., Zhang, Z.-J.: A clustering algorithm for fuzzy model identification. Fuzzy Sets and Systems 98, 319–329 (1998)
20. Coupland, S., John, R.: Geometric Type-1 and Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 15, 3–15 (2007)
21. Celikyilmaz, A.: Fuzzy Functions with Support Vector Machines. M.A.Sc. Thesis, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario, Canada (2005)
22. Celikyilmaz, A., Turksen, I.B.: Fuzzy Functions with Support Vector Machines. Information Sciences 177, 5163–5177 (2007a)
23. Celikyilmaz, A., Turksen, I.B.: Enhanced Fuzzy System Models with Improved Fuzzy Clustering Algorithm. IEEE Transactions on Fuzzy Systems, 905–919 (2007b), doi:10.1109/TFUZZ.2007
24. Celikyilmaz, A., Turksen, I.B.: A New Cluster Validity Index with Fuzzy Functions. In: 12th International Fuzzy Systems Association World Congress, IFSA 2007, Cancun, Mexico, June 18-21, 2007. Advances in Soft Computing, vol. 41, pp. 821–830. Springer, Heidelberg (2007c)
25. Çelikyılmaz, A., Türkşen, I.B.: Evolution of Fuzzy System Models: An Overview and New Directions. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds.) RSFDGrC 2007. LNCS (LNAI), vol. 4482, pp. 119–126. Springer, Heidelberg (2007)
26. Çelikyılmaz, A., Türkşen, I.B., Aktaş, R., Doğanay, M.M., Ceylan, N.B.: A New Classifier Design with Fuzzy Functions. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds.) RSFDGrC 2007. LNCS (LNAI), vol. 4482, pp. 136–143. Springer, Heidelberg (2007)
27. Celikyilmaz, A., Turksen, I.B.: Enhanced Type-2 Fuzzy System Models with Improved Fuzzy Functions. In: Proc. 25th Intern. Conf. of the North American Fuzzy Information Processing Society – NAFIPS, in IEEE Proceedings (Best Student Paper Award), June 24-27, 2007, pp. 140–145 (2007)
28. Celikyilmaz, A., Turksen, I.B.: Improved Interval Valued Fuzzy Reasoning with Evolutionary Computing. In: 12th International Conference on Fuzzy Theory & Technology (JCIS 2007), USA, July 18-24, pp. 1238–1244. World Scientific Publishing, Singapore (2007g)
29. Celikyilmaz, A., Turksen, I.B.: Uncertainty modeling with evolutionary improved fuzzy functions approach. IEEE Transactions on Systems, Man, and Cybernetics – SMCB (under review) (2007h)
30. Celikyilmaz, A., Turksen, I.B.: Increasing Accuracy of Two-Class Pattern Recognition with Enhanced Fuzzy Functions. Expert Systems with Applications (2007i), doi:10.1016/j.eswa.2007.11.039
31. Celikyilmaz, A., Turksen, I.B.: Evolutionary Fuzzy System Models with Improved Fuzzy Functions and Its Application to Industrial Process. IEEE-SMC, Montreal (2007k)
32. Celikyilmaz, A., Turksen, I.B.: A Type-2 Fuzzy C-Regression Method. In: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), Malaga, Spain (2008a)
33. Celikyilmaz, A., Turksen, I.B.: Uncertainty Bounds of Fuzzy C-Regression Method. In: IEEE World Congress on Computational Intelligence (WCCI 2008), Hong Kong (2008b)
34. Celikyilmaz, A., Turksen, I.B.: Validation Criteria for Enhanced Fuzzy Clustering. Pattern Recognition Letters 29, 97–108 (2008c)
35. Celikyilmaz, A., Turksen, I.B.: Enhanced Type-2 Fuzzy System Models with Improved Fuzzy Functions. International Journal of Approximate Reasoning (accepted, 2008d)
36. Celikyilmaz, A., Turksen, I.B.: Industrial Applications of Evolutionary Improved Fuzzy Functions. Journal of Computers (accepted, 2008e), paper ID: JCP 12583
37. Celikyilmaz, A., Turksen, I.B.: Optimization of Hybrid Fuzzy Clustering Parameters with Genetic Algorithms. Pattern Recognition (submitted, 2008f)
38. Celikyilmaz, A., Turksen, I.B.: A Stock Trading Decision Support System Using Novel Interval Valued Type-2 Fuzzy Functions. Fuzzy Sets and Systems (submitted, 2008g)
39. Cococcioni, M., Ducange, P., Lazzerini, B., Marcelloni, F.: Evolutionary Multi-Objective Optimization of Fuzzy Rule-Based Classifiers in the ROC Space. In: IEEE Fuzzy Systems Conference (2007), doi:10.1109/FUZZY.2007.4295465
40. Cordon, O., Herrera, F., Hoffmann, F., Magdalena, L.: Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. Advances in Fuzzy Systems – Applications and Theory, vol. 19. World Scientific, Singapore (2001)
41. Cordon, O., Gomide, F., Herrera, F., Hoffmann, F., Magdalena, L.: Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets and Systems 141, 5–31 (2004)
42. Dave, R.N.: Characterization and detection of noise in clustering. Pattern Recognition Letters 12, 657–664 (1991)
43. Dave, R.N.: Validating fuzzy partitions obtained through c-shells clustering. Pattern Recognition Letters 17, 613–623 (1996)
44. Deboeck, G.: Pre-processing and Evolution of Neural Nets for Trading Stocks. Advanced Technology for Developers (August 1992)
45. Delgado, M.R., Gómez-Skarmeta, A.F., Martin, F.: Rapid Prototyping for Fuzzy Models. In: Hellendoorn, H., Driankov, D. (eds.) Fuzzy Model Identification: Selected Approaches, pp. 121–161. Springer, Berlin (1997)
46. Demirci, M.: Fuzzy functions and their fundamental properties. Fuzzy Sets and Systems 106(2), 239–246 (1999)
47. Dubois, D., Prade, H.: Operations in a fuzzy-valued logic. Information and Control 43, 224–240 (1979)
48. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Natural Computing Series. Springer, Heidelberg (2003)
49. Emami, M.R., Turksen, I.B., Goldenberg, A.A.: Development of a Systematic Methodology of Fuzzy Logic Modeling. IEEE Transactions on Fuzzy Systems 6(3), 346–361 (1998)
50. Friedman, J.H.: Multivariate adaptive regression splines. The Annals of Statistics 19, 1–141 (1991)
51. Fukuyama, Y., Sugeno, M.: A new method of choosing the number of clusters for the fuzzy c-means method. In: Proc. 5th Fuzzy Systems Symposium, pp. 247–250 (1989)
52. Gath, I., Geva, A.B.: Unsupervised optimal fuzzy clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 773–781 (1989)
53. Garibaldi, J., Ozen, T.: Uncertain Fuzzy Reasoning: A Case Study in Modelling Expert Decision Making. IEEE Transactions on Fuzzy Systems 15(1) (February 2007)
54. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
55. Gunn, S.: Support Vector Machines for Classification and Regression. ISIS Technical Report (1998)
56. Gustafson, D., Kessel, W.: Fuzzy clustering with a fuzzy covariance matrix. In: Proc. IEEE CDC, San Diego, USA, pp. 761–766 (1979)
57. Hathaway, R.J., Bezdek, J.C.: Switching regression models and fuzzy clustering. IEEE Transactions on Fuzzy Systems 1(3), 195–204 (1993)
58. Hellstrom, T., Holmstrom, K.: Predicting the Stock Market. Technical Report Series IMa-TOM-1997-07, Malardalen University (1998)
59. Holland, J.H.: Genetic Algorithms and the Optimal Allocation of Trials. SIAM J. of Computing 2, 88–105 (1973)
60. Houck, C., Joines, J., Kay, M.: A Genetic Algorithm for Function Optimization: A Matlab Implementation. NCSU-IE TR 95-09 (1995)
61. Huang, J., Ling, C.X.: Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering 17(3) (2005)
62. Hwang, C., Rhee, F.C.-H.: Uncertain Fuzzy Clustering: Interval Type-2 Fuzzy Approach to C-Means. IEEE Trans. on Fuzzy Systems 15(1), 107–120 (2007)
63. Höppner, F., Klawonn, F., Kruse, R.: Fuzzy Cluster Analysis. John Wiley & Sons, Chichester (1999)
64. Höppner, F., Klawonn, F.: Obtaining Interpretable Fuzzy Models from Fuzzy Clustering and Fuzzy Regression. In: 4th Int. Conf. Knowledge-Based Intelligent Eng. Systems & Allied Tech., UK, pp. 162–165 (2000)
65. Höppner, F., Klawonn, F.: Improved fuzzy partitions for fuzzy regression models. Int. Jrnl. of Approximate Reasoning 32, 85–102 (2003)
66. Ishibuchi, H., Nakashima, T., Kuroda, T.: A hybrid fuzzy genetics-based machine learning algorithm: Hybridization of Michigan approach and Pittsburgh approach. In: Proc. IEEE Int. Conf. Syst., Man, Cybern., October 1999, pp. 29–301 (1999)
67. Ince, H., Trafalis, T.: Kernel Principal Component Analysis and Support Vector Machines for Stock Price Prediction. In: IEEE Intern. Joint Conf. on Neural Networks, July 2004, vol. 3, pp. 2053–2058 (2004)
68. Jang, J.-S.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. on Systems, Man and Cybernetics 23(3), 665–685 (1993)
69. Kandel, A.: On the minimization of incompletely specified fuzzy functions. Information and Control, 141–153 (1974)
70. Kandel, A.: A Note on the Simplification of Fuzzy Switching Functions. Information Sciences 13, 91–94 (1977)
71. Karnik, N.N., Mendel, J.M., Liang, Q.: Type-2 Fuzzy Logic Systems. IEEE Transactions on Fuzzy Systems 7(6) (1999)
72. Kasabov, N.K., Song, Q.: DENFIS: Dynamic Evolving Neural-Fuzzy Inference System and Its Application for Time-Series Prediction. IEEE Trans. on Fuzzy Systems 10(2), 144–154 (2002)
73. Kecman, V.: Learning and Soft Computing: Support Vector Machines, Neural Networks and Fuzzy Logic Models. MIT Press, Cambridge (2001)
74. Keller, J.M., Gray, M.R., Givens, J.A.: A Fuzzy K-Nearest Neighbor Algorithm. IEEE Transactions on Systems, Man, and Cybernetics 15(4), 580–585 (1985)
75. Khuri, A.I.: Advanced Calculus with Applications in Statistics. Wiley Interscience, Hoboken (2003)
76. Kilic, K., Turksen, I.B., Sproule, B.A., Naranjo, C.A.: A K-Nearest Neighbourhood Based Fuzzy Reasoning Schema. In: 10th IEEE International Conference on Fuzzy Systems, Melbourne, Australia, pp. 236–239 (2001)
77. Kilic, K.: A Proposed Fuzzy System Modeling Algorithm with an Application in Pharmacokinetic Modeling. Ph.D. Thesis, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto (2002)
78. Kim, E., Park, M., Kim, S., Park, M.: A Transformed Input-Domain Approach to Fuzzy Modeling. IEEE Transactions on Fuzzy Systems 6(4) (1998)
79. Kim, M., Ramakrishna, R.S.: New indices for cluster validity assessment. Pattern Recognition Letters 26, 2353–2363 (2005)
80. Kim, D.-W., Lee, K.H., Lee, D.: Fuzzy cluster validation index based on inter-cluster proximity. Pattern Recognition Letters 24, 2561–2574 (2003)
81. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, USA (1995)
82. Klir, G.J., Folger, T.A.: Fuzzy Sets, Uncertainty and Information. Prentice Hall, Englewood Cliffs (1988)
83. Klir, G.J., Wierman, M.J.: Uncertainty-Based Information. Physica-Verlag, Heidelberg (1998)
84. Kosko, B.: Neural Networks and Fuzzy Systems. Prentice Hall, Englewood Cliffs (1992)
85. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems 1(2), 98–110 (1993)
86. Kung, C.-C., Lin, C.-C.: A New Cluster Validity Criterion for Fuzzy C-Regression Model and Its Application to T-S Fuzzy Model Identification. In: Proc. IEEE Intern. Conf. on Fuzzy Systems, vol. 3, pp. 1673–1678 (2004)
87. Leigh, W., Hightower, R., Modani, N.: Forecasting the New York Stock Exchange composite index with past price and interest rate on condition of volume spike. Expert Systems with Applications 28, 1–8 (2005)
88. Leski, J.: ε-Insensitive Fuzzy c-Regression Models: Introduction to ε-Insensitive Fuzzy Modeling. IEEE Trans. Systems, Man, and Cybernetics – Part B 34(1), 4–15 (2004)
89. Liang, Q., Mendel, J.M.: Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Trans. on Fuzzy Systems 8(5), 535–550 (2000)
90. Lin, H.-T., Lin, C.-J., Weng, R.C.: A note on Platt's probabilistic outputs for support vector machines. Technical report (2003)
91. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face recognition using feature optimization and nu-support vector learning. In: Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, Falmouth, MA, USA, pp. 373–382 (2001)
92. The MathWorks Inc.: Fuzzy Logic Toolbox User's Guide (2002)
93. Mamdani, E.H., Assilian, S.: An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller. Int. J. Man-Machine Studies 7, 1–13 (1974)
94. Menard, M.: Fuzzy Clustering and Switching Regression Models Using Ambiguity and Distance Rejects. Fuzzy Sets and Systems 122, 363–399 (2001)
95. Marinos, P.N.: Fuzzy Logic and Its Application to Switching Systems. IEEE Trans. on Computers C-18(4) (1969)
96. Mendel, J.M., Karnik, N.N.: Introduction to Type-2 Fuzzy Logic Systems. In: IEEE Syst., Man, Cybern. Conf. (1998)
97. Mendel, J.M., Liang, Q.: Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Transactions on Fuzzy Systems 8(5) (2000)
98. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice Hall, Upper Saddle River (2001)
99. Mendel, J.M., John, R.I., Liu, F.: Interval Type-2 Fuzzy Logic Systems Made Simple. IEEE Trans. Fuzzy Systems 14(6), 808–821 (2006)
100. Mendel, J.M.: Fuzzy sets for words: a new beginning. In: Proc. of IEEE Int'l. Conf. on Fuzzy Systems, St. Louis, MO, pp. 37–42 (2003)
101. Meyer, M., Vlachos, P.: StatLib. Department of Statistics, Carnegie Mellon University, http://lib.stat.cmu.edu
102. Mild, A., Natter, M.: A critical view on recommendation systems. Working Paper Series (82) (2001)
103. Mizumoto, M.: Method of fuzzy inference suitable for fuzzy control. J. Soc. Instrument Control Engineering 58, 959–963 (1989)
104. Mizumoto, M., Tanaka, K.: Some Properties of Fuzzy Sets of Type-2. Information and Control 31, 312–340 (1976)
105. Muller, K.R., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Predicting time series with support vector machines. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, p. 999. Springer, Heidelberg (1997)
106. Murphy, J.: Technical Analysis of the Financial Markets. New York Institute of Finance, NY (1999)
107. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Comp. Sci., University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
108. NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
109. Oh, S.-K., Pedrycz, W., Roh, S.-B.: Genetically optimized fuzzy polynomial neural networks with fuzzy set-based polynomial neurons. Information Sciences 176, 3490–3519 (2006)
110. Ozkan, I., Turksen, I.B.: Entropy Assessment for Type-2 Fuzziness. In: IEEE International Conf. on Fuzzy Systems, Budapest, Hungary, July 25-29 (2004)
111. Ozkan, I., Turksen, I.B.: Upper and lower level of fuzziness of FCM. Information Sciences Special Issue 177(23), 5143–5152 (2007)
112. Pal, N.K., Bezdek, J.C.: On Cluster Validity for the Fuzzy C-Means Model. IEEE Trans. Fuzzy Systems 3(3), 370–379 (1995)
113. Pedrycz, W.: Fuzzy clustering with knowledge-based guidance. Pattern Recognition Letters 25(4), 469–480 (2004)
114. Pedrycz, W., Reformat, M.: Evolutionary Fuzzy Modeling. IEEE Trans. on Fuzzy Systems 11(5), 652–665 (2003)
115. Platt, J.: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. MIT Press, Cambridge (2000)
116. Rowland, J.J.: Generalisation and model selection in supervised learning with evolutionary computation. In: Raidl, G.R., Cagnoni, S., Cardalda, J.J.R., Corne, D.W., Gottlieb, J., Guillot, A., Hart, E., Johnson, C.G., Marchiori, E., Meyer, J.-A., Middendorf, M. (eds.) EvoIASP 2003, EvoWorkshops 2003, EvoSTIM 2003, EvoROB/EvoRobot 2003, EvoCOP 2003, EvoBIO 2003, and EvoMUSART 2003. LNCS, vol. 2611, pp. 119–130. Springer, Heidelberg (2003)
117. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann Machines for Collaborative Filtering. In: Proc. of the 24th International Conference on Machine Learning, Corvallis, OR (2007)
118. Siy, P., Chen, C.S.: Minimization of fuzzy functions. IEEE Trans. Comput. 32(1), 100–102 (1972)
119. Smola, A.J.: Regression estimation with support vector learning machines. Master's thesis, Technische Universität München (1996)
120. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics and Computing 14, 199–222 (2004)
121. Sugeno, M., Kang, G.: Structure Identification of Fuzzy Model. Fuzzy Sets and Systems 26(1), 15–33 (1988)
122. Swets, J.A.: Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates, Mahwah (1995)
123. Takagi, T., Sugeno, M.: Fuzzy Identification of Systems and Its Applications to Modeling and Control. IEEE Transactions on Systems, Man and Cybernetics SMC-15(1), 116–132 (1985)
124. Tettamanzi, A., Tomassini, M.: Soft Computing: Integrating Evolutionary, Neural, and Fuzzy Systems. Springer, Heidelberg (2001)
125. Turksen, I.B.: Interval valued fuzzy sets based on normal forms. Fuzzy Sets and Systems 20(2), 191–210 (1986)
126. Turksen, I.B.: Four Methods of Approximate Reasoning with Interval-Valued Fuzzy Sets. Int. J. of Approximate Reasoning 3, 121–142 (1989)
127. Turksen, I.B.: Measurement of Membership Functions and Their Acquisition. Int. Journal of Fuzzy Sets and Systems (Special Memorial Issue) 40(1), 5–38 (1991)
128. Turksen, I.B.: Fuzzy Normal Forms. Fuzzy Sets and Systems 69, 319–346 (1995)
129. Turksen, I.B.: Non-specificity and Interval-Valued Fuzzy Sets. Fuzzy Sets and Systems 80, 87–100 (1996)
130. Turksen, I.B.: Type-1 and Type-2 fuzzy system modeling. Fuzzy Sets and Systems 106, 11–34 (1999)
131. Turksen, I.B.: Type-2 uncertainty in knowledge representation and reasoning. In: Proc. of Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, BC, July 2001, pp. 1914–1919 (2001)
132. Turksen, I.B.: Type-2 representation and reasoning for CWW. Fuzzy Sets and Systems 127, 17–36 (2002)
133. Turksen, I.B.: An Ontological and Epistemological Perspective of Fuzzy Set Theory. Elsevier, The Netherlands (2006)
134. Turksen, I.B., Celikyilmaz, A.: Comparison of Fuzzy Functions with Fuzzy Rule Base Approaches. International Journal of Fuzzy Systems 8(3), 137–149 (2006)
135. Turksen, I.B.: Meta-linguistic axioms as a foundation for computing with words. Information Sciences 172(2), 332–359 (2007)
136. Turksen, I.B.: Fuzzy Functions with LSE. Applied Soft Computing (2008), doi:10.1016/j.asoc.2007.12.004
137. Uncu, O.: Type-2 Fuzzy System Models with Type-1 Inference. Ph.D. Thesis, University of Toronto (2003)
138. Uncu, Ö., Kilic, K., Turksen, I.B.: A New Fuzzy Inference Approach Based on Mamdani Inference Using Discrete Type-2 Fuzzy Sets. In: IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October 10-13 (2004a)
139. Uncu, Ö., Kilic, K., Turksen, I.B.: A New Fuzzy Inference Approach Based on Mamdani Inference Using Discrete Type-2 Fuzzy Sets. In: IEEE Inter. Conf. on Sys., Man and Cyber., The Hague, Netherlands (October 2004b)
140. Uncu, O., Turksen, I.B.: Discrete Interval Valued Type-2 Fuzzy System Models Using Uncertainty in Learning Parameters. IEEE Transactions on Fuzzy Systems 15(1), 90–106 (2007)
141. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
142. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
143. Vergara, V., Moraga, C.: Optimization of Fuzzy Models by Global Numeric Optimization. In: Hellendoorn, H., Driankov, D. (eds.) Fuzzy Model Identification: Selected Approaches, pp. 252–278. Springer, Berlin (1997)
144. Wang, S.-T., Jiang, H.-F., Lu, H.J.: An Integrated Fuzzy Clustering Algorithm GFC for Switching Regression. International Jrn. of Pattern Recognition and Artificial Intelligence 16(4), 433–446 (2002)
145. Wang, H., Kwong, S., Jin, Y., Wei, W., Man, K.F.: Agent-Based Evolutionary Approach for Interpretable Rule-Based Knowledge Extraction. IEEE Trans. Syst., Man, Cybern. Part C 35(2), 143–155 (2005)
146. Wang, L.-X., Mendel, J.M.: Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Networks 3, 807–814 (1992)
147. Xie, X.L., Beni, G.: A Validity Measure for Fuzzy Clustering. IEEE Trans. Pattern Analysis and Machine Intelligence 13(8), 841–847 (1991)
148. Yu, J., Cheng, Q., Huang, H.: Analysis of the weighting exponent in the FCM. IEEE Transactions on Systems, Man and Cybernetics – Part B 34, 634–639 (2004)
149. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
150. Zadeh, L.A.: The Concept of a Linguistic Variable and its Application to Approximate Reasoning-I. Information Sciences 8, 199–249 (1975a)
151. Zadeh, L.A.: Calculus of fuzzy restrictions. In: Fuzzy Sets and Their Applications to Cognitive and Decision Processes, pp. 1–40. Academic Press, London (1975b)
152. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
153. Zadeh, L.A.: Fuzzy Logic = Computing with Words. IEEE Transactions on Fuzzy Systems 4(2), 103–111 (1996)
154. Zadeh, L.A.: From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. In: Wang, P.P. (ed.) Computing with Words. Wiley Series on Intelligent Systems, pp. 35–68. Wiley and Sons, New York (2001)
155. Zarandi, M.H.F., Turksen, I.B., Razaee, B.: A systematic approach to fuzzy modeling for rule generation from numerical data. In: IEEE Annual Meeting of the Fuzzy Information, Proc. NAFIPS 2004, pp. 768–773 (2004)
156. Ziwei, X.: On the Representation and Enumeration of Fuzzy Switching Functions. Information and Control 51(3), 216–226 (1981)
Appendix A
A.1 Set and Logic Theory – Additional Information

Connective
A function, or the symbol representing a function, which corresponds to conjunctions such as "and," "or," "not," etc., and which takes one or more truth values as input and returns a single truth value as output. The following table summarizes some common connectives and their notations.

Connective      Symbol(s)
AND             A∧B, A⋅B, AB, A&B, A&&B
Equivalent      A≡B, A⇔B
Implies         A⇒B, A→B
NAND            ¬(A⋅B)
Nonequivalent   A≢B
NOR             ¬(A∨B)
NOT             !A, Ā, ~A
OR              A∨B
XNOR            A XNOR B
XOR             A⊕B
Implies
"Implies" is a connective between two propositions. It states: "if A is true, then B is also true." In formal terminology, the term conditional is often used to refer to this connective. The symbol used to denote "implies" is A⇒B or A→B. A⇒B has the following truth table:

Table A.1 The Truth Table of A⇒B
A   B   A⇒B
T   T   T
T   F   F
F   T   T
F   F   T
If A⇒B and B⇒A (i.e., (A⇒B)∧(B⇒A)), then A and B are said to be equivalent.
And, Or, Not
"AND" is the connective in logic which yields true if all conditions are true, and false if any condition is false. A AND B is denoted as A∧B. The binary AND operator has the following truth table:

Table A.2 The Truth Table of A∧B
A   B   A∧B
T   T   T
T   F   F
F   T   F
F   F   F
OR is a connective in logic which yields true if any one of a sequence of conditions is true, and false if all conditions are false. In formal logic, the term disjunction (or, more specifically, inclusive disjunction) is commonly used to describe the OR operator. A OR B is denoted A∨B. The binary OR operator has the following truth table:
Table A.3 The Truth Table of A∨B
A   B   A∨B
T   T   T
T   F   T
F   T   T
F   F   F
NOT is a connective in logic which converts true to false and false to true. NOT A is denoted as !A.
A.2 Fuzzy Relations (Composition) – An Example

The composition of fuzzy relations is easier to visualize if the relations are mapped graphically. Let X={a,b,c}, Y={1,2,3,4}, and Z={Δ,∇,Ω}. Fuzzy relations R and S can be defined on X×Y and Y×Z, respectively, as shown in Figure A.1. The composition of the relations, R∘S, between b and Δ can then be computed as follows:

$$(R \circ S)(b,\Delta) = \max\Big\{\min\big[R(b,1),\, S(1,\Delta)\big],\; \min\big[R(b,2),\, S(2,\Delta)\big],\; \min\big[R(b,3),\, S(3,\Delta)\big]\Big\}$$
$$= \max\big\{\min[0.2,\, 0.5],\; \min[0.9,\, 0.3],\; \min[0.7,\, 1]\big\} = \max\{0.2,\, 0.3,\, 0.7\} = 0.7$$
B.1 Proof of Fuzzy c-Means Clustering Algorithm
323
Fig. A.1 Composition of binary fuzzy relations R and S (weighted bipartite graphs of R on X×Y and S on Y×Z)
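The max-min composition above can also be checked numerically. In the sketch below, only the entries used in the worked example (the b-row of R and the Δ-column of S) are taken from the text; the remaining entries are illustrative placeholders read loosely from Figure A.1.

```python
import numpy as np

# Fuzzy relations R on X x Y (rows a, b, c) and S on Y x Z (columns
# Delta, Nabla, Omega). Only the b-row of R and the Delta-column of S
# are fixed by the worked example; other entries are placeholders.
R = np.array([[0.7, 0.5, 0.0, 0.0],
              [0.2, 0.9, 0.7, 0.0],   # R(b,1)=0.2, R(b,2)=0.9, R(b,3)=0.7
              [0.0, 0.0, 0.6, 0.4]])
S = np.array([[0.5, 0.8, 0.0],        # S(1,Delta)=0.5
              [0.3, 0.0, 1.0],        # S(2,Delta)=0.3
              [1.0, 0.0, 0.0],        # S(3,Delta)=1
              [0.0, 0.0, 1.0]])

def max_min_composition(R, S):
    # (R o S)(x, z) = max over y of min(R(x, y), S(y, z))
    return np.max(np.minimum(R[:, :, None], S[None, :, :]), axis=1)

T = max_min_composition(R, S)
print(T[1, 0])   # (R o S)(b, Delta) -> 0.7, matching the worked example
```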
B.1 Proof of Fuzzy c-Means Clustering Algorithm

The fuzzy c-means clustering (FCM) algorithm [Bezdek, 1981a] solves a constrained optimization problem. A common way to solve constrained optimization problems is the Lagrange multiplier method [Khuri, 2003], by which the constrained problem is converted into an unconstrained optimization model with a single objective function. The primal constrained optimization problem is first converted into an equivalent unconstrained problem with the help of unspecified parameters known as Lagrange multipliers, λ, such that, for each pattern,

$$\max\; W(M,V) = \sum_{k=1}^{n}\sum_{i=1}^{c} (\mu_{ik})^m \,\|x_k - v_i\|_A^2 \;-\; \lambda\left(\sum_{i=1}^{c} \mu_{ik} - 1\right) \tag{B.1}$$
is satisfied. According to the Lagrange method, the Lagrange function must be minimized with respect to the primal parameters and maximized with respect to the dual parameters. Accordingly, the derivatives of the Lagrange function with respect to the original model parameters, M and V, should vanish. The primal optimization problem of the FCM algorithm is given as:
$$\begin{aligned}
\min\; J(M,V) &= \sum_{k=1}^{n}\sum_{i=1}^{c} (\mu_{ik})^m \,\|x_k - v_i\|_A^2\\
\text{s.t.}\quad & 0 \le \mu_{ik} \le 1,\; \forall i,k\\
& \sum_{i=1}^{c} \mu_{ik} = 1,\; \forall k\\
& 0 \le \sum_{k=1}^{n} \mu_{ik} \le n,\; \forall i
\end{aligned} \tag{B.2}$$
J is the objective function to be minimized. One of the most prominent ways to solve this problem is to construct a Lagrange function, which allows optimization problems to be solved without explicitly solving the constraints and using them to eliminate extra variables. The method requires a set of necessary conditions that identify the optimum points of equality-constrained optimization problems. In order to obtain an equality-constrained problem, the primal optimization problem is first converted into an equivalent unconstrained problem with the help of parameters known as Lagrange multipliers, denoted by λ.
Primal Problem

min f(x_1, x_2, …, x_nv) subject to h_kq(x_1, x_2, …, x_nv) = 0, q = 1,…,nq (number of constraints), k = 1,…,n (number of input vectors), where f(x_1, x_2, …, x_nv) is the objective function to minimize and the h_kq are the constraints of the optimization problem. The Lagrange methodology is employed as follows. First, the constrained primal minimization problem is converted into a Lagrange maximization function:

$$\max\; L(x,\lambda) = f(x) - \sum_{q=1}^{n_q}\sum_{k=1}^{n} \lambda_{kq}\, h_{kq}(x) \tag{B.3}$$
where L(x, λ) is the Lagrange function to minimize, λ_kq is the Lagrange multiplier of the kth observation in the qth constraint, f(x) is the primal objective function, and h_kq(x) is the qth constraint for the kth data vector in the primal model. The Lagrange method states that the derivatives of the Lagrange function with respect to the unknown parameters, i.e., μ and v, should vanish. The application of the Lagrange method to the FCM optimization problem proceeds as follows:

(i) The objective function is converted into a Lagrange function for each kth vector (pattern) as:

$$W(M,V) = \sum_{k=1}^{n}\sum_{i=1}^{c} (\mu_{ik})^m \,\|x_k - v_i\|_A^2 - \lambda\left(\sum_{i=1}^{c}\mu_{ik} - 1\right) \tag{B.4}$$

(ii) The derivatives of the objective function with respect to the primal variables, i.e., μ_ik and v_i, must be equal to zero:

$$\frac{\partial W}{\partial \mu} = m\,\|x_k - v_i\|_A^2 \sum_{i=1}^{c}(\mu_{ik})^{m-1} - \lambda\, c = 0 \tag{B.5}$$
(iii) From the constraints of the minimization problem in (B.2), for any pattern the membership degrees over the clusters must sum to 1. Therefore, equation (B.5) can be restated as:

$$m\,\|x_k - v_i\|_A^2\,(\mu_{ik})^{m-1} - \lambda = 0 \quad\Rightarrow\quad (\mu_{ik})^{m-1} = \frac{\lambda}{m}\,\|x_k - v_i\|_A^{-2} \tag{B.6}$$
(iv) The membership value is calculated as follows:

$$\mu_{ik} = \left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} \|x_k - v_i\|_A^{\frac{-2}{m-1}} \tag{B.7}$$
(v) Based on the constraint $\sum_{j=1}^{c}\mu_{jk} = 1,\ \forall k$ in (B.2):

$$\sum_{j=1}^{c}\mu_{jk} = \left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} \sum_{j=1}^{c} \|x_k - v_j\|_A^{\frac{-2}{m-1}} = 1 \tag{B.8}$$
(vi) From (B.8), the Lagrange multiplier, λ, can be extracted as follows:

$$\left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} = \frac{1}{\displaystyle\sum_{j=1}^{c} \frac{1}{\|x_k - v_j\|_A^{\frac{2}{m-1}}}} \tag{B.9}$$
(vii) Next, equation (B.9) is inserted into (B.7) to obtain the membership function:

$$\mu_{ik} = \frac{1}{\displaystyle\sum_{j=1}^{c} \frac{1}{\|x_k - v_j\|_A^{\frac{2}{m-1}}}} \cdot \frac{1}{\|x_k - v_i\|_A^{\frac{2}{m-1}}} \tag{B.10}$$

$$\mu_{ik} = \left(\sum_{j=1}^{c}\left(\frac{\|x_k - v_i\|_A}{\|x_k - v_j\|_A}\right)^{\frac{2}{m-1}}\right)^{-1} \tag{B.11}$$
The distance function ‖·‖ between two vectors can be rewritten in quadratic form as:

$$\|x_k - v_i\|^2 = \left(\sqrt{(x_k - v_i)^T (x_k - v_i)}\right)^2 = x_k^T x_k - 2\,x_k^T v_i + v_i^T v_i \tag{B.12}$$
According to the Lagrange method, the derivative of the objective function with respect to the cluster centers, v_i, must also be equal to zero. Therefore:

$$\frac{\partial W}{\partial v} = \frac{\partial}{\partial v}\left(\sum_{k=1}^{n}\sum_{i=1}^{c}(\mu_{ik})^m \left(x_k^2 - 2x_k v_i + v_i^2\right) - \lambda\left(\sum_{i=1}^{c}\mu_{ik}-1\right)\right) \tag{B.13}$$

$$\frac{\partial W}{\partial v} = (-2x_k + 2v_i)\sum_{k=1}^{n}(\mu_{ik})^m = 0 \quad\Rightarrow\quad v_i = \frac{\sum_{k=1}^{n}(\mu_{ik})^m\, x_k}{\sum_{k=1}^{n}(\mu_{ik})^m} \tag{B.14}$$
B.2 Proof of Improved Fuzzy Clustering Algorithm

The application of the Lagrange method to the improved fuzzy clustering (IFC) is as follows:

(i) The objective function is converted into a Lagrange function for each kth vector (pattern) as:

$$L = \underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^m\, d_{ik}^2}_{\text{FCM}} \;+\; \underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^m \big(y_k - h_i(\tau_{ik}, \hat{w}_i)\big)^2}_{\text{SE of Fuzzy Function}} \;-\; \lambda\left(\left(\sum_{i=1}^{c}\mu_{ik}\right) - 1\right) \tag{B.15}$$
(ii) The derivatives of the objective function with respect to the primal variables, i.e., μ_ik and v_i, must be equal to zero:

$$\partial_{\mu_{ik}} J_m^{IFC} = m\,\mu_{ik}^{m-1}\, d_{ik}^2 + m\,\mu_{ik}^{m-1}\, SE_{ik} - \lambda = 0 \tag{B.16}$$
(iii) From (B.16) we obtain:

$$\mu_{ik} = \left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} \left[d_{ik}^2 + SE_{ik}\right]^{-\frac{1}{m-1}} \tag{B.17}$$
Combining (B.16) and (B.17), we get:

$$\sum_{i=1}^{c}\mu_{ik} = \sum_{j=1}^{c}\left[\frac{\lambda/m}{d_{jk}^2 + SE_{jk}}\right]^{\frac{1}{m-1}} = 1 \tag{B.18}$$
$$\left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} \sum_{j=1}^{c}\left[\frac{1}{d_{jk}^2 + SE_{jk}}\right]^{\frac{1}{m-1}} = 1 \tag{B.19}$$

$$\left(\frac{\lambda}{m}\right)^{\frac{1}{m-1}} = \frac{1}{\displaystyle\sum_{j=1}^{c}\left[\frac{1}{d_{jk}^2 + SE_{jk}}\right]^{\frac{1}{m-1}}} \tag{B.20}$$

$$\mu_{ik} = \frac{1}{\left[d_{ik}^2 + SE_{ik}\right]^{\frac{1}{m-1}}} \cdot \frac{1}{\displaystyle\sum_{j=1}^{c}\left[\frac{1}{d_{jk}^2 + SE_{jk}}\right]^{\frac{1}{m-1}}} \tag{B.21}$$

Therefore, the new membership function is formulated as follows:

$$\mu_{ik} = \left(\sum_{j=1}^{c}\left(\frac{d_{ik}^2 + \big(y_k - h_i(\tau_{ik}, \hat{w}_i)\big)^2}{d_{jk}^2 + \big(y_k - h_j(\tau_{jk}, \hat{w}_j)\big)^2}\right)^{\frac{1}{m-1}}\right)^{-1}, \qquad 1 \le i,j \le c,\; 1 \le k \le n \tag{B.22}$$
C.1 Artificial Neural Networks (ANNs)

There are different types of artificial neural networks (ANN) in use today. This book implements only supervised-learning ANNs, as a benchmarking tool for solving non-linear classification and regression tasks by learning from given datasets. An ANN structure is composed of many computing units called neurons [Kecman, 2001]. Neurons are generally organized into layers in which all neurons possess an activation function. Given a continuous function $f:[0,1]^n \to \Re^m$, $f(x)=y$, f can be implemented by a network of input and output layers. The strength of the connection, or link, between two neurons is called the weight. Weights have physical meanings that are not usually easy to interpret; their geometrical interpretation, however, is much easier to grasp: basically, the weights define the shape of the basis functions in the neural network. The adjustable weights of the hidden and output layers are organized in weight matrices V and W, where V represents the weights of the hidden layers and W represents the weight matrix of the output layer. When there is only one output neuron, the W matrix is a column vector denoted by w. The non-linear activation functions in the hidden-layer neurons enable ANNs to be universal approximators, viz., any continuous function can be implemented by them. The input layer is not a processing unit; rather, it is an input vector, eventually augmented with a bias term, whose components are fed to the next layer, the hidden or output layers. Output-layer neurons may be linear for regression-type problems, or may use sigmoid activation functions, e.g., logistic functions, for classification-type
problems. The graphical representation of the approximation schema of the multi-layer perceptrons in Figure C.1 is formulized by
o ( x,V ,w,b ) = F ( x,V ,w,b ) = ∑ j =1 w jσ j ( vTj x + b j ) J
V
W
y1
w1
x1
y2
vki
w2
xk xn
(C.23)
O
vkn wj
OUTPUT LAYER
xn+1 =+1 yj INPUT LAYER +1
Fig. C.1 A Neural Network with an input layer, hidden layer and one-output neurons
In (C.23), σ_j stands for a sigmoid activation function, e.g., a logistic function, and J corresponds to the number of hidden-layer neurons. Here, the output depends on the weights contained in V, w, and b. The input vector x, the bias weight vector b, the hidden-layer weight matrix V, and the output-layer weight vector w are described as follows:
$$x = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^T, \qquad b = \begin{bmatrix} b_1 & b_2 & \cdots & b_J \end{bmatrix}^T$$

$$V = \begin{bmatrix} v_{11} & \cdots & v_{1j} & \cdots & v_{1n_v}\\ \vdots & & \vdots & & \vdots\\ v_{k1} & \cdots & v_{kj} & \cdots & v_{kn_v}\\ \vdots & & \vdots & & \vdots\\ v_{n1} & \cdots & v_{nj} & \cdots & v_{nn_v} \end{bmatrix}, \qquad w = \begin{bmatrix} w_1 & w_2 & \cdots & w_J & w_{J+1} \end{bmatrix}^T \tag{C.24}$$
In (C.24), k = 1,…,n represents the number of data points and j = 1,…,nv represents the number of input variables. Feed-forward networks have the following characteristics:

1. Perceptrons are arranged in layers, with the first layer taking in inputs and the last layer producing outputs. The middle layers have no connection with the external world and hence are called hidden layers.
2. Each perceptron in one layer is connected to every perceptron in the next layer. Information is thus constantly "fed forward" from one layer to the next, which explains why these networks are called feed-forward networks.
3. There is no connection among perceptrons in the same layer.
For a given number of hidden layers, a given number of neurons in each layer, and given transfer functions for each neuron, the weights between the artificial neurons will be the inference parameters of the neural network model. The biggest advantage of neural networks is their learning and generalization capability (i.e., neural networks can learn, recall, and generalize a system from a given training dataset). It has been shown that a multi-layer neural network is a universal approximator; mathematically speaking, this means that a multi-layer neural network can approximate any continuous non-linear function, provided that there is a sufficient number of hidden units in the hidden layers. However, the correct (sufficient) size of the neural network, the number of hidden layers, and the number of neurons cannot be known a priori. If the size of the neural network is increased excessively, the neural network can overfit the training dataset and lose its generalization capability.
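The forward pass of equation (C.23) for a single-hidden-layer network can be sketched as follows; the logistic activation and the random weights in the usage example are illustrative assumptions.

```python
import numpy as np

def mlp_forward(x, V, b, w):
    """Single-hidden-layer forward pass of equation (C.23):
    o(x) = sum_j w_j * sigma(v_j^T x + b_j), with a logistic sigma.
    V: (J, n) hidden weights, b: (J,) biases, w: (J,) output weights."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic activation
    hidden = sigma(V @ x + b)                    # J hidden activations
    return w @ hidden                            # linear output neuron

# Tiny usage example with random weights (illustrative values only)
rng = np.random.default_rng(0)
x = rng.random(4)
V, b, w = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=3)
print(mlp_forward(x, V, b, w))
```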
C.2 Support Vector Machines

The foundations of Support Vector Machines (SVM) were presented by Vapnik in 1995. SVM is known for its good performance in classification problems where the output variable takes binary or ordinal values. It has been applied to a number of applications ranging from face identification [Lu et al., 2001] and text categorization to group similar documents, to bioinformatics [Bi, 2003] and time series analysis [Muller et al., 1997]. The SVM method has also been extended to solve regression-type problems, called support vector regression (SVR) [Smola, 1996; Smola and Schölkopf, 2004; Drucker et al., 1997; Gunn, 1998]. One of the scopes of this book is to utilize the SVM method as an approximator to identify the fuzzy function parameters. Hence, the structure of SVM for classification (SVC) and for regression models (SVR) is presented next.

SVM is a type of optimization technique in which the prediction error and the model complexity are simultaneously minimized [Vapnik, 1998]. The idea of SVM is as follows. In order to study the problem of learning, a structure identification is needed that generalizes to unseen data points. Given some new data sample (observation), we want to predict its output label in the sense that it is similar to the training samples; hence, we seek a similarity measure in X. In SVM, the dot product, i.e., ⟨x_k, x_l⟩ = Σ_j (x_jk · x_jl), a common type of similarity measure, is used to describe the similarity between data points. The dot product is defined between vectors, and the formula extends directly to vectors in three-dimensional space and higher. In order to be able to use the dot product as a similarity measure, the input patterns are represented as vectors in some dot product space, denoted by H. Φ represents the
function that maps the input data into the dot product space, Φ: X → H. The space H is called the feature space. When dot products are used, the objective function is a hyper-plane in the feature space.

Linear SVM for Classification – SVC

Consider a classification problem where the aim is to separate two classes using the dataset D = {(x_1, y_1), …, (x_n, y_n)}, where x ∈ ℜ^n are the inputs and y ∈ {−1, 1} is the output variable. A hyper-plane separating the two classes is defined as ⟨w, x⟩ + b = 0. When the distance between the closest vectors to the hyper-plane is maximal and the two classes are well separated, the vectors are said to be optimally separated. When there is an optimal separation, Vapnik [1995] suggests using a canonical hyper-plane, which is defined as:

$$\min_k\; |\langle w, x_k\rangle + b| = 1 \tag{C.25}$$
In (C.25), w represents the weight vector perpendicular to the hyper-plane, and b is the offset. In other words, the norm of the weight vector should be equal to the inverse of the distance of the nearest point to the hyper-plane. So, for points on the negative side of the hyper-plane, e.g., y = −1, the classifier should return a negative output, and for positive points it should return a positive output, as shown in Figure C.2. This yields the following constraint:

$$y_k\left(\langle w, x_k\rangle + b\right) \ge 1, \qquad k = 1,\dots,n \tag{C.26}$$
The distance d(w, b; x) of a point x from the hyper-plane (w, b) is

$$d(w, b; x) = \frac{|\langle w, x\rangle + b|}{\|w\|} \tag{C.27}$$

Fig. C.2 Separating hyperplane with relaxed constraints
The optimal hyper-plane is given by maximizing the margin [Gunn, 1998]:

$$\rho(w, b) = \min_{x_k:\, y_k=-1} \frac{|\langle w, x_k\rangle + b|}{\|w\|} \;+\; \min_{x_k:\, y_k=1} \frac{|\langle w, x_k\rangle + b|}{\|w\|} = \frac{2}{\|w\|} \tag{C.28}$$

According to (C.28), the hyper-plane that optimally separates the data is the one that minimizes

$$\min\; \tfrac{1}{2}\|w\|^2 \tag{C.29}$$
In order to avoid infeasible solutions, the algorithm should allow some error, e.g., introduce some relaxation into the constraints, so that up to a certain error limit ξ_k, k = 1,…,n, the algorithm should minimize

$$\min\; \tfrac{1}{2}\|w\|^2 + C_{reg}\sum_{k}\xi_k \tag{C.30}$$

subject to the constraints

$$y_k\left(\langle w, x_k\rangle + b\right) \ge 1 - \xi_k, \qquad \xi_k \ge 0, \qquad k = 1,\dots,n \tag{C.31}$$
In (C.30), C_reg is a user-defined variable. Equations (C.30) and (C.31) represent the primal minimization model of the SVC method. The mathematical procedure to convert the primal minimization model of SVC into the dual maximization model using Lagrange multipliers is explained next. The solution to the optimization problem of (C.30) under the constraints of (C.31) is given by the saddle point of the Lagrange equation

$$L(w, b, \alpha, \xi, \beta) = \tfrac{1}{2}\|w\|^2 + C_{reg}\sum_{k=1}^{n}\xi_k - \sum_{k=1}^{n}\beta_k\left(y_k\left[\langle w, x_k\rangle + b\right] - 1 + \xi_k\right) - \sum_{k=1}^{n}\alpha_k\,\xi_k \tag{C.32}$$
In (C.32),α, β are Lagrange multipliers. Lagrange function has to be minimized with respect to w, b and ξ and maximized with respect toα,β. The derivative yields:
∂L n = 0 ⇒ ∑ k =1 β k yk = 0 ∂b ∂L n = 0 ⇒ w = ∑ k =1 β k yk xk = 0 ∂w ∂L = 0 ⇒ α k + β k = Creg ∂ξ
(C.33)
Using the definitions of (C.33), the solution to the classification problem is given by:

$$\arg\min_{\beta}\; \tfrac{1}{2}\sum_{l=1}^{n}\sum_{k=1}^{n}\beta_l\,\beta_k\, y_k\, y_l\, \langle x_k, x_l\rangle - \sum_{k=1}^{n}\beta_k$$
$$\text{s.t.}\quad 0 \le \beta_k \le C_{reg},\; k = 1,\dots,n, \qquad \sum_{k=1}^{n}\beta_k y_k = 0. \tag{C.34}$$
The parameter C_reg is a regularization parameter, which reflects the noise content of the dataset.
Kernels

The support vector classification (SVC) methods construct linear models in a dot product space. In order to represent non-linear functions in an arbitrary number of dimensions efficiently, one needs a mapping function, denoted by φ, from the input space into a higher-dimensional space, in which a linear classification function is then constructed. This can be achieved with different kernel functions. The kernel concept is best explained with a small example. Let a mapping function non-linearly map a two-dimensional vector into a three-dimensional vector as follows:

$$\varphi: \Re^2 \to \Re^3, \qquad (x_1, x_2) \mapsto \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right) \tag{C.35}$$
The mapping function φ is executed before all other steps of the classification method. Therefore, all the data vectors are first converted into a higher-dimensional space, namely the feature space, using the mapping function. Hence, the dot products ⟨x, x′⟩ are first exchanged for ⟨φ(x), φ(x′)⟩, and the dot product calculation then takes place in the new dimension, as follows:

$$\langle x, x'\rangle \to \big\langle (x_1, x_2),\, (x_1', x_2')\big\rangle \to \big\langle (x_1^2, \sqrt{2}x_1x_2, x_2^2),\; (x_1'^2, \sqrt{2}x_1'x_2', x_2'^2)\big\rangle$$
$$= x_1^2 x_1'^2 + 2x_1x_2x_1'x_2' + x_2^2 x_2'^2 = (x_1x_1')^2 + 2x_1x_2x_1'x_2' + (x_2x_2')^2 = (x_1x_1' + x_2x_2')^2 = \langle x, x'\rangle^2 \tag{C.36}$$
In (C.36) above, the subscripts correspond to the components of x ∈ ℜ², x = (x_1, x_2). Note that even though we performed an explicit mapping, turning the two-dimensional vectors into three-dimensional vectors and taking the square of the vectors, the result shows that we could equally carry the vectors into a feature space, i.e., a dot product space, and then do the square operation there. This explicit mapping process is time consuming, and when the vectors have high dimensions, e.g., when the input space X consists of images of 16×16 pixels, which makes 256-dimensional vectors, it would be practically impossible to map them into the feature space explicitly.
Kernel Trick

A kernel function is used to perform an implicit mapping, i.e., K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩, instead of the explicit mapping with φ(·) shown in equation (C.36). This implicit mapping is called the kernel trick. The kernel trick means taking the original algorithm and formulating it in such a way that φ(x) is used only inside dot products. Thus, SVM learns a non-linear function in the original space using a kernel function, which simulates the dot products in the (high-dimensional) feature space. The dot product demonstration in (C.36) is an example of a polynomial kernel: note that the desired kernel K is simply the square of the dot product in the input space. The same concept works not only for second-order but also for dth-order polynomials.

Polynomial Kernels: Define φ_d to map x ∈ ℜ^d to the vector φ_d(x) whose entries are all possible dth-degree ordered products of the entries of x. Then the corresponding kernel, which computes the dot products of vectors mapped by φ_d, is:
$$K(x_1, x_2) = \langle \varphi_d(x_1), \varphi_d(x_2)\rangle = \langle x_1, x_2\rangle^d \tag{C.37}$$
where φ_d denotes the feature map. Proof: This can be computed directly as:

$$\big\langle \Phi_d(x_1), \Phi_d(x_2)\big\rangle = \sum_{j_1=1}^{d}\cdots\sum_{j_d=1}^{d} [x_1]_{j_1}\cdots[x_1]_{j_d}\cdot[x_2]_{j_1}\cdots[x_2]_{j_d} = \left(\sum_{j=1}^{d}[x_1]_j\,[x_2]_j\right)^{d} = \langle x_1, x_2\rangle^d \tag{C.38}$$
Radial Basis Functions (RBF) as Non-Linear Kernel Functions

A Gaussian RBF is a specific type of RBF, represented by the following function:
$$K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right) \tag{C.39}$$
Here x_1 and x_2 are two input vectors from the same dataset, ‖·‖ represents the Euclidean norm, and σ is the standard deviation, decided heuristically or from the dataset, which determines the spread of the Gaussian kernel function.
One important advantage of RBF kernels over other kernels, such as polynomial kernels, is that the data vectors are not forced into a specific shape as they are in polynomial kernels. In radial basis kernels, only the centers and the spread of the Gaussian structures (nodes) are given.
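A Gaussian RBF kernel matrix per (C.39) can be computed in a few lines; the vectorized implementation below is a sketch, with σ supplied by the user as discussed above.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian RBF kernel matrix of (C.39):
    K[i, j] = exp(-||x1_i - x2_j||^2 / (2 * sigma**2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```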
Non-Linear SVM for Classification – SVC

Non-linear support vector algorithms are constructed using the kernelized form of the linear SVC algorithm as follows:
$$\arg\min_{\beta}\; \tfrac{1}{2}\sum_{l=1}^{n}\sum_{k=1}^{n}\beta_l\,\beta_k\, y_k\, y_l\, K\langle x_k, x_l\rangle - \sum_{k=1}^{n}\beta_k$$
$$\text{s.t.}\quad 0 \le \beta_k \le C_{reg},\; k = 1,\dots,n, \qquad \sum_{k=1}^{n}\beta_k y_k = 0. \tag{C.40}$$
The weight vector and the SVC function using a kernel function are represented by
$$f(x') = \sum_{k=1}^{n}\beta_k\, y_k\, K\langle x_k, x'\rangle, \qquad w = \sum_{k=1}^{n}\beta_k\, y_k\, \Phi(x_k) \tag{C.41}$$
Linear SVM for Regression – SVR

Suppose we are given a set of observations generated from an unknown probability distribution P(x, y), where the data vectors are represented as Z = {(x_1, y_1), …, (x_n, y_n)}, n being the total number of training data vectors, x_k = [x_{1,k}, …, x_{nv,k}]^T ∈ ℜ^{nv}, y_k ∈ ℜ, k = 1,…,n. In linear SVR, the aim is to find a pair (w, b), where w ∈ ℜ^{nv} is the weight vector and b ∈ ℜ is the bias term of the regression equation, such that the value of a point x_k can be predicted according to the real-valued function
$$f(x_k) = \hat{y}_k = \langle w, x_k\rangle + b \tag{C.42}$$
In (C.42), ⟨·,·⟩ denotes the dot product. Linear SVR utilizes a linear combination of training patterns in some feature space H, i.e., a dot product space, and searches for a weight vector and bias term in that feature space. The goal is to find a function f(x) that has at most ε deviation from the actual targets y_k. This concept was first proposed by Vapnik [1995, 1998] and can be expressed with the ε-insensitive loss function l_ε:
$$l_\varepsilon = |y_k - f(x_k)|_\varepsilon = \max\{0,\; |y_k - f(x_k)| - \varepsilon\} \tag{C.43}$$
Note from (C.43) that the loss function does not penalize errors below some ε ≥ 0, where ε is chosen prior to the calculations. Thus, the goal of learning is to find a function with a small risk on test samples. In addition, SVR tries not only to minimize the empirical risk on the training samples, but also to find a simple function, minimizing the complexity of the model. In particular, the flatter the functions are, the less complex, i.e., the simpler, they are, because they are closer to linear functions. Note also that the flatter a function is, the smaller its weight vector (coefficients); hence, in SVR the complexity term is expressed in terms of the weights. In order to ensure that the weight vector is small, the Euclidean norm, i.e., ‖w‖², is used: if one minimizes the square of the weight vector of the model, one obtains a smaller risk. The objective function of SVR, with two conditions similar to SVC, is as follows:
$$\min_{w, b}\; \tfrac{1}{2}\|w\|^2 \qquad \text{s.t.}\quad \begin{cases} y_k - \langle w, x_k\rangle - b \le \varepsilon\\[2pt] \langle w, x_k\rangle + b - y_k \le \varepsilon \end{cases} \tag{C.44}$$
The assumption behind the convex optimization problem of (C.44) is that it is possible to find a function that approximates all pairs (x, y) with ε precision, i.e., that the convex optimization problem is feasible. In practice, however, the constraints may not always be feasible, or we may want to allow some extra errors at some points, e.g., to cope with over-fitting. To overcome these problems, the optimization problem in (C.44) is changed into a softer optimization problem as follows:
$$\min_{w, b, \xi, \xi^*}\; \varsigma(w, b, \xi, \xi^*) = \tfrac{1}{2}\|w\|^2 + C_{reg}\sum_{k=1}^{n}\left(\xi_k + \xi_k^*\right)$$
$$\text{s.t.}\quad \begin{cases} \langle w, x_k\rangle + b - y_k \le \varepsilon + \xi_k\\ y_k - \langle w, x_k\rangle - b \le \varepsilon + \xi_k^*\\ \xi_k \ge 0,\;\; \xi_k^* \ge 0, \qquad k = 1,\dots,n \end{cases} \tag{C.45}$$
This model is called the primal model of soft-margin ε-SVR. The unknown parameters are w, the weight vector; b, the bias term; and ξ_k and ξ_k*, the slack variables for every kth observation, k = 1,…,n, which soften the error margin. Introducing these slack variables helps to solve an optimization problem whose constraints might otherwise be infeasible. This way, even if the deviation is larger than ε, the samples are not penalized beyond a certain value, denoted by these slack variables.
Fig. C.3 Soft Margin Loss Function
The graphical representation of the loss function and the slack variables is displayed in Figure C.3. In Figure C.3, the curvy continuous line on the left graph represents the estimated function f(x) constructed as a result of the SVR learning algorithm. The shaded area, the ε-tube, shows the data points for which the deviation between the actual and estimated output values is less than ε. The points that fall inside this tube do not affect the decision function; the points outside the ε-tube do affect the regression function. In soft-margin SVR, however, the points outside the ε-tube do not affect the regression function as long as the deviation is less than (ε + ξ_k) or (ε + ξ_k*). C_reg is a constant (C_reg > 0) that determines the trade-off between the flatness of f(x), represented by the term ½‖w‖², and the amount up to which deviations between the actual and predicted outputs that are larger than ε are tolerated. In order to solve (C.45), Lagrange multipliers, α and η, are introduced to represent the primal model in its dual optimization form as follows:
$$\max\; L = -\tfrac{1}{2}\sum_{k,l=1}^{n}(\alpha_k^* - \alpha_k)(\alpha_l^* - \alpha_l)\,\langle x_k, x_l\rangle - \varepsilon\sum_{k=1}^{n}(\alpha_k^* + \alpha_k) + \sum_{k=1}^{n}(\alpha_k^* - \alpha_k)\, y_k$$
$$\text{s.t.}\quad \sum_{k=1}^{n}(\alpha_k^* - \alpha_k) = 0, \qquad \alpha_k,\, \alpha_k^* \in \left[0,\, C_{reg}\right], \qquad k, l = 1,\dots,n,\;\; C_{reg} > 0 \tag{C.46}$$
There were originally four Lagrange multipliers in the model, one for each constraint of the primal model in (C.45); two of them were eliminated in deriving the model in (C.46). From the Lagrange model, the weight vector can now be expressed using the Lagrange multipliers as:
$$w = \sum_{k=1}^{n}\left(\alpha_k - \alpha_k^*\right) x_k \tag{C.47}$$
where x_k is the kth observation and α_k, α_k* are the Lagrange multipliers of observation x_k. The estimation function for a new vector is as follows:

$$f(x^{new}) = \sum_{k=1}^{n}\left(\alpha_k - \alpha_k^*\right)\langle x_k, x^{new}\rangle + b = \sum_{k=1}^{n}\left(\alpha_k - \alpha_k^*\right) x_k^T x^{new} + b \tag{C.48}$$
where x^new is a new vector whose output is to be predicted, and T represents the transpose operation on vectors. Since the dual model can be explained in terms of dot products between the data, and f(x) calculates the estimated output using the training input vectors x_k, there is no need to calculate the weight vector explicitly. The model output in (C.48) is calculated using only the input vectors (observations) whose corresponding Lagrange multiplier difference, i.e., the value of (α_k − α_k*) for the kth input vector, is non-zero. (As a result of the optimization algorithm, some Lagrange multipliers will be zero.) The vectors satisfying this condition in the dual model are called the SUPPORT VECTORS.
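The prediction rule (C.48) uses only the support vectors. The sketch below assumes the dual coefficients α, α* and the bias b have already been obtained from a QP solver for (C.46); the tolerance used to detect non-zero coefficients is an illustrative choice.

```python
import numpy as np

def svr_predict(x_new, X, alpha, alpha_star, b):
    """Prediction function of (C.48): only training vectors whose
    coefficient (alpha_k - alpha_k*) is non-zero, i.e., the support
    vectors, contribute to the output. alpha and alpha_star would come
    from solving the dual problem (C.46) with a QP solver."""
    coef = alpha - alpha_star
    sv = np.abs(coef) > 1e-9             # indices of the support vectors
    return coef[sv] @ (X[sv] @ x_new) + b
```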
Non-Linear Support Vector Regression (SVR)

The non-linear SVR optimization algorithm utilizes a kernel function to learn a non-linear function in the original space without explicitly executing a non-linear mapping or constructing dot products. The dual optimization function in (C.46) can be reformulated for non-linear models as follows:
$$\max\; L = -\tfrac{1}{2}\sum_{k,l=1}^{n}\left(\alpha_k - \alpha_k^*\right)\left(\alpha_l - \alpha_l^*\right) K\langle x_k, x_l\rangle - \varepsilon\sum_{k=1}^{n}\left(\alpha_k^* + \alpha_k\right) + \sum_{k=1}^{n}\left(\alpha_k - \alpha_k^*\right) y_k$$
$$\text{s.t.}\quad \sum_{k=1}^{n}\left(\alpha_k - \alpha_k^*\right) = 0, \qquad \alpha_k,\, \alpha_k^* \in \left[0,\, C_{reg}/n\right] \tag{C.49}$$
The weight vector and the non-linear regression function at a given point are given by

w = Σ_{k=1}^{n} (αk − αk*) φ(xk)        (C.50)

f(xnew) = Σ_{k=1}^{n} (αk − αk*) K(xk, xnew) + b        (C.51)
where xnew represents a new vector whose output can be estimated using the function in (C.51), and φ(xk) represents the support vectors in the feature space. The only difference between linear and non-linear SVR is that in the non-linear case the flattest function is searched for in the feature space.
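A similar sketch (again assuming scikit-learn) shows that the non-linear prediction (C.51) is the same dual sum with the kernel value K(xk, xnew) in place of the dot product; the mapping φ(x) is never computed explicitly:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(1)
    X = rng.uniform(-2, 2, size=(80, 1))
    y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(80)

    gamma = 0.5
    model = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.05).fit(X, y)

    x_new = np.array([[0.7]])
    K = rbf_kernel(model.support_vectors_, x_new, gamma=gamma)   # K(x_k, x_new)
    f_manual = (model.dual_coef_ @ K + model.intercept_).item()  # equation (C.51)
    print(f_manual, model.predict(x_new)[0])                     # identical values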
C.3 Genetic Algorithms
The genetic algorithm (GA) is an optimization and search technique based on the principles of genetics and natural selection. Genetic algorithms were introduced by John Holland in 1975 and later extended and applied by David Goldberg [1989]. A major characteristic of genetic algorithms is that the algorithm works with a population of models, unlike classical approaches, which operate on a single solution at a time. A GA allows a population to evolve, under certain rules, to a state that maximizes the "fitness" function. Each chromosome represents a different possible model structure, and the algorithm explores different regions of the solution space simultaneously. The pseudo-code of the GA is given in Appendix Algorithm C.1.
Genetic Algorithm Components
GAs encode the decision variables, in other words the input variables, of the underlying problem into strings. Each of these strings is called an individual or chromosome (genome). Each character in the string is called a gene. Each gene is characterized by its position in the chromosome; these positions are called tokens.

Algorithm C.1 Pseudo-Code of Generalized Genetic Algorithm
Step 1: INITIALIZATION. Generate an initial population P randomly or based on prior knowledge.
Step 2: FITNESS EVALUATION. Evaluate the fitness functions of all individuals in P.
Step 3: SELECTION. Select a set of promising candidates S from P.
Step 4: CROSSOVER. Apply crossover to the mating pool S to generate a set of offspring O.
Step 5: MUTATION. Apply mutation to the offspring set O to obtain its perturbed set O′.
Step 6: REPLACEMENT. Replace the current population P with the set of offspring O′.
Step 7: TERMINATION. If the termination criteria are not met, go to Step 2.
There are two different representation classes: genotype and phenotype. The genotype represents the coding of the variables, and the phenotype their actual values. A fitness function is used to measure the performance of each individual. Based on the fitness value, individuals go through a series of operations or are dropped from the search space. The initial population is created at random or with prior knowledge about the variables' domains. The individuals are evaluated with a fitness function to measure the quality of the candidate solutions. In order to generate, or evolve, the offspring, genetic operators are applied to the current population. The genetic operators are: selection (or reproduction), crossover (or recombination), and mutation.
Genetic Selection Operators
The individuals with higher fitness values are selected as parents of the next generation. The selection process is thus intended to improve the average quality of the population by giving superior individuals a better chance of being copied into the next generation. There are two types of selection plans: proportionate and ordinal. In proportionate selection, individuals are picked based on their fitness values relative to the fitness of the other individuals in the population. Well-known examples of this type of selection operator are roulette-wheel selection and stochastic remainder selection. Ordinal selection, on the other hand, selects individuals based on their rank in the population, where the individuals are ranked by their fitness values. Tournament selection is the most common selection method of this type.
Other Genetic Operators
The crossover operator combines partial solutions from two or more parental individuals, according to a crossover probability, pc, in order to create an offspring. There are many crossover variations; one-point and uniform crossover are the most commonly used operators, as shown in Figure C.4 and Figure C.5. One-point crossover randomly chooses a crossover point and exchanges all the genes behind the crossover point. Uniform crossover exchanges each gene with probability 0.5, which achieves a maximum token-wise mixing rate. Mutation operates by altering a small percentage of the genes in the individuals to slightly perturb the recombined solution. The most common mutation operator
Fig. C.4 One-point Crossover

Fig. C.5 Uniform Crossover

[The figures show two six-gene parent bit-strings exchanging genes: all genes behind the crossover point in Fig. C.4, and individual genes in Fig. C.5.]
is the bit-wise mutation, in which binary-valued tokens are complemented with a mutation probability, pm. For example, an individual I = [1 1 1 1 1 1] may turn into I′ = [1 1 0 1 1 1] when the third gene is randomly chosen for mutation. In general, mutation should change the individuals only slightly. By balancing the selection operator against the exploration operators (crossover and mutation), GAs can find the global optimum within the provided search space. Stopping criteria determine what causes the algorithm to terminate. Possible criteria include the number of generations, which specifies the maximum number of iterations the genetic algorithm will perform, and the fitness limit, where the algorithm stops if the best fitness value is less than or equal to the value of the fitness limit.
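The loop below is a minimal sketch (not the implementation used in this book) of Algorithm C.1, combining tournament selection, one-point crossover, and bit-wise mutation on a toy fitness that simply counts the 1-genes:

    import random

    def fitness(ind):
        return sum(ind)                    # toy fitness: number of 1-genes

    def tournament(pop, k=2):              # ordinal (tournament) selection
        return max(random.sample(pop, k), key=fitness)

    def one_point_crossover(a, b, pc=0.9):
        if random.random() < pc:
            cut = random.randrange(1, len(a))            # crossover point
            return a[:cut] + b[cut:], b[:cut] + a[cut:]
        return a[:], b[:]

    def mutate(ind, pm=0.01):              # bit-wise mutation
        return [g ^ 1 if random.random() < pm else g for g in ind]

    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]  # Step 1
    for generation in range(50):                                          # Steps 2-7
        offspring = []
        while len(offspring) < len(pop):
            c1, c2 = one_point_crossover(tournament(pop), tournament(pop))
            offspring += [mutate(c1), mutate(c2)]
        pop = offspring                                                   # Step 6
    print(max(fitness(i) for i in pop))    # approaches 20 as the population evolves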
C.4 Multiple Linear Regression Algorithms with Least Squares Estimation
Regression is an area of statistics that deals with methods for investigating the associations between observable quantities, under the assumption that an output variable depends on a number of input variables. The dependent variable is usually called the output (or response) variable, and the independent variables are called the inputs (or explanatory variables). There are many variations of regression models; in this section, general linear regression models are described. In a regression model, the assumption is that the dependent variable is a linear function of one or more independent variables plus an error term. Let the regression model be defined as a multi-input, single-output (MISO) model as follows:
y = β0 + β1 x1 + … + βnv xnv + ε        (C.52)
y is the dependent (output) variable; the xj, j = 1,…,nv, are the input or explanatory variables, where nv is the number of inputs; and ε is the independent error term of the response, which is typically assumed to be normally distributed. The goal of regression analysis is to obtain estimates of the unknown parameters, the βj, which indicate how a change in one of the independent variables affects the values taken by the dependent variable. The usual method of estimation for the regression model is ordinary least squares (OLS).
In matrix notation the general linear model is expressed as:

y = Xβ + ε        (C.53)
where y is the [n×1] vector of response values, X is the [n×(nv+1)] matrix of known inputs (an intercept column plus the nv input variables), n is the number of input vectors in the dataset, β is the [(nv+1)×1] vector of regression parameters, and ε is the [n×1] vector of errors. The objective is to minimize the total residuals. The simplest linear regression, which minimizes the total squared error between the actual and estimated outputs, is called least squares regression. The parameters are chosen in such a way that the sum of squared prediction errors is the smallest possible among all choices of parameters. Therefore, we minimize Q such that:

min Q = Σ_{k=1}^{n} (yk − β0 − β1 x1,k − … − βnv xnv,k)²        (C.54)
In matrix notation we minimize

min Q = (y − Xβ)′(y − Xβ)

∂/∂β [(y − Xβ)′(y − Xβ)] = 0
2(X′X)β = 2X′y
β = (X′X)⁻¹ X′y        (C.55)

provided that the matrix (X′X) is not singular (a matrix is singular if and only if its determinant is 0), where X′ is the transpose of X. After the parameters are identified, the prediction ability of the calculated model is measured using a performance measure of the domain expert's choice.
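As a minimal numerical sketch of (C.55) (not from the original text), the normal-equation estimate can be computed directly; in practice a least-squares solver is preferred to an explicit matrix inverse:

    import numpy as np

    rng = np.random.default_rng(2)
    n, nv = 100, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, nv))])  # intercept column
    beta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    beta_normal = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X) beta = X'y
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically safer
    print(np.allclose(beta_normal, beta_lstsq))          # True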
C.5 Logistic Regression
Logistic regression is a type of predictive model that can be used when the target variable is a categorical variable with two categories – for example live/die, has disease/doesn't have disease, purchases product/doesn't purchase, etc. A logistic regression model does not involve decision trees and is more akin to non-linear regression, such as fitting a polynomial to a set of data values. Logistic regression can be used with the following types of target variables:

• A categorical target variable that has exactly two categories (i.e., a binary or dichotomous variable),
• A categorical target variable that has more than two categories (i.e., y ∈ {0,1,2,3,…}), or
• A continuous target variable that has values in the range 0.0 to 1.0, representing probability values or proportions.
In this thesis, we only dealt with problem domains in which the target takes on binary values. The logistic model formula computes the probability of the selected response as a function of the values of the predictor variables. If a predictor variable is a categorical variable with two values, then one of the values is assigned the value 1 and the other the value 0. If a predictor variable is a categorical variable with more than two categories, then a separate dummy variable is generated to represent each of the categories except for one, which is excluded. The value of a dummy variable is 1 if the original variable has that category and 0 if it has any other category; hence, no more than one dummy variable will be 1. If the variable has the value of the excluded category, then all of the dummy variables generated for the variable are 0. Before explaining the formula of the logistic model, it is helpful to understand odds and odds ratios. The odds of an event are the ratio of the expected number of times that the event will occur to the expected number of times it will not occur. There is a simple relationship between probabilities and odds: if p is the probability of an event and O is its odds, then
O = p / (1 − p) = probability_of_event / probability_of_no_event
Based on the odds formula, the logistic regression formula is given as:

log( P(yk = 1) / P(yk = 0) ) = β0 + β1 x1,k + … + βnv xnv,k        (C.56)
where P is the probability of y = 1. The expression on the left-hand side is referred to as the logit, or log-odds. We can solve the logit equation to obtain P as follows:

P(yk = 1 | xk, β) = 1 / ( 1 + exp( −(β0 + β1 x1,k + … + βnv xnv,k) ) )        (C.57)
where β0 is a constant and the βj are the coefficients of the predictor variables (or of the dummy variables in the case of multi-category predictor variables). The computed value, P, is a probability in the range 0 to 1, and exp() is e raised to a power. The parameters of the logistic regression are estimated by maximum likelihood. The likelihood function expresses the probability of observing the data at hand as a function of the unknown parameters. The likelihood of observing all the yk, k = 1,…,n, factors into the product of the individual probabilities:

L = P(y1)·P(y2)⋯P(yn) = Π_{k=1}^{n} P(yk)        (C.58)
It is known that P(yk) = pk^{yk} (1 − pk)^{1−yk}; substituting into (C.58) we obtain:

L = Π_{k=1}^{n} ( pk / (1 − pk) )^{yk} (1 − pk)        (C.59)
At this point the logarithm is much easier to work with:

Log L = Σ_k yk log( pk / (1 − pk) ) + Σ_k log(1 − pk)        (C.60)
Substituting the expression for the logit model in (C.57) into the logarithm of the objective function, we get:

Log L = Σ_k yk βT xk − Σ_k log(1 + e^{βT xk})        (C.61)
The parameters of the function above are found by taking the derivative of the function with respect to β, setting the derivative to 0, and solving for β.
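A minimal sketch (not the original text's code) that maximizes the log-likelihood (C.61) by gradient ascent; the gradient with respect to β is Σk xk (yk − pk), i.e., X′(y − p):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # intercept + 2 inputs
    beta_true = np.array([-0.5, 1.5, -2.0])
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

    beta = np.zeros(3)
    for _ in range(5000):                     # fixed-step gradient ascent
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # (C.57)
        beta += 0.01 * X.T @ (y - p)          # derivative of (C.61) w.r.t. beta
    print(beta)                               # close to beta_true for large n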
C.6 Fuzzy K-Nearest Neighbor Approach
The Fuzzy K-Nearest Neighbor (FKNN) approach [Keller, et al., 1985] is a classification method based on fuzzy logic that requires no prior knowledge of the sample distribution. An unknown sample's membership in each class is assigned from its K nearest known neighbours'
memberships in those classes, weighted by a function of the neighbours' distances from the unknown sample. The membership of an input data point x in class i is given as:

ui(x) = Σ_{j=1}^{K} uij ‖x − xj‖^{−2/(m−1)}  /  Σ_{j=1}^{K} ‖x − xj‖^{−2/(m−1)}        (C.62)
In (C.62), uij is the membership value of the labelled pattern xj in class i. The fuzziness parameter m may affect the system performance. According to [Keller, et al., 1985], m = 2 can be used in any experiment, although no theoretical or experimental evidence is given to justify this claim. Hence, the default m = 2 was used in the experiments. The labelled patterns can be assigned class memberships in several ways. As in [Keller et al., 1985], a reasonable membership assignment in each class can be computed as:
uj(x) = 0.51 + (nj / K) × 0.49    if j = i
uj(x) = (nj / K) × 0.49           if j ≠ i        (C.63)
where nj denotes the number of neighbours which belong to the jth class. The FKNN algorithm is summarized as follows:

BEGIN
  Input x, of unknown classification
  Set K, 1 ≤ K ≤ n
  Initialize i = 1
  DO UNTIL (K nearest neighbours to x found)       {find the nearest neighbours}
    Compute distance from x to xi
    IF (i ≤ K) THEN
      Include xi in the set of K nearest neighbours
    END IF
    Increment i
  END DO UNTIL
  Initialize i = 1
  DO UNTIL (x assigned membership in all classes)  {assign membership values}
    Compute ui(x) using (C.62)
    Increment i
  END DO UNTIL
END
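A minimal sketch (assuming Euclidean distances and the default m = 2; not the original implementation) of the two FKNN steps: the labelled patterns are first given soft memberships with (C.63), which are then combined by the distance-weighted sum (C.62):

    import numpy as np

    def init_memberships(X, y, n_classes, K):
        # (C.63): soft labels for every labelled pattern from its own K neighbours
        U = np.zeros((len(X), n_classes))
        for j in range(len(X)):
            d = np.linalg.norm(X - X[j], axis=1)
            nn = np.argsort(d)[1:K + 1]                   # exclude the point itself
            n_per_class = np.bincount(y[nn], minlength=n_classes)
            U[j] = 0.49 * n_per_class / K
            U[j, y[j]] += 0.51                            # own class gets the 0.51 boost
        return U

    def fknn(x, X, U, K=5, m=2.0):
        # (C.62): membership of an unlabelled x in every class
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[:K]
        w = (d[nn] + 1e-12) ** (-2.0 / (m - 1.0))         # inverse-distance weights
        return (U[nn] * w[:, None]).sum(axis=0) / w.sum()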
D.1 T-Test Formula
The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate when one wants to compare the means of two groups, as shown in Figure D.1.

Fig. D.1 Idealized distributions for treated and comparison group values
Two different groups with statistically different averages, assuming unequal variances. The assumptions are: (1) the samples (n1 and n2) from the two normal populations are independent; (2) one or both sample sizes are less than 30; (3) the appropriate sampling distribution of the test statistic is the t distribution; (4) the unknown variances of the two populations are not equal. When one looks at the difference between the scores of two groups, one ought to judge the difference between their means relative to the spread, or variability, of their scores, which is what the t-test measures. The formula for the t-test is the ratio shown
in (D.64). The numerator of the ratio is simply the difference between the two means or averages; the denominator is a measure of the variability or dispersion of the scores.

t = (x̄1 − x̄2) / sqrt( σ1²/n1 + σ2²/n2 )        (D.64)

x̄1 is the mean of sample 1 and x̄2 the mean of sample 2; n1 and n2 are the numbers of subjects in samples 1 and 2, respectively; σ1² and σ2² are the variances of samples 1 and 2, respectively, calculated as σ1² = Σ(x1 − x̄1)²/n1 and σ2² = Σ(x2 − x̄2)²/n2.
Once one computes the t-value, one has to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have resulted from chance. To test the significance, one needs to set a risk level (called the alpha level). In most social research, the "rule of thumb" is to set the alpha level at .05. This means that five times out of a hundred one would find a statistically significant difference between the means even if there was none (i.e., by "chance").
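A minimal sketch of (D.64) (note that the sketch uses the sample variance, dividing by n−1 rather than by n as in the text, so that it matches SciPy's Welch t-test exactly):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x1 = rng.normal(10.0, 2.0, size=25)
    x2 = rng.normal(11.0, 3.0, size=20)

    t_manual = (x1.mean() - x2.mean()) / np.sqrt(
        x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))   # (D.64)
    t_scipy, p = stats.ttest_ind(x1, x2, equal_var=False)      # Welch's t-test
    print(t_manual, t_scipy, p)   # t values agree; reject H0 if p < alpha (e.g. 0.05)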
D.2 Friedman's Artificial Dataset: Summary of Results

Table D.1 Friedman's Artificial R² results on Training Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       0.997    0.999    1.000    1.000    1.000    0.999     0.001
DENFIS      0.931    0.918    0.923    0.907    0.907    0.917     0.010
NN          0.877    0.902    0.847    0.860    0.948    0.887     0.040
GFS         0.982    0.985    0.987    0.988    0.984    0.985     0.002
SVM         0.969    0.967    0.964    0.970    0.961    0.966     0.004
DIT2FRB     0.970    0.971    0.967    0.975    0.974    0.972     0.003
T1FF        0.979    0.965    0.962    0.965    0.958    0.966     0.008
T1IFF       0.979    0.961    0.963    0.966    0.956    0.965     0.009
DIT2FF      0.997    0.985    0.985    0.987    0.984    0.988     0.005
DIT2IFF     0.998    0.991    0.989    0.985    0.987    0.990     0.005
ET1FF       0.973    0.972    0.964    0.969    0.969    0.969     0.003
ET1IFF      0.973    0.970    0.974    0.972    0.967    0.971     0.003
EDIT2FF     0.975    0.985    0.972    0.985    0.977    0.979     0.006
EDIT2IFF    0.983    0.988    0.983    0.971    0.975    0.980     0.007
Table D.2 Friedman's Artificial R² results on Validation Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Std
ANFIS       0.777    0.426    0.194    0.560    0.474    0.486     0.211
DENFIS      0.866    0.871    0.857    0.859    0.860    0.863     0.006
NN          0.857    0.893    0.852    0.803    0.944    0.870     0.053
GFS         0.844    0.828    0.466    0.801    0.736    0.735     0.156
SVM         0.944    0.943    0.939    0.928    0.943    0.939     0.007
DIT2FRB     0.920    0.908    0.899    0.896    0.919    0.908     0.011
T1FF        0.945    0.947    0.938    0.931    0.945    0.941     0.007
T1IFF       0.945    0.947    0.937    0.932    0.944    0.941     0.006
DIT2FF      0.995    0.982    0.985    0.979    0.985    0.985     0.006
DIT2IFF     0.973    0.972    0.969    0.958    0.969    0.968     0.006
ET1FF       0.948    0.943    0.943    0.933    0.947    0.943     0.006
ET1IFF      0.950    0.943    0.932    0.933    0.950    0.941     0.009
EDIT2FF     0.969    0.980    0.975    0.972    0.979    0.975     0.004
EDIT2IFF    0.957    0.964    0.962    0.934    0.963    0.956     0.013
Table D.3 Friedman's Artificial R² results on Testing Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       0.743    0.430    0.129    0.539    0.376    0.444     0.225
DENFIS      0.852    0.867    0.848    0.852    0.856    0.855     0.007
NN          0.837    0.883    0.858    0.842    0.943    0.873     0.043
GFS         0.850    0.817    0.493    0.835    0.646    0.728     0.155
SVM         0.939    0.938    0.938    0.937    0.938    0.938     0.001
DIT2FRB     0.882    0.891    0.888    0.905    0.878    0.893     0.009
T1FF        0.938    0.940    0.937    0.940    0.939    0.939     0.001
T1IFF       0.938    0.941    0.938    0.940    0.939    0.939     0.001
DIT2FF      0.929    0.940    0.937    0.937    0.937    0.936     0.004
DIT2IFF     0.917    0.927    0.924    0.926    0.925    0.924     0.004
ET1FF       0.943    0.941    0.942    0.940    0.942    0.942     0.001
ET1IFF      0.944    0.940    0.935    0.940    0.946    0.941     0.004
EDIT2FF     0.948    0.945    0.946    0.945    0.947    0.946     0.001
EDIT2IFF    0.941    0.938    0.936    0.941    0.944    0.940     0.003
Table D.4 Friedman's Artificial MAPE results on Training Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       1.404    0.790    0.528    0.244    0.231    0.639     0.485
DENFIS      8.473    9.705    8.683    11.010   9.518    9.478     1.005
NN          10.563   13.955   12.538   12.212   7.381    11.330    2.515
GFS         4.296    3.913    3.136    3.511    3.645    3.700     0.435
SVM         5.791    6.965    5.763    6.083    6.033    6.127     0.489
DIT2FRB     4.732    4.575    5.012    5.023    4.263    4.721     0.319
T1FF        4.431    7.177    6.045    6.745    6.334    6.146     1.050
T1IFF       4.373    8.091    5.978    6.682    6.610    6.347     1.347
DIT2FF      1.582    4.416    3.466    3.736    3.449    3.330     1.053
DIT2IFF     1.293    2.597    2.824    3.817    2.969    2.700     0.912
ET1FF       5.550    5.518    5.370    5.767    5.411    5.523     0.155
ET1IFF      5.400    4.873    4.422    5.173    5.750    5.124     0.507
EDIT2FF     5.014    3.051    4.579    3.272    4.641    4.111     0.886
EDIT2IFF    4.000    3.099    3.328    5.594    4.818    4.168     1.040
Table D.5 Friedman's Artificial MAPE results on Validation Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       19.903   25.822   51.765   23.313   35.889   31.338    12.880
DENFIS      13.346   12.072   11.884   11.905   12.807   12.403    0.648
NN          12.731   11.401   11.946   12.864   8.334    11.455    1.844
GFS         15.685   13.318   24.133   14.390   19.562   17.418    4.435
SVM         10.167   8.175    7.481    8.396    8.327    8.509     0.995
DIT2FRB     11.072   10.065   9.658    9.610    9.394    9.960     0.667
T1FF        9.370    7.942    7.800    8.229    8.048    8.278     0.630
T1IFF       9.491    7.997    7.730    8.206    8.192    8.323     0.681
DIT2FF      1.851    3.791    3.224    3.770    3.490    3.225     0.802
DIT2IFF     5.104    4.939    4.813    5.679    5.008    5.109     0.336
ET1FF       9.414    8.123    7.195    8.087    7.976    8.159     0.798
ET1IFF      9.147    8.157    8.048    8.094    7.906    8.270     0.499
EDIT2FF     6.813    3.625    4.493    3.964    4.694    4.718     1.245
EDIT2IFF    8.153    5.860    5.330    7.982    6.123    6.690     1.291
Table D.6 Friedman's Artificial MAPE results on Testing Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       16.612   26.368   49.245   24.325   30.483   29.407    12.181
DENFIS      12.080   11.838   12.869   12.447   12.051   12.257    0.406
NN          12.143   11.949   12.675   12.118   8.273    11.432    1.786
GFS         12.898   13.685   23.194   13.605   18.216   16.320    4.384
SVM         8.556    8.545    8.678    8.361    8.434    8.515     0.122
DIT2FRB     10.565   10.119   10.631   9.750    9.999    10.213    0.377
T1FF        8.243    8.355    8.797    8.207    8.392    8.399     0.235
T1IFF       8.255    8.393    8.698    8.216    8.424    8.397     0.190
DIT2FF      8.586    8.141    8.488    8.170    8.406    8.358     0.196
DIT2IFF     9.321    8.993    9.300    8.682    9.027    9.065     0.262
ET1FF       8.053    8.151    8.123    8.151    8.110    8.117     0.040
ET1IFF      7.920    8.161    8.249    8.084    7.908    8.064     0.149
EDIT2FF     7.588    7.661    7.844    7.797    7.763    7.731     0.104
EDIT2IFF    8.040    8.196    8.547    8.113    8.039    8.187     0.211
Table D.7 Friedman's Artificial RMSE results on Training Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       0.260    0.128    0.099    0.058    0.049    0.119     0.085
DENFIS      1.345    1.424    1.360    1.576    1.499    1.441     0.097
NN          1.757    1.540    1.874    1.915    1.120    1.641     0.326
GFS         0.668    0.592    0.536    0.561    0.616    0.595     0.051
SVM         0.889    0.898    0.910    0.893    0.962    0.910     0.030
DIT2FRB     0.872    0.844    0.878    0.812    0.795    0.840     0.036
T1FF        0.730    0.928    0.940    0.966    1.002    0.913     0.106
T1IFF       0.720    0.972    0.925    0.959    1.034    0.922     0.120
DIT2FF      0.289    0.607    0.598    0.606    0.628    0.546     0.144
DIT2IFF     0.239    0.480    0.503    0.628    0.568    0.484     0.148
ET1FF       0.832    0.816    0.908    0.913    0.864    0.867     0.044
ET1IFF      0.823    0.855    0.777    0.854    0.886    0.839     0.041
EDIT2FF     0.791    0.611    0.802    0.646    0.747    0.719     0.087
EDIT2IFF    0.648    0.533    0.634    0.883    0.774    0.694     0.136
Table D.8 Friedman's Artificial RMSE results on Validation Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       2.616    5.307    10.043   3.938    6.641    5.709     2.851
DENFIS      1.886    1.831    1.807    1.792    1.979    1.859     0.076
NN          0.961    1.658    1.848    2.127    1.249    1.568     0.466
GFS         2.097    2.179    4.885    2.226    3.133    2.904     1.185
SVM         1.225    1.221    1.179    1.284    1.274    1.237     0.043
DIT2FRB     1.456    1.540    1.520    1.550    1.517    1.516     0.036
T1FF        1.202    1.183    1.191    1.253    1.248    1.216     0.033
T1IFF       1.206    1.187    1.198    1.250    1.260    1.220     0.033
DIT2FF      0.350    0.698    0.583    0.702    0.652    0.597     0.146
DIT2IFF     0.853    0.875    0.862    0.994    0.953    0.907     0.063
ET1FF       1.167    1.211    1.233    1.222    1.135    1.194     0.041
ET1IFF      1.153    1.216    1.263    1.235    1.199    1.213     0.041
EDIT2FF     0.900    0.722    0.760    0.797    0.777    0.791     0.067
EDIT2IFF    1.073    0.976    0.932    1.227    1.024    1.046     0.114
Table D.9 Friedman's Artificial RMSE results on Testing Dataset from Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   std
ANFIS       2.750    5.263    10.150   4.419    6.122    5.741     2.762
DENFIS      1.899    1.801    1.937    1.901    1.874    1.883     0.051
NN          1.996    1.693    1.871    1.966    1.184    1.742     0.333
GFS         1.986    2.266    4.759    2.105    3.453    2.914     1.186
SVM         1.226    1.235    1.233    1.239    1.232    1.233     0.005
DIT2FRB     1.711    1.634    1.658    1.528    1.580    1.622     0.070
T1FF        1.234    1.216    1.252    1.215    1.222    1.228     0.015
T1IFF       1.236    1.210    1.239    1.215    1.222    1.224     0.013
DIT2FF      1.326    1.206    1.248    1.244    1.242    1.253     0.044
DIT2IFF     1.424    1.345    1.368    1.345    1.369    1.370     0.032
ET1FF       1.183    1.201    1.191    1.210    1.188    1.195     0.011
ET1IFF      1.173    1.211    1.263    1.208    1.153    1.201     0.042
EDIT2FF     1.128    1.160    1.145    1.166    1.135    1.147     0.016
EDIT2IFF    1.201    1.227    1.260    1.206    1.171    1.213     0.033
Fuzzy Function Parameters of the Optimum Models of Friedman's Artificial Dataset

Friedman's Artificial Dataset - Parameters of the Best Cross Validation Models of the First Optimum Methodology, EDIT2FF: Among the benchmark and proposed methodologies applied to the Friedman dataset, such as SVM, NN, etc., as shown in the MAPE, RMSE, and R² tables above, the proposed EDIT2FF, i.e., the Evolutionary Discrete Interval Type-2 Fuzzy Functions method, which implements the Fuzzy C-Means clustering algorithm to identify hidden structures and optimizes the parameters with evolutionary algorithms, is identified in Chapter 6 as the optimum methodology, based on the error reduction performance on the testing datasets. The parameters of the optimum models obtained from the application of EDIT2FF are stored in collection tables, i.e., m-Col* and Φ-Col*. There is one collection table set for each cross validation (cv) iteration. Here, only one set of collection table structures, obtained from a single cross validation model, will be presented. Each row of the m-Col* collection table holds the optimum degree of fuzziness parameter, the m* value, of the optimum embedded type-1 fuzzy function model identified for the corresponding training vector. Hence the m-Col* table is a (500×1) matrix, where n=500 indicates the number of training vectors in the particular cross validation training dataset of Friedman's Artificial. The Φ-Col* collection table, on the other hand, holds the local fuzzy function structures, i.e., the parameters Ŵi,k, k=1,…,500, i=1,…,c*, and the structures Φi,k of the local fuzzy functions f(Φi,k, Ŵi,k) of each embedded fuzzy function model identified for each kth training vector. Thus, if n=500 is the number of training vectors of Friedman's Artificial Dataset and c* is the optimum number of clusters, identified as {6} for the particular cross validation model presented here, then Φ-Col* is a matrix of (500×6) dimensions, each row holding the parameters of an embedded fuzzy function model. It should also be noted that, since EDIT2FF applies the Fuzzy C-Means (FCM) clustering algorithm, the collection tables that hold the parameters of the interim fuzzy functions, τ-Col, a property of the Improved Fuzzy Clustering (IFC) algorithm, are not utilized. The optimum parameters obtained from the five cross validation models of the EDIT2FF method are summarized as follows:

Optimum Parameters of the EDIT2FF Methodology Obtained from Cross Validation Trials on Friedman's Artificial Dataset
1   Model Name                                             EDIT2FF
2   Fuzzy Clustering Type                                  Fuzzy c-Means Clustering
3   Regression Type                                        Non-linear SVM
4   # of clusters                                          {6,7}
5   Fuzziness degree                                       [1.2,1.6]
6   Optimum list of membership value transformations
    to be used as additional input variables               (μ), (e^μ), (log(1−μ))
7   Alpha-cut                                              [0,0.1]
8   C-Regularization                                       [1.37,1.74]
9   Epsilon                                                {0.08}
10  m-Col, Φ-Col tables
Each optimum model identifies one optimum C-regularization* and Epsilon* value, determined by the genetic learning process of the EDIT2FF method. The interval-valued C-regularization = [1.37,1.74] in the above table indicates that one C-regularization parameter is identified for each of the five cross validation models, so the set of values can be stated as an interval; in reality each of the five models has one crisp optimum C-regularization and epsilon value. The optimum values of the degree of fuzziness, m, on the other hand, are identified as intervals within each of the five cv models in order to identify the interval membership values. The particular cross validation model of the EDIT2FF methodology presented here identifies c*=6 as the optimum number of clusters. The m collection table, m-Col, and the fuzzy function structure collection table, Φ-Col, of this particular EDIT2FF model are as follows:
(m-Col*)₅₀₀ₓ₁ = [ m1 = 1.3271 ; m2 = 1.3747 ; m3 = 1.2796 ; … ; m500 = 1.2558 ]ᵀ

(Φ-Col*)₅₀₀ₓ₆ =
  [ (Φ1,1, Ŵ*1,1)      …   (Φ6,1, Ŵ*6,1)
         ⋮             ⋱         ⋮
    (Φ1,500, Ŵ*1,500)  …   (Φ6,500, Ŵ*6,500) ]
In this particular optimum model, the fuzzy function parameters are approximated with the non-linear support vector regression method. Hence, each cell in (Φ-Col*)₅₀₀ₓ₆, i.e., (Φi,k, Ŵ*i,k), k=1,…,500, i=1,…,6, holds the following list of parameters:

• Support vector coefficients of the embedded type-1 fuzzy function model of the ith cluster, viz.
  - Lagrange multipliers for each training vector, i.e.,
    Ŵi=1,k=1 : (αi − αi*)₅₀₀ₓ₁ = [24  −24  …  19.27]ᵀ
  - Support vectors, i.e., Φi=1,k=1 is the matrix whose kth row is
    [ (μi,k)  (e^{μi,k})  (xk,1)  (xk,2)  (xk,3)  (xk,4)  (xk,5) ],  k = 1,…,500.
It should be recalled that in the EDIT2FF method the fuzzy functions of different clusters of an embedded model may take on different membership value transformations. For instance, the fuzzy function structure shown above, Φi=1,k=1, is that of the first cluster in the first row; it corresponds to the optimum model identified for the first training vector in the first cluster. For the same training vector k=1, another cluster of the same model may identify different fuzzy function parameters and support vectors using any other combination of the optimum membership value transformations. The list of different fuzzy function structures is identified in the first stage of the algorithm, i.e., the genetic learning process; in the second phase the optimum ones are selected for each training vector, and the parameters are retained in the collection tables to be used later during reasoning (inference).

Friedman's Artificial Dataset - Parameters of the Best Cross Validation Models of the Second Optimum Methodology, ET1IFF

The ET1IFF, i.e., the Evolutionary Type-1 Improved Fuzzy Functions methodology, applies the proposed Improved Fuzzy Clustering (IFC) method and optimizes the parameters with genetic algorithms. Since type-1 fuzzy functions are utilized instead of type-2 fuzzy functions, collection tables are not required. The summary of the optimum parameters of the five optimum cross validation models of the ET1IFF method is shown in the following table:

Optimum Parameters of the ET1IFF Methodology Obtained from Cross Validation Trials on Friedman's Artificial Dataset
1   Model Name                                             ET1IFF
2   Fuzzy Clustering Type                                  Improved Fuzzy Clustering (IFC)
3   Regression Type                                        Non-linear SVM
4   # of clusters                                          {6,7,8}
5   Optimum list of membership value transformations
    to be used as additional input variables               (μ), (e^μ)
6   κ (number of nearest training vectors for IFC)         {2}
7   Alpha-cut                                              [0,0.1]
8   Fuzziness degree                                       [1.3,1.85]
9   Creg                                                   [3.5,55]
10  Epsilon                                                [0.014,0.1]
The average performance of the optimum model of ET1IFF is identified from the five cross validation datasets; therefore, some of the optimum parameters shown in the above table are given as intervals rather than as a discrete list of values, although in reality each ET1IFF cross validation model holds one crisp value for each of these parameters. Next, we demonstrate the parameters and fuzzy function structures of one of the optimum cross validation models of the Friedman dataset, which holds five input variables and one output variable:
• c*: optimum number of clusters = 6
• m*: optimum degree of fuzziness = 1.85
• C-regularization of SVM = 39.767
• Epsilon* of SVM = 0.095
• The list of optimum membership value transformations to identify the interim and local system fuzzy functions includes only one transformation, i.e., (e^μ)
• Optimum interim fuzzy function parameters of each cluster i, used to obtain the improved membership values, are represented with support vector coefficients:
  - The Lagrange multipliers, i.e.,
    ŵ1 : (α1 − α1*)₅₀₀ₓ₁ = [−53.19  +53.19  …  …]ᵀ
    ⋮
    ŵ6 : (α6 − α6*)₅₀₀ₓ₁ = [−53.19  0  …  …]ᵀ
  - Support vectors of each cluster, i.e.,
    τi = [ (e^{μ1,i})  (e^{μ2,i})  …  (e^{μ500,i}) ]ᵀ
• Optimum local fuzzy function parameters of each cluster i, used to identify the local models, are represented by support vector coefficients:
  - The Lagrange multipliers:
    Ŵi : (αi − αi*)₅₀₀ₓ₁ = [−12.719  −39.767  …  11.258]ᵀ
  - Support vectors, i.e., Φi is the matrix whose kth row is
    [ (e^{μi,k})  (xk,1)  (xk,2)  (xk,3)  (xk,4)  (xk,5) ],  k = 1,…,500.
For each cluster, i=1,…,6, the interim and local fuzzy functions hold the same structures as shown above.
D.3 Auto-mileage Dataset: Summary of Results

Table D.10 Auto Mileage Dataset R² values on Training Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          0.909    0.899    0.898    0.889    0.897    0.898     0.007
ANFIS       0.968    0.942    0.963    0.957    0.963    0.959     0.010
DENFIS      0.911    0.915    0.900    0.880    0.909    0.903     0.014
GFS         0.932    0.890    0.913    0.897    0.867    0.898     0.024
SVM         0.947    0.918    0.932    0.936    0.953    0.937     0.014
DIT2FRB     0.932    0.920    0.939    0.914    0.931    0.927     0.010
T1FF        0.931    0.922    0.934    0.914    0.928    0.926     0.008
T1IFF       0.937    0.930    0.926    0.915    0.935    0.928     0.009
DIT2FF      0.951    0.945    0.956    0.931    0.942    0.945     0.010
DIT2IFF     0.952    0.954    0.967    0.934    0.950    0.951     0.012
ET1FF       0.936    0.925    0.926    0.881    0.924    0.918     0.021
ET1IFF      0.957    0.931    0.878    0.838    0.946    0.910     0.050
EDIT2FF     0.956    0.941    0.940    0.932    0.929    0.940     0.011
EDIT2IFF    0.966    0.896    0.936    0.938    0.941    0.935     0.025
Table D.11 Auto Mileage Dataset R² values on Validation Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          0.896    0.772    0.847    0.850    0.848    0.843     0.044
ANFIS       0.823    0.531    0.828    0.702    0.211    0.619     0.258
DENFIS      0.886    0.800    0.865    0.861    0.856    0.853     0.032
GFS         0.880    0.782    0.86     0.850    0.857    0.846     0.037
SVM         0.879    0.759    0.856    0.861    0.856    0.842     0.048
DIT2FRB     0.880    0.840    0.853    0.854    0.829    0.851     0.019
T1FF        0.887    0.845    0.855    0.863    0.835    0.857     0.020
T1IFF       0.897    0.830    0.867    0.864    0.850    0.861     0.024
DIT2FF      0.920    0.881    0.891    0.895    0.869    0.891     0.019
DIT2IFF     0.921    0.892    0.895    0.886    0.883    0.896     0.015
ET1FF       0.905    0.842    0.863    0.828    0.874    0.862     0.030
ET1IFF      0.923    0.836    0.855    0.820    0.894    0.866     0.042
EDIT2FF     0.930    0.863    0.916    0.943    0.886    0.908     0.033
EDIT2IFF    0.942    0.896    0.929    0.927    0.884    0.915     0.024
Table D.12 Auto Mileage Dataset R² values on Testing Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          0.829    0.877    0.798    0.889    0.812    0.841     0.040
ANFIS       0.815    0.803    0.722    0.825    0.669    0.767     0.068
DENFIS      0.829    0.872    0.804    0.893    0.829    0.845     0.036
GFS         0.841    0.880    0.767    0.87     0.842    0.841     0.040
SVM         0.852    0.861    0.806    0.914    0.805    0.848     0.045
DIT2FRB     0.815    0.883    0.816    0.897    0.821    0.846     0.040
T1FF        0.856    0.867    0.809    0.920    0.805    0.851     0.047
T1IFF       0.857    0.893    0.825    0.922    0.819    0.863     0.044
DIT2FF      0.859    0.877    0.805    0.908    0.815    0.853     0.043
DIT2IFF     0.859    0.889    0.808    0.914    0.819    0.858     0.045
ET1FF       0.847    0.880    0.817    0.862    0.794    0.840     0.035
ET1IFF      0.866    0.878    0.757    0.863    0.785    0.830     0.055
EDIT2FF     0.856    0.887    0.823    0.886    0.784    0.847     0.044
EDIT2IFF    0.887    0.866    0.810    0.882    0.869    0.863     0.031
Table D.13 Auto Mileage Dataset MAPE values on Training Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          8.099    8.306    8.726    8.053    7.716    8.180     0.371
ANFIS       4.514    6.383    5.608    5.255    4.952    5.343     0.707
DENFIS      7.950    7.38     8.64     8.48     7.497    7.989     0.565
GFS         10.743   10.975   11.232   10.575   10.687   10.842    0.26
SVM         5.564    6.898    6.791    4.986    4.916    5.831     0.959
DIT2FRB     6.293    5.738    5.594    6.014    6.216    5.971     0.301
T1FF        6.368    5.689    6.043    5.914    6.306    6.064     0.281
T1IFF       6.328    6.388    7.677    6.181    6.103    6.535     0.648
DIT2FF      4.544    4.176    4.100    4.741    5.101    4.532     0.413
DIT2IFF     4.869    4.527    3.997    5.149    4.807    4.670     0.436
ET1FF       6.692    7.012    8.305    9.321    7.657    7.798     1.053
ET1IFF      6.445    6.659    9.466    9.633    6.826    7.806     1.599
EDIT2FF     5.095    5.572    5.879    5.298    6.974    5.764     0.738
EDIT2IFF    2.567    6.252    4.718    4.696    4.711    4.589     1.313
Table D.14 Auto Mileage Dataset MAPE values on Validation Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          9.774    11.390   8.647    9.649    11.070   10.106    1.121
ANFIS       11.480   16.987   9.038    13.592   19.619   14.143    4.229
DENFIS      10.926   10.206   7.70     9.575    11.151   9.91      1.383
GFS         12.302   11.763   10.873   12.674   18.215   13.156    2.903
SVM         10.834   11.994   7.371    10.174   10.431   10.161    1.708
DIT2FRB     9.991    9.461    8.478    10.263   10.811   9.801     0.885
T1FF        9.697    9.466    8.273    9.686    10.607   9.546     0.836
T1IFF       9.366    8.986    7.644    9.643    10.017   9.131     0.913
DIT2FF      7.830    7.520    5.919    7.716    9.117    7.620     1.140
DIT2IFF     7.831    6.775    5.702    8.191    8.688    7.437     1.198
ET1FF       9.171    8.954    7.810    13.074   9.353    9.672     1.994
ET1IFF      8.550    9.109    8.363    10.779   8.105    8.981     1.071
EDIT2FF     7.067    7.723    5.346    4.951    8.954    6.808     1.666
EDIT2IFF    6.334    5.571    4.403    5.901    8.103    6.062     1.347
Table D.15 Auto Mileage Dataset MAPE values on Testing Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          10.559   8.750    10.888   9.541    10.957   10.139    0.961
ANFIS       10.181   14.201   13.250   11.712   12.526   12.374    1.53
DENFIS      10.286   8.866    10.723   8.605    9.115    9.519     0.931
GFS         10.09    9.150    10.27    10.541   9.957    10.0016   0.52
SVM         9.157    9.294    10.331   8.168    10.245   9.439     0.889
DIT2FRB     10.225   8.607    10.328   8.557    10.880   9.719     1.068
T1FF        9.171    9.394    10.413   7.852    10.284   9.423     1.031
T1IFF       9.027    8.383    10.122   8.050    10.008   9.118     0.934
DIT2FF      9.524    8.746    10.841   8.190    10.042   9.469     1.046
DIT2IFF     9.270    8.489    10.781   8.207    10.007   9.351     1.064
ET1FF       9.346    8.563    10.638   11.667   9.874    10.018    1.193
ET1IFF      8.680    8.431    12.594   10.835   10.598   10.228    1.712
EDIT2FF     9.025    8.280    10.750   10.042   11.315   9.882     1.238
EDIT2IFF    8.471    9.934    10.797   10.159   9.697    9.812     0.854
Table D.16 Auto Mileage Dataset RMSE values on Training Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          2.480    2.516    2.555    2.682    2.383    2.523     0.109
ANFIS       1.465    1.911    1.529    1.664    1.427    1.599     0.196
DENFIS      2.448    2.312    2.522    2.786    2.263    2.466     0.207
GFS         1.989    2.252    1.958    2.354    1.972    2.105     0.185
SVM         1.896    2.269    2.077    2.019    1.622    1.977     0.240
DIT2FRB     2.143    2.232    1.974    2.352    1.977    2.136     0.164
T1FF        2.161    2.205    2.055    2.348    2.020    2.158     0.130
T1IFF       2.086    2.093    2.289    2.349    1.914    2.146     0.174
DIT2FF      1.827    1.857    1.696    2.127    1.829    1.867     0.158
DIT2IFF     1.829    1.700    1.472    2.072    1.691    1.753     0.220
ET1FF       2.074    2.173    2.189    2.771    2.061    2.253     0.295
ET1IFF      1.703    2.079    2.778    3.217    1.750    2.305     0.667
EDIT2FF     1.731    1.918    1.963    2.098    1.990    1.940     0.134
EDIT2IFF    1.552    2.619    2.040    2.091    1.841    2.029     0.392
Table D.17 Auto Mileage Dataset RMSE values on Validation Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          2.618    3.371    2.921    3.134    3.095    3.028     0.280
ANFIS       3.498    6.322    3.186    4.564    12.440   6.002     3.802
DENFIS      2.708    2.98     3.747    3.115    3.116    3.113     0.38
GFS         2.319    2.594    2.328    2.995    2.723    2.592     0.285
SVM         2.814    3.307    2.880    3.056    2.943    3.000     0.193
DIT2FRB     2.767    2.655    2.948    3.122    3.237    2.946     0.241
T1FF        2.681    2.652    2.911    3.010    3.178    2.886     0.222
T1IFF       2.556    2.694    2.895    2.999    3.040    2.837     0.206
DIT2FF      2.241    2.319    2.511    2.647    2.854    2.514     0.248
DIT2IFF     2.230    2.180    2.486    2.761    2.731    2.478     0.271
ET1FF       2.469    2.643    2.765    3.429    2.763    2.814     0.364
ET1IFF      2.265    2.710    2.870    3.431    2.472    2.750     0.445
EDIT2FF     2.093    2.460    2.245    1.962    2.657    2.284     0.279
EDIT2IFF    1.954    2.132    2.038    2.223    2.813    2.232     0.340
Table D.18 Auto Mileage Dataset RMSE values on Testing Datasets of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      AVERAGE   Std-Dev
NN          2.938    2.915    3.678    2.699    3.343    3.115     0.392
ANFIS       3.055    3.701    4.311    3.571    5.055    3.938     0.768
DENFIS      2.952    3.009    3.649    2.634    3.192    3.087     0.373
GFS         2.874    3.331    3.475    3.068    3.382    3.226     0.248
SVM         2.734    3.107    3.583    2.386    3.429    3.048     0.493
DIT2FRB     3.059    2.927    3.529    2.604    3.316    3.087     0.356
T1FF        2.701    3.037    3.558    2.313    3.425    3.007     0.513
T1IFF       2.693    2.791    3.481    2.279    3.298    2.908     0.484
DIT2FF      2.681    2.942    3.593    2.474    3.328    3.004     0.458
DIT2IFF     2.681    2.879    3.562    2.401    3.301    2.965     0.468
ET1FF       2.777    2.910    3.500    3.027    3.524    3.148     0.344
ET1IFF      2.603    2.943    4.016    3.032    3.583    3.235     0.560
EDIT2FF     2.702    2.866    3.419    2.727    3.604    3.064     0.419
EDIT2IFF    2.476    3.126    3.601    2.830    2.876    2.982     0.417
Fuzzy Function Parameters of the Optimum Models of the Auto Mileage Dataset

From the experimental trials on the Auto Mileage Dataset using the proposed and benchmark methodologies, the Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF) and the Type-1 Improved Fuzzy Functions (T1IFF) are identified as the optimum methodologies based on error reduction performance. Next, we present the parameters of one optimum model from each of these two methodologies.

Parameters of the Best Cross Validation Models of the First Optimum Methodology, EDIT2IFF

From the experimental trials, the Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions methodology (EDIT2IFF), which utilizes Improved Fuzzy Clustering (IFC), is identified as one of the best methodologies for the Auto Mileage dataset. The parameters of the optimum models obtained from the application of EDIT2IFF are stored in collection tables, i.e., m-Col*, τ-Col*, and Φ-Col*. There is one collection table set, i.e., 〈m-Col*, τ-Col*, Φ-Col*〉, for each cross validation iteration; hence there are five collection table sets, used to do inference on the five different testing datasets of the Auto Mileage Dataset experiment. Here, we will demonstrate the structure of only one of the collection table sets.
m-Col* is a column matrix of the size of the training data, holding the one optimum fuzziness value, m*, of the embedded model identified as optimal for each training vector. Hence the m-Col* table is a (125×1) matrix, where n=125 indicates the number of training vectors of the particular cross validation dataset of the Auto Mileage Dataset. The τ-Col* collection table holds the interim fuzzy function structures, i.e., the optimum membership value transformations, which are used as the only input variables to identify the interim fuzzy functions hk(τi,k, ŵi,k), and their corresponding parameters ŵi,k, k=1,…,n, identified individually for each training data vector. The same interim fuzzy function structure holds for each cluster of an embedded model; thus τ-Col* is a matrix of (125×1) dimensions. The Φ-Col* collection table, on the other hand, holds the local fuzzy function structures, i.e., the parameters Ŵi,k, k=1,…,125, i=1,…,c*, and the structures Φi,k of the local fuzzy functions f(Φi,k, Ŵi,k) of each embedded fuzzy function model identified for each kth training vector. Thus, if n=125 is the number of training vectors of the Auto Mileage Dataset and c* is the optimum number of clusters, then Φ-Col* is a matrix of (125×6) dimensions, where c*=6 is the total number of clusters of the particular cross validation model presented here. The parameters of the optimum EDIT2IFF methodology from the five cross validation models of the Auto Mileage dataset are as follows:

Optimum Parameters of the EDIT2IFF Methodology Obtained from Cross Validation Trials on the Auto Mileage Dataset
1   Model Name                                             EDIT2IFF
2   Fuzzy Clustering Type                                  Improved Fuzzy Clustering
3   Regression Type                                        Non-linear SVM
4   # of clusters                                          {6,7,8}
5   Fuzziness degree                                       [1.23,1.89]
6   Optimum list of membership value transformations
    to be used as additional input variables               (μ), (e^μ), (log(1−μ)/μ)
7   κ (number of nearest training vectors for IFC)         {2}
8   Alpha-cut                                              [0,0.1]
9   C-Regularization                                       {1.26,5.25,10.8,103}
10  Epsilon                                                {0.03,0.09,0.23}
11  m-Col, τ-Col, Φ-Col tables

Each optimum model identifies one optimum C-regularization* and Epsilon* value, determined by the genetic learning process of the EDIT2IFF method. The m collection table, m-Col*, and the interim and local fuzzy function structure collection tables, τ-Col* and Φ-Col*, of one of the optimum models of the cross validation iteration trials, in which the optimum number of clusters is c*=6, are as follows:
(m-Col*)₁₂₅ₓ₁ = [ m1 = 1.69 ; m2 = 1.61 ; m3 = 1.68 ; … ; m125 = 1.59 ]ᵀ

(τ-Col*)₁₂₅ₓ₁ = [ (τ1, ŵ1) ; (τ2, ŵ2) ; (τ3, ŵ3) ; … ; (τ125, ŵ125) ]ᵀ

(Φ-Col*)₁₂₅ₓ₆ =
  [ (Φ1,1, Ŵ*1,1)      …   (Φ6,1, Ŵ*6,1)
         ⋮             ⋱         ⋮
    (Φ1,125, Ŵ*1,125)  …   (Φ6,125, Ŵ*6,125) ]

The fuzzy function parameters of the optimum EDIT2IFF are approximated by the non-linear support vector regression algorithm. Since an optimum embedded model is identified for each training vector k, k=1,…,n, each row of the three collection tables represents the parameters of a single embedded fuzzy function model of the EDIT2IFF model identified for a single training vector. Hence, the parameters of the optimum embedded model of the kth training vector are as follows:

• The kth cell in the m-Col table holds the fuzziness parameter of the corresponding embedded model.
• The kth cell in (τ-Col*)₁₂₅ₓ₁, i.e., (τk, ŵk), k=1,…,125, holds the support vector coefficients of the interim fuzzy functions:
  - The Lagrange multipliers, i.e.,
    ŵk,1 : (α1 − α1*)₁₂₅ₓ₁ = [28.66  −28.66  …  13.77]ᵀ
    ŵk,2 : (α2 − α2*)₁₂₅ₓ₁ = [−28.66  −28.66  …  15.031]ᵀ
    ⋮
    ŵk,6 : (α6 − α6*)₁₂₅ₓ₁ = [−8.007  −8.007  …  18.55]ᵀ
  - The support vectors, i.e., τi,k is the matrix whose kth row is
    [ (μi,k)  (log(1−μi,k)/μi,k) ],  k = 1,…,125.
• The kth cell in (Φ-Col*)₁₂₅ₓ₆, i.e., (Φi,k, Ŵ*i,k), holds the following list of parameters, where the fuzzy functions of the particular cluster i are identified with membership values as additional input variables next to the original input variables:
  - The Lagrange multipliers for each cluster i:
    Ŵi,k : (αi − αi*)₁₂₅ₓ₁ = [12.869  −28.66  5.42  …  −27.48]ᵀ
  - The support vectors, i.e., Φi,k is the matrix whose kth row is
    [ (μi,k)  (xk,1)  (xk,2)  (xk,3)  (xk,4) ],  k = 1,…,125.
Auto Mileage Dataset - Parameters of the Best Cross Validation Models of the Optimum Methodology, T1IFF

From the experimental trials, the Type-1 Improved Fuzzy Functions methodology (T1IFF), which utilizes Improved Fuzzy Clustering (IFC), is identified as one of the best methodologies for the Auto Mileage dataset. Since type-2 fuzzy functions are not utilized in this model, collection tables are not required. The optimum parameters are:

Optimum Parameters of the T1IFF Methodology Obtained from Cross Validation Trials on the Auto Mileage Dataset
1   Model Name                                             T1IFF
2   Fuzzy Clustering Type                                  Improved Fuzzy Clustering
3   Regression Type                                        Non-linear SVM
4   # of clusters                                          {3,5,7}
5   Optimum list of membership value transformations
    to be used as additional input variables               (μ), (e^μ)
6   κ (number of nearest training vectors for IFC)         {3}
7   Alpha-cut                                              [0,0.1]
8   Fuzziness degree                                       [1.3,1.5]
9   Creg                                                   [2]
10  Epsilon                                                [0.1]
The parameters of the optimum T1IFF models shown above are optimized with an exhaustive search. The optimum model is identified based on the average performance of the methods on the five cross validation datasets; therefore, some of the optimum parameters shown above comprise lists or intervals of values. Next, the parameters and fuzzy function structures of one of the optimum models of the cross validation datasets are demonstrated:

• c*: optimum number of clusters = 3
• m*: optimum degree of fuzziness = 1.3
• C-regularization of SVM = 2
• Epsilon* of SVM = 0.1
• The list of optimum membership value transformations to identify the interim and local system fuzzy functions includes the membership values and one transformation, i.e., (μ), (e^μ)
• Optimum interim fuzzy function parameters of each cluster i, used to obtain the improved membership values, are represented with support vector coefficients:
  - The Lagrange multipliers:
    ŵ1 : (α1 − α1*)₁₂₅ₓ₁ = [−2  −2  …  −1.94]ᵀ
    ŵ2 : (α2 − α2*)₁₂₅ₓ₁ = [−2  +2  …  +1.84]ᵀ
    ŵ3 : (α3 − α3*)₁₂₅ₓ₁ = [−2  −2  …  −2]ᵀ
  - The support vectors, i.e., τi is the matrix whose kth row is
    [ (μi,k)  (e^{μi,k}) ],  k = 1,…,125.
• Optimum local fuzzy function parameters, represented by support vector coefficients:
  - The Lagrange multipliers:
    Ŵ1 : (α1 − α1*)₁₂₅ₓ₁ = [−0.39  −2  0.02  …  2]ᵀ
    Ŵ2 : (α2 − α2*)₁₂₅ₓ₁ = [−2  1.4  …  −2]ᵀ
    Ŵ3 : (α3 − α3*)₁₂₅ₓ₁ = [1.96  −2  …  0.66]ᵀ
  - The support vectors, i.e., Φi is the matrix whose kth row is
    [ (μi,k)  (e^{μi,k})  (xk,1)  (xk,2)  (xk,3)  (xk,4) ],  k = 1,…,125.
For each cluster, i=1,…,3, the interim and local fuzzy functions hold the same structures for the particular model of T1IFF as shown above.
D.4 Desulphurization Dataset: Summary of Results

Kernel Density Diagrams of Reagent 1 Based on Categorical Variables

[Kernel density diagrams shown for the categorical variables: Car-Type 1, Car-Type 2, Car-Type 3, Practice 1, Practice 3, Practice 4, Practice 5, Position.]
Table D.19 Desulphurization Reagent1 R² values on Training Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.762    0.780    0.780    0.719    0.705    0.749     0.035
ANFIS       0.913    0.886    0.890    0.839    0.834    0.872     0.034
DENFIS      0.824    0.841    0.819    0.826    0.742    0.810     0.039
GFS         0.856    0.837    0.834    0.801    0.750    0.816     0.041
SVM         0.714    0.728    0.710    0.676    0.656    0.697     0.030
DIT2FRB     0.795    0.810    0.833    0.766    0.796    0.800     0.024
T1FF        0.718    0.728    0.710    0.677    0.655    0.698     0.031
T1IFF       0.735    0.729    0.721    0.680    0.658    0.705     0.034
DIT2FF      0.778    0.785    0.740    0.741    0.729    0.755     0.025
DIT2IFF     0.796    0.843    0.786    0.752    0.749    0.785     0.038
ET1FF       0.721    0.728    0.711    0.676    0.656    0.698     0.031
ET1IFF      0.720    0.728    0.712    0.679    0.658    0.699     0.030
EDIT2FF     0.734    0.821    0.725    0.698    0.681    0.732     0.054
EDIT2IFF    0.772    0.728    0.714    0.678    0.658    0.710     0.044
Table D.20 Desulphurization Reagent1 R² values on Validation Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.664    0.618    0.672    0.703    0.675    0.666     0.031
ANFIS       0.417    0.437    0.467    0.585    0.605    0.502     0.087
DENFIS      0.598    0.520    0.623    0.671    0.557    0.594     0.058
GFS         0.55     0.65     0.581    0.597    0.601    0.596     0.039
SVM         0.646    0.657    0.701    0.728    0.718    0.690     0.036
DIT2FRB     0.626    0.610    0.659    0.672    0.656    0.645     0.026
T1FF        0.651    0.657    0.702    0.728    0.719    0.691     0.035
T1IFF       0.648    0.657    0.702    0.727    0.718    0.690     0.036
DIT2FF      0.716    0.736    0.738    0.767    0.766    0.745     0.022
DIT2IFF     0.705    0.734    0.747    0.765    0.771    0.744     0.027
ET1FF       0.650    0.658    0.702    0.728    0.718    0.691     0.035
ET1IFF      0.651    0.658    0.702    0.728    0.719    0.691     0.035
EDIT2FF     0.678    0.764    0.721    0.745    0.743    0.730     0.033
EDIT2IFF    0.713    0.663    0.702    0.729    0.719    0.705     0.025
Table D.21 Desulphurization Reagent1 R² values on Testing Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.771    0.766    0.750    0.774    0.776    0.767     0.010
ANFIS       0.569    0.536    0.561    0.650    0.642    0.591     0.051
DENFIS      0.707    0.656    0.669    0.726    0.672    0.686     0.029
GFS         0.681    0.679    0.665    0.681    0.684    0.678     0.007
SVM         0.799    0.771    0.795    0.789    0.791    0.789     0.011
DIT2FRB     0.744    0.741    0.745    0.758    0.738    0.745     0.008
T1FF        0.795    0.770    0.795    0.786    0.798    0.789     0.011
T1IFF       0.796    0.770    0.794    0.790    0.798    0.790     0.011
DIT2FF      0.771    0.768    0.796    0.788    0.802    0.785     0.015
DIT2IFF     0.796    0.758    0.792    0.789    0.790    0.785     0.015
ET1FF       0.798    0.774    0.796    0.799    0.796    0.792     0.010
ET1IFF      0.788    0.777    0.800    0.801    0.803    0.794     0.011
EDIT2FF     0.733    0.737    0.801    0.798    0.799    0.773     0.035
EDIT2IFF    0.812    0.805    0.803    0.803    0.804    0.805     0.005
Table D.22 Desulphurization Reagent2 R² values on Training Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.761    0.7671   0.7733   0.7931   0.7773   0.774     0.012
ANFIS       0.914    0.8918   0.8346   0.9191   0.892    0.890     0.034
DENFIS      0.784    0.7701   0.8589   0.8348   0.8654   0.823     0.043
GFS         0.650    0.701    0.780    0.8011   0.709    0.728     0.061
SVM         0.738    0.6724   0.7304   0.7388   0.7377   0.724     0.029
DIT2FRB     0.831    0.7839   0.7882   0.8612   0.8716   0.827     0.040
T1FF        0.744    0.6746   0.7304   0.7407   0.7378   0.726     0.029
T1IFF       0.748    0.691    0.7472   0.7415   0.741    0.734     0.024
DIT2FF      0.806    0.7456   0.7652   0.7716   0.797    0.777     0.024
DIT2IFF     0.816    0.7641   0.8036   0.7882   0.801    0.795     0.020
ET1FF       0.742    0.6736   0.7296   0.7388   0.7532   0.727     0.031
ET1IFF      0.759    0.6935   0.7299   0.7337   0.7336   0.730     0.024
EDIT2FF     0.803    0.6948   0.7417   0.7501   0.7619   0.750     0.039
EDIT2IFF    0.775    0.6731   0.7785   0.7929   0.7918   0.762     0.050
Table D.23 Desulphurization Reagent2 R² values on Validation Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.710    0.6899   0.7209   0.6672   0.7089   0.699     0.021
ANFIS       0.554    0.445    0.6568   0.4243   0.6372   0.543     0.107
DENFIS      0.655    0.6397   0.6057   0.6105   0.6573   0.634     0.024
GFS         0.501    0.531    0.6500   0.4504   0.591    0.544     0.078
SVM         0.704    0.6963   0.7028   0.6973   0.7392   0.708     0.018
DIT2FRB     0.673    0.6668   0.6858   0.6288   0.7176   0.674     0.032
T1FF        0.706    0.7017   0.7027   0.6985   0.7392   0.710     0.017
T1IFF       0.709    0.7036   0.7075   0.6989   0.7393   0.712     0.016
DIT2FF      0.779    0.7506   0.7365   0.7326   0.7971   0.759     0.028
DIT2IFF     0.777    0.7668   0.7511   0.7375   0.7902   0.764     0.021
ET1FF       0.716    0.6992   0.704    0.6983   0.7431   0.712     0.019
ET1IFF      0.694    0.7038   0.7003   0.6843   0.7369   0.704     0.020
EDIT2FF     0.779    0.7212   0.7194   0.7133   0.7696   0.741     0.031
EDIT2IFF    0.741    0.6994   0.7359   0.7187   0.7776   0.735     0.029
Table D.24 Desulphurization Reagent2 R² values on Testing Dataset of Cross Validation (CV) Trials

            CV1      CV2      CV3      CV4      CV5      Average   Stdev
NN          0.784    0.772    0.786    0.7507   0.7744   0.774     0.014
ANFIS       0.670    0.6601   0.6728   0.5082   0.6075   0.624     0.070
DENFIS      0.677    0.6874   0.6903   0.6854   0.688    0.686     0.005
GFS         0.510    0.603    0.623    0.652    0.505    0.5786    0.067
SVM         0.791    0.75     0.7947   0.75     0.7955   0.776     0.024
DIT2FRB     0.758    0.7473   0.7621   0.7351   0.7507   0.751     0.010
T1FF        0.807    0.792    0.7943   0.6211   0.7956   0.762     0.079
T1IFF       0.805    0.7595   0.7762   0.7071   0.7965   0.769     0.039
DIT2FF      0.684    0.7938   0.7901   0.6265   0.802    0.739     0.079
DIT2IFF     0.792    0.7582   0.7644   0.7143   0.7975   0.765     0.033
ET1FF       0.808    0.7721   0.8111   0.6953   0.7926   0.776     0.048
ET1IFF      0.804    0.7948   0.8115   0.8115   0.8045   0.805     0.007
EDIT2FF     0.721    0.7982   0.8135   0.5254   0.8075   0.733     0.122
EDIT2IFF    0.809    0.8005   0.811    0.8118   0.804    0.807     0.005
D.5 Stock Price Datasets: Summary of Results

Stock Price Dataset Variables

Moving Average (MA): The average of stock prices over a certain number of time periods. It is called a moving average because, for each calculation, the latest x periods of data are used to compute the average of the given period. By definition, a moving average lags the market. An exponentially smoothed moving average (EMA) gives greater weight to the more recent data, in an attempt to reduce the lag.

Exponential Moving Average (EMA): Reduces the lag (falling behind) by applying more weight to recent prices relative to older prices, and therefore reacts more quickly to recent price changes than a simple moving average. The calculation formula for the EMA is:

EMA(current) = [Price(current) − EMA(previous)] × Multiplier + EMA(previous),

where the Multiplier is a specified percentage, usually equal to 2/(1+N), N being the specified number of periods. In this thesis, {10, 20, 50}-day periods are used for N. For the first period, the simple moving average was used as the previous period's EMA value.
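A minimal sketch (not from the original text) of the EMA recursion described above, seeded with the simple moving average, together with the MACD line (the difference between the 12-day and 26-day EMAs, described further below):

    import numpy as np

    def ema(prices, n):
        mult = 2.0 / (n + 1.0)                  # the "Multiplier"
        out = np.empty(len(prices))
        out[:n] = prices[:n].mean()             # EMA(0): seed with the simple MA
        for t in range(n, len(prices)):
            out[t] = (prices[t] - out[t - 1]) * mult + out[t - 1]
        return out

    prices = np.cumsum(np.random.default_rng(5).standard_normal(200)) + 100.0
    macd = ema(prices, 12) - ema(prices, 26)    # positive: EMA12 above EMA26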
Relative Strength Index (RSI): The RSI compares the magnitude of a stock's recent gains to the magnitude of its recent losses on a [0-100] scale. The calculation of the RSI is as follows:

RSI = 100 − [100 / (1 + RS)],

where RS_initial = (Average Gain / Average Loss) and the smoothed RS is

Smoothed RS = { [(previous Average Gain) × 13 + Current Gain] / 14 } / { [(previous Average Loss) × 13 + Current Loss] / 14 },

with Average_Gain = (Total Gains / n) and Average_Loss = (Total Losses / n), where n is the number of RSI periods. In short, a 14-day period is used to measure the RSI; for a 14-period RSI, the Average Gain equals the sum of all gains divided by 14.

Bollinger Band (BB): An indicator that allows the user to compare volatility and relative price levels over a period of time [Murphy, 1999]. The indicator consists of three bands: a simple moving average of 20 days, and upper and lower bands two standard deviations away from the SMA-20. For general periods, Bollinger recommends a 10-day simple moving average for the short term, a 20-day simple moving average for the intermediate term, and a 50-day simple moving average for the long term.

Moving Average Convergence/Divergence (MACD): The general formula for the "standard" MACD is the difference between a security's 26-day and 12-day EMAs [Murphy, 1999]. MACD measures the difference between two EMAs: a positive MACD indicates that the 12-day EMA is trading above the 26-day EMA, and a negative MACD indicates that the 12-day EMA is trading below the 26-day EMA.

Chaikin Money Flow (CMF): The CMF is the cumulative total of the accumulation/distribution values over 21 periods divided by the cumulative total of volume over 21 periods. Accumulation/distribution is a volume indicator reflecting the amount of shares traded in a particular stock, and is a direct reflection of the money flowing into and out of the stock [Murphy, 1999]. The basic premise behind the accumulation/distribution line is that the degree of buying or selling pressure can be determined by the location of the close relative to the high and low of the corresponding period (the closing location value). There is buying pressure when a stock closes in the upper half of a period's range and selling pressure when it closes in the lower half of the period's trading range. The closing location value multiplied by volume forms the accumulation/distribution value for each period [Murphy, 1999].
$$ Close\_Location\_Value\ (CLV_t) = \frac{(close_t - low_t) - (high_t - close_t)}{high_t - low_t}, \qquad CMF = \frac{\sum_{t-20}^{t} (CLV_t \times volume_t)}{\sum_{t-20}^{t} volume_t} $$
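The RSI and CMF definitions above translate directly into code. The following sketch (ours, for illustration only; helper names are hypothetical, and the smoothing mirrors the 13/14 weighting given for the RSI):

```python
# Illustrative sketch of the RSI and CMF formulas above (not from the book).
def rsi(prices, n=14):
    """Smoothed n-period Relative Strength Index in [0, 100]."""
    gains  = [max(prices[i] - prices[i-1], 0) for i in range(1, len(prices))]
    losses = [max(prices[i-1] - prices[i], 0) for i in range(1, len(prices))]
    avg_gain, avg_loss = sum(gains[:n]) / n, sum(losses[:n]) / n
    for g, l in zip(gains[n:], losses[n:]):          # (prev_avg*13 + current)/14
        avg_gain = (avg_gain * (n - 1) + g) / n
        avg_loss = (avg_loss * (n - 1) + l) / n
    rs = avg_gain / avg_loss if avg_loss else float("inf")
    return 100.0 - 100.0 / (1.0 + rs)

def cmf(high, low, close, volume, n=21):
    """Chaikin Money Flow over the last n periods."""
    clv = [((c - l) - (h - c)) / (h - l) if h != l else 0.0
           for h, l, c in zip(high, low, close)]
    ad = [c * v for c, v in zip(clv, volume)]        # accumulation/distribution
    return sum(ad[-n:]) / sum(volume[-n:])
```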
TD Canada Trust – Cross Validation Performance Measures

Table D.25 TD Canada Trust Cross Validation (CV) RSTB values for Testing Dataset
Buy_and_Hold profit = $102.28

RSTB        CV1      CV2      CV3      CV4      CV5      Average   Std
NN          99.721   100.93   0        0        0        100.33    0.85
ANFIS       105.87   106.2    0        0        106.72   106.26    0.60
DENFIS      0        0        0        0        0        N/A       N/A
GFS         102.79   98.013   0        0        0        100.40    3.38
SVM         110.69   109.77   112.17   110.66   104.43   109.54    2.99
DIT2FRB     101.75   103.78   103.36   101.02   101.98   102.38    1.15
T1FF        110.15   110.48   114.2    109.6    112.21   111.33    1.88
T1IFF       112.31   111.55   113.55   108.61   107.01   110.61    2.71
DIT2FF      112.48   115.58   114.98   112.26   110.31   113.12    2.15
DIT2IFF     115      114.09   106.82   106.49   105.7    109.62    4.53
ET1FF       112.91   114.07   112.52   109.47   114.62   112.72    2.00
ET1IFF      113.09   114.37   110.46   103.74   114.26   111.18    4.45
EDIT2FF     112.77   109.68   109.69   113.12   114.37   111.93    2.13
EDIT2IFF    115.82   114.34   116.13   115.67   115.23   115.44    0.69
Table D.26 TD Canada Trust Cross Validation (CV) MAPE% for Training Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.25   0.42   0.01   0.21   0.55   0.29     0.21
ANFIS       0.01   0.01   0.31   0.01   0.00   0.07     0.14
DENFIS      0.11   0.13   0.10   0.10   0.11   0.11     0.01
GFS         0.03   0.03   0.04   0.03   0.02   0.03     0.01
SVM         0.25   0.25   0.03   0.28   0.27   0.22     0.10
DIT2FRB     0.03   0.05   0.25   0.03   0.04   0.08     0.09
T1FF        0.22   0.22   0.20   0.22   0.19   0.21     0.01
T1IFF       0.20   0.21   0.20   0.20   0.21   0.20     0.01
T2FF        0.10   0.10   0.10   0.08   0.10   0.10     0.01
T2IFF       0.09   0.09   0.09   0.08   0.09   0.09     0.01
ET1FF       0.23   0.25   0.21   0.25   0.24   0.24     0.02
ET1IFF      0.24   0.22   0.21   0.26   0.24   0.24     0.02
EDIT2FF     0.08   0.09   0.08   0.07   0.10   0.09     0.01
EDIT2IFF    0.09   0.10   0.09   0.08   0.11   0.09     0.01
CV: Cross Validation Experiment
Table D.27 TD Canada Trust Cross Validation (CV) MAPE% for Validation Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.27   0.37   0.12   0.22   0.57   0.31     0.17
ANFIS       0.20   0.13   0.38   0.20   0.13   0.21     0.10
DENFIS      0.16   0.15   0.18   0.15   0.15   0.16     0.01
GFS         0.09   0.10   0.10   0.09   0.11   0.10     0.01
SVM         0.29   0.30   0.13   0.27   0.26   0.25     0.07
DIT2FRB     0.14   0.14   0.28   0.15   0.13   0.17     0.06
T1FF        0.26   0.26   0.25   0.26   0.23   0.25     0.01
T1IFF       0.23   0.24   0.25   0.22   0.22   0.23     0.02
T2FF        0.13   0.13   0.12   0.10   0.10   0.12     0.01
T2IFF       0.12   0.12   0.10   0.10   0.09   0.11     0.02
ET1FF       0.29   0.28   0.24   0.26   0.25   0.27     0.02
ET1IFF      0.30   0.24   0.25   0.27   0.23   0.26     0.03
EDIT2FF     0.11   0.11   0.10   0.10   0.11   0.11     0.01
EDIT2IFF    0.12   0.12   0.11   0.09   0.10   0.11     0.01
Table D.28 TD Canada Trust Cross Validation (CV) MAPE% for Testing Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.82   1.45   0.79   0.46   1.62   1.03     0.49
ANFIS       0.52   4.62   2.17   0.99   0.81   1.82     1.69
DENFIS      1.29   1.29   1.32   1.94   1.26   1.42     0.29
GFS         0.37   1.79   1.87   1.00   1.47   1.30     0.62
SVM         0.19   0.23   0.38   0.18   0.23   0.24     0.08
DIT2FRB     0.59   0.27   0.16   0.82   0.41   0.45     0.26
T1FF        0.18   0.17   0.17   0.21   0.21   0.19     0.02
T1IFF       0.18   0.16   0.18   0.22   0.24   0.20     0.03
T2FF        0.25   0.18   0.23   0.28   0.24   0.23     0.04
T2IFF       0.24   0.27   0.20   0.23   0.24   0.24     0.03
ET1FF       0.19   0.21   0.18   0.24   0.20   0.20     0.02
ET1IFF      0.19   0.21   0.18   0.29   0.21   0.21     0.04
EDIT2FF     0.22   0.22   0.18   0.22   0.20   0.21     0.02
EDIT2IFF    0.20   0.22   0.17   0.25   0.24   0.22     0.03
Table D.29 TD Canada Trust – Market value estimation of two prediction algorithms. The shaded region indicates that from that day forward the two methodologies start to predict different values for the next day's stock prices. (Decision: 1 = buy, −1 = sell.)

                    SVM Model Prediction                         Proposed EDIT2IFF Model Prediction
Day  Actual   Pred.   Dec.  Cash    #Stock  Market$   |   Pred.   Dec.  Cash    #Stock  Market$
0    67.41    67.76   –     100     0       100       |   67.81   –     100     0       100
1    67.96    67.43   1     0.00    1.48    99.63     |   67.42   1     0.00    1.48    99.63
2    67.95    68.03   1     0.00    1.48    100.44    |   68.03   1     0.00    1.48    100.44
3    67.47    68.05   1     0.00    1.48    100.43    |   68.09   1     0.00    1.48    100.43
4    66.55    67.41   -1    100.43  0.00    100.43    |   67.38   -1    100.43  0.00    100.43
5    66.59    66.37   -1    100.43  0.00    100.43    |   66.23   -1    100.43  0.00    100.43
6    66.97    66.52   -1    100.43  0.00    100.43    |   66.39   -1    100.43  0.00    100.43
7    67.03    67.00   1     0.00    1.51    101.00    |   66.92   -1    100.43  0.00    100.43
8    67.27    67.12   1     0.00    1.51    101.09    |   67.05   1     0.00    1.50    100.52
9    67.6     67.41   1     0.00    1.51    101.46    |   67.35   1     0.00    1.50    100.88
10   67.52    67.81   1     0.00    1.51    101.95    |   67.78   1     0.00    1.50    101.37
11   68       67.74   1     0.00    1.51    101.83    |   67.72   1     0.00    1.50    101.25
12   67.35    68.30   1     0.00    1.51    102.56    |   68.30   1     0.00    1.50    101.97
13   67.2     67.47   1     0.00    1.51    101.58    |   67.39   1     0.00    1.50    101.00
14   67.85    67.42   1     0.00    1.51    101.35    |   67.36   1     0.00    1.50    100.77
15   67.42    68.14   1     0.00    1.51    102.33    |   68.14   1     0.00    1.50    101.75
16   67.93    67.69   1     0.00    1.51    101.68    |   67.67   1     0.00    1.50    101.10
17   67.48    68.26   1     0.00    1.51    102.45    |   68.28   1     0.00    1.50    101.87
18   67.77    67.75   1     0.00    1.51    101.77    |   67.72   1     0.00    1.50    101.19
19   67.72    68.07   1     0.00    1.51    102.21    |   68.04   1     0.00    1.50    101.63
20   67.95    68.03   1     0.00    1.51    102.14    |   67.99   1     0.00    1.50    101.55
21   67.81    68.28   1     0.00    1.51    102.48    |   68.25   1     0.00    1.50    101.90
22   68.4     67.98   1     0.00    1.51    102.27    |   67.89   1     0.00    1.50    101.69
23   69.26    68.60   1     0.00    1.51    103.16    |   68.57   1     0.00    1.50    102.57
24   70.04    69.40   1     0.00    1.51    104.46    |   69.43   1     0.00    1.50    103.86
25   69.72    70.07   1     0.00    1.51    105.63    |   70.15   1     0.00    1.50    105.03
26   69.25    69.67   -1    105.63  0.00    105.63    |   69.69   -1    105.03  0.00    105.03
27   69.63    69.13   -1    105.63  0.00    105.63    |   69.19   -1    105.03  0.00    105.03
28   69.74    69.57   -1    105.63  0.00    105.63    |   69.58   -1    105.03  0.00    105.03
29   69.24    69.69   -1    105.63  0.00    105.63    |   69.70   -1    105.03  0.00    105.03
30   69.47    69.15   -1    105.63  0.00    105.63    |   69.11   -1    105.03  0.00    105.03
31   68.93    69.41   -1    105.63  0.00    105.63    |   69.39   -1    105.03  0.00    105.03
32   68.52    68.83   -1    105.63  0.00    105.63    |   68.76   -1    105.03  0.00    105.03
33   68.6     68.42   -1    105.63  0.00    105.63    |   68.41   -1    105.03  0.00    105.03
34   68.7     68.52   -1    105.63  0.00    105.63    |   68.53   -1    105.03  0.00    105.03
35   68.8     68.64   -1    105.63  0.00    105.63    |   68.65   -1    105.03  0.00    105.03
36   69.27    68.78   -1    105.63  0.00    105.63    |   68.72   -1    105.03  0.00    105.03
37   69.2     69.32   1     0.00    1.54    106.35    |   69.31   1     0.00    1.53    105.75
38   68.86    69.32   1     0.00    1.54    106.24    |   69.33   1     0.00    1.53    105.64
39   69.02    68.92   1     0.00    1.54    105.72    |   68.91   1     0.00    1.53    105.12
40   68.62    69.14   1     0.00    1.54    105.97    |   69.13   1     0.00    1.53    105.37
41   68.8     68.66   1     0.00    1.54    105.35    |   68.64   1     0.00    1.53    104.76
42   69.6     68.90   1     0.00    1.54    105.63    |   68.90   1     0.00    1.53    105.03
43   69.92    69.78   1     0.00    1.54    106.86    |   69.83   1     0.00    1.53    106.25
44   70.04    70.08   1     0.00    1.54    107.35    |   70.15   1     0.00    1.53    106.74
45   69.52    70.20   1     0.00    1.54    107.53    |   70.26   1     0.00    1.53    106.92
46   69.58    69.57   1     0.00    1.54    106.73    |   69.58   1     0.00    1.53    106.13
47   69.88    69.73   1     0.00    1.54    106.83    |   69.75   1     0.00    1.53    106.22
48   69.95    70.10   1     0.00    1.54    107.29    |   70.16   1     0.00    1.53    106.68
49   69.6     70.19   1     0.00    1.54    107.39    |   70.26   1     0.00    1.53    106.79
50   70       69.73   1     0.00    1.54    106.86    |   69.75   1     0.00    1.53    106.25
51   70.01    70.24   1     0.00    1.54    107.47    |   70.30   1     0.00    1.53    106.86
52   70.17    70.23   1     0.00    1.54    107.49    |   70.29   1     0.00    1.53    106.88
53   70.29    70.48   1     0.00    1.54    107.73    |   70.56   1     0.00    1.53    107.12
54   69.85    70.63   1     0.00    1.54    107.92    |   70.72   1     0.00    1.53    107.30
55   69.58    69.99   1     0.00    1.54    107.24    |   70.02   1     0.00    1.53    106.63
56   69.67    69.60   1     0.00    1.54    106.83    |   69.56   -1    106.64  0.00    106.64
57   69.59    69.73   1     0.00    1.54    106.96    |   69.74   1     0.00    1.53    106.78
58   69.75    69.64   1     0.00    1.54    106.84    |   69.60   1     0.00    1.53    106.65
59   69.8     69.86   1     0.00    1.54    107.09    |   69.88   1     0.00    1.53    106.90
60   69.58    69.94   1     0.00    1.54    107.16    |   69.96   1     0.00    1.53    106.98
61   69.95    69.62   1     0.00    1.54    106.83    |   69.60   1     0.00    1.53    106.64
62   70.21    70.14   1     0.00    1.54    107.39    |   70.19   1     0.00    1.53    107.21
63   70.74    70.48   1     0.00    1.54    107.79    |   70.56   1     0.00    1.53    107.60
64   70.75    71.04   1     0.00    1.54    108.61    |   71.14   1     0.00    1.53    108.42
65   70.87    71.05   1     0.00    1.54    108.62    |   71.14   1     0.00    1.53    108.43
66   69.8     71.26   1     0.00    1.54    108.81    |   71.37   1     0.00    1.53    108.62
67   69.9     69.69   -1    108.81  0.00    108.81    |   69.64   -1    108.61  0.00    108.61
68   69.7     69.92   1     0.00    1.56    108.97    |   69.89   -1    108.61  0.00    108.61
69   69.45    69.62   -1    108.97  0.00    108.97    |   69.68   -1    108.61  0.00    108.61
70   68.12    69.33   -1    108.97  0.00    108.97    |   69.38   -1    108.61  0.00    108.61
71   69       67.91   -1    108.97  0.00    108.97    |   67.79   -1    108.61  0.00    108.61
72   69.03    69.04   1     0.00    1.60    110.37    |   69.07   1     0.00    1.59    110.01
73   69.20    68.97   -1    110.37  0.00    110.37    |   68.98   -1    110.02  0.00    110.02
74   69.77    69.18   -1    110.37  0.00    110.37    |   69.19   -1    110.02  0.00    110.02
75   69.45    69.86   1     0.00    1.60    111.28    |   69.93   1     0.00    1.59    110.92
76   68.25    69.53   1     0.00    1.60    110.77    |   69.58   1     0.00    1.59    110.41
77   67.60    68.32   1     0.00    1.60    108.86    |   68.22   -1    110.41  0.00    110.41
78   67.95    67.73   1     0.00    1.60    107.82    |   67.58   -1    110.41  0.00    110.41
79   68.01    68.14   1     0.00    1.60    108.38    |   68.05   1     0.00    1.63    110.98
80   69.22    68.14   1     0.00    1.60    108.48    |   68.05   1     0.00    1.63    111.08
81   69.06    69.35   1     0.00    1.60    110.41    |   69.37   1     0.00    1.63    113.06
82   70.12    69.12   1     0.00    1.60    110.15    |   69.12   1     0.00    1.63    112.80
83   69.75    70.09   -1    110.15  0.00    110.15    |   70.18   1     0.00    1.63    114.53
84   69.96    69.64   -1    110.15  0.00    110.15    |   69.67   -1    114.53  0.00    114.53
85   70.17    69.88   -1    110.15  0.00    110.15    |   69.92   -1    114.53  0.00    114.53
86   70.04    70.08   -1    110.15  0.00    110.15    |   70.14   -1    114.53  0.00    114.53
87   69.68    69.95   -1    110.15  0.00    110.15    |   70.00   -1    114.53  0.00    114.53
88   69.84    69.62   -1    110.15  0.00    110.15    |   69.65   -1    114.53  0.00    114.53
89   69.42    69.80   -1    110.15  0.00    110.15    |   69.85   1     0.00    1.64    114.79
90   68.88    69.32   -1    110.15  0.00    110.15    |   69.33   -1    114.79  0.00    114.79
91   69.39    68.81   -1    110.15  0.00    110.15    |   68.75   -1    114.79  0.00    114.79
92   69.47    69.33   -1    110.15  0.00    110.15    |   69.35   -1    114.79  0.00    114.79
93   69.25    69.38   -1    110.15  0.00    110.15    |   69.41   -1    114.79  0.00    114.79
94   69.59    69.20   -1    110.15  0.00    110.15    |   69.22   -1    114.79  0.00    114.79
95   68.90    69.60   1     0.00    1.59    110.69    |   69.66   1     0.00    1.66    115.35
96   69.18    68.87   -1    110.69  0.00    110.69    |   68.89   -1    115.36  0.00    115.36
97   69.17    69.17   -1    110.69  0.00    110.69    |   69.20   1     0.00    1.67    115.82
98   68.95    69.14   -1    110.69  0.00    110.69    |   69.16   -1    115.82  0.00    115.82
Bank of Montréal – Cross Validation Performance Measures

Table D.30 Bank of Montréal Cross Validation (CV) RSTB values for Testing Dataset
Buy_and_Hold profit = $92.91

            CV1     CV2     CV3     CV4     CV5     Average  Std
NN          91.77   84.22   91.20   84.58   89.09   88.17    3.59
ANFIS       0.00    0.00    0.00    0.00    0.00    N/A      N/A
DENFIS      0.00    87.23   85.16   85.77   89.10   86.82    1.76
GFS         0.00    88.22   0.00    0.00    0.00    88.22    N/A
SVM         87.13   85.85   83.57   86.52   83.24   85.26    1.76
DIT2FRB     88.91   87.40   82.39   85.91   86.71   86.26    2.43
T1FF        86.91   84.63   91.17   84.23   83.24   86.04    3.17
T1IFF       84.99   84.63   91.17   84.23   83.57   85.72    3.09
T2FF        93.36   95.96   94.50   92.81   98.21   94.97    2.17
T2IFF       84.38   87.38   87.95   85.44   84.81   85.99    1.59
ET1FF       83.56   89.80   88.54   85.29   84.70   86.38    2.66
ET1IFF      85.50   92.85   88.37   86.45   87.17   88.07    2.87
EDIT2FF     86.07   84.88   84.89   87.59   85.01   85.69    1.17
EDIT2IFF    90.46   87.81   88.33   88.92   88.22   88.75    1.04
Table D.31 Bank of Montréal Cross Validation (CV) MAPE values for Training Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.59   0.64   0.55   0.58   0.58   0.59     0.03
ANFIS       0.33   0.33   0.26   0.33   0.28   0.31     0.03
DENFIS      0.53   0.54   0.50   0.49   0.53   0.52     0.02
GFS         0.46   0.49   0.43   0.45   0.48   0.46     0.03
SVM         0.65   0.65   0.61   0.59   0.63   0.63     0.03
DIT2FRB     0.42   0.44   0.41   0.40   0.47   0.43     0.03
T1FF        0.64   0.65   0.54   0.56   0.63   0.61     0.05
T1IFF       0.64   0.65   0.54   0.56   0.63   0.60     0.05
T2FF        0.41   0.35   0.38   0.32   0.37   0.36     0.03
T2IFF       0.56   0.49   0.41   0.43   0.51   0.48     0.06
ET1FF       0.66   0.59   0.57   0.58   0.63   0.60     0.04
ET1IFF      0.65   0.61   0.57   0.57   0.58   0.60     0.03
EDIT2FF     0.45   0.44   0.45   0.40   0.45   0.44     0.02
EDIT2IFF    0.36   0.39   0.40   0.37   0.50   0.40     0.06
Table D.32 Bank of Montréal Cross Validation (CV) MAPE values for Validation Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.70   0.66   0.71   0.72   0.65   0.69     0.03
ANFIS       1.08   0.99   1.07   1.16   0.82   1.02     0.13
DENFIS      0.65   0.62   0.70   0.74   0.71   0.69     0.05
GFS         0.76   0.64   0.74   0.78   0.66   0.72     0.06
SVM         0.62   0.62   0.68   0.71   0.64   0.65     0.04
DIT2FRB     0.69   0.63   0.69   0.73   0.65   0.68     0.04
T1FF        0.61   0.61   0.69   0.68   0.63   0.64     0.04
T1IFF       0.63   0.61   0.69   0.68   0.64   0.65     0.03
T2FF        0.42   0.38   0.47   0.47   0.41   0.43     0.04
T2IFF       0.53   0.46   0.51   0.54   0.52   0.51     0.03
ET1FF       0.62   0.62   0.68   0.70   0.63   0.65     0.04
ET1IFF      0.65   0.62   0.69   0.70   0.71   0.67     0.04
EDIT2FF     0.47   0.45   0.53   0.52   0.48   0.49     0.04
EDIT2IFF    0.36   0.38   0.50   0.52   0.51   0.45     0.08
Table D.33 Bank of Montréal Cross Validation (CV) MAPE values for Testing Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.91   0.91   0.97   0.93   0.90   0.92     0.03
ANFIS       2.08   2.55   1.94   2.02   3.96   2.51     0.85
DENFIS      1.20   0.92   0.84   0.82   0.93   0.94     0.15
GFS         1.82   1.52   1.27   1.52   2.61   1.75     0.52
SVM         0.84   0.86   0.87   0.86   0.86   0.86     0.01
DIT2FRB     0.86   0.89   0.84   0.92   0.85   0.87     0.03
T1FF        0.84   0.85   0.95   0.90   0.86   0.88     0.04
T1IFF       0.86   0.85   0.95   0.90   0.87   0.88     0.04
T2FF        0.90   0.99   0.91   0.90   0.93   0.93     0.04
T2IFF       0.85   0.82   0.86   0.86   0.84   0.85     0.02
ET1FF       0.86   0.95   0.88   0.86   0.85   0.88     0.04
ET1IFF      0.85   0.89   0.88   0.88   0.88   0.88     0.02
EDIT2FF     0.84   0.84   0.87   0.87   0.89   0.86     0.02
EDIT2IFF    0.90   0.85   0.86   0.87   0.88   0.87     0.02
Sun Life – Cross Validation Performance Measures

Table D.34 Sun Life Cross Validation (CV) RSTB values for Testing Dataset
Buy_and_Hold profit = $95.56

            CV1     CV2     CV3     CV4     CV5     Average  Std
NN          90.93   89.82   88.06   92.36   86.87   88.62    2.09
ANFIS       0.00    0.00    0.00    0.00    0.00    N/A      N/A
DENFIS      84.71   86.79   83.60   92.25   88.56   87.18    3.42
GFS         0.00    0.00    0.00    0.00    0.00    N/A      N/A
SVM         89.50   90.19   88.59   90.50   92.67   90.29    1.52
DIT2FRB     91.85   94.94   85.36   87.27   91.97   90.28    3.88
T1FF        90.20   92.82   87.33   91.02   87.90   89.85    2.26
T1IFF       87.24   90.62   87.07   89.20   88.74   88.57    1.47
T2FF        92.94   98.33   96.57   95.32   96.58   95.95    2.00
T2IFF       86.48   93.70   87.66   83.92   84.54   87.26    3.90
ET1FF       95.08   99.53   96.10   97.17   96.07   96.79    1.70
ET1IFF      87.24   90.62   81.02   89.20   90.59   87.73    4.00
EDIT2FF     87.10   92.31   87.49   88.31   88.62   88.76    2.08
EDIT2IFF    91.97   92.42   89.30   91.23   93.84   91.75    1.67
Table D.35 Sun Life Cross Validation MAPE % for Training Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.75   0.76   0.79   0.77   0.67   0.75     0.04
ANFIS       0.42   0.37   0.40   0.43   0.34   0.39     0.04
DENFIS      0.77   0.69   0.67   0.64   0.67   0.69     0.05
GFS         0.66   0.63   0.61   0.63   0.58   0.62     0.03
SVM         0.78   0.79   0.78   0.79   0.76   0.78     0.01
DIT2FRB     0.59   0.54   0.55   0.54   0.54   0.55     0.02
T1FF        0.77   0.79   0.78   0.78   0.75   0.77     0.02
T1IFF       0.76   0.77   0.75   0.75   0.74   0.75     0.01
T2FF        0.59   0.55   0.50   0.63   0.57   0.57     0.05
T2IFF       0.49   0.60   0.59   0.53   0.58   0.56     0.04
ET1FF       0.74   0.78   0.73   0.76   0.74   0.75     0.02
ET1IFF      0.77   0.77   0.67   0.79   0.73   0.74     0.05
EDIT2FF     0.68   0.66   0.62   0.54   0.66   0.63     0.05
EDIT2IFF    0.65   0.66   0.67   0.67   0.54   0.64     0.06
Table D.36 Sun Life Cross Validation MAPE % for Validation Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.80   0.82   0.81   0.85   0.95   0.85     0.06
ANFIS       1.14   1.25   1.21   1.25   1.40   1.25     0.09
DENFIS      0.86   0.83   0.88   0.85   0.97   0.88     0.05
GFS         0.74   0.80   0.80   0.85   0.98   0.83     0.09
SVM         0.81   0.79   0.80   0.78   0.88   0.81     0.04
DIT2FRB     0.87   0.79   0.86   0.85   0.93   0.86     0.05
T1FF        0.82   0.79   0.80   0.79   0.90   0.82     0.05
T1IFF       0.81   0.79   0.81   0.78   0.90   0.82     0.05
T2FF        0.66   0.60   0.59   0.67   0.74   0.65     0.06
T2IFF       0.52   0.62   0.62   0.54   0.75   0.61     0.09
ET1FF       0.81   0.81   0.83   0.80   0.90   0.83     0.04
ET1IFF      0.82   0.79   0.85   0.79   0.92   0.83     0.06
EDIT2FF     0.72   0.68   0.65   0.61   0.80   0.69     0.07
EDIT2IFF    0.75   0.69   0.74   0.68   0.70   0.71     0.03
Table D.37 Sun Life Cross Validation MAPE % for Testing Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.82   0.89   0.90   0.92   0.96   0.90     0.05
ANFIS       4.78   3.68   3.21   2.69   3.61   3.59     0.77
DENFIS      0.91   1.06   0.99   0.91   0.89   0.95     0.07
GFS         1.45   1.78   2.30   2.69   2.24   2.09     0.48
SVM         0.82   0.81   0.83   0.80   0.81   0.82     0.01
DIT2FRB     0.81   0.88   0.85   0.95   0.79   0.86     0.06
T1FF        0.78   0.82   0.88   0.82   0.84   0.83     0.03
T1IFF       0.80   0.81   0.81   0.80   0.82   0.81     0.01
T2FF        0.84   0.86   0.89   0.84   0.85   0.86     0.02
T2IFF       0.93   0.83   0.89   0.87   0.82   0.87     0.05
ET1FF       0.82   0.80   0.83   0.81   0.84   0.82     0.01
ET1IFF      0.81   0.81   1.09   0.82   0.85   0.87     0.12
EDIT2FF     0.81   0.83   0.87   0.82   0.84   0.83     0.03
EDIT2IFF    0.79   0.83   0.83   0.82   0.84   0.82     0.02
Enbridge – Cross Validation Performance Measures

Table D.38 Enbridge Cross Validation (CV) RSTB values for Testing Dataset
Buy_and_Hold profit = $92.73

            CV1     CV2     CV3     CV4     CV5     Average  Std
NN          85.60   90.31   90.61   87.54   0.00    88.52    2.39
ANFIS       0.00    0.00    0.00    0.00    0.00    0.00     0.00
DENFIS      81.14   85.78   91.10   0.00    91.33   87.34    4.86
GFS         0.00    0.00    0.00    0.00    0.00    N/A      N/A
SVM         88.56   91.33   94.83   87.92   87.65   90.06    3.04
DIT2FRB     0.00    0.00    97.21   0.00    0.00    97.21    N/A
T1FF        87.59   86.90   80.87   87.18   86.58   85.82    2.79
T1IFF       87.38   84.68   85.31   80.31   83.80   84.30    2.59
T2FF        76.91   90.48   91.24   90.45   81.96   86.21    6.44
T2IFF       81.17   91.46   91.94   90.80   86.98   88.47    4.52
ET1FF       77.95   0.00    81.86   87.84   85.64   83.32    4.35
ET1IFF      89.50   92.58   86.35   87.30   91.91   89.53    2.74
EDIT2FF     86.37   88.46   90.93   90.76   80.11   87.33    4.44
EDIT2IFF    93.87   98.61   97.71   96.94   93.41   96.11    2.33
Table D.39 Enbridge Cross Validation (CV) MAPE % for Training Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.84   0.74   0.80   0.79   0.64   0.76     0.08
ANFIS       0.36   0.48   0.40   0.43   0.35   0.40     0.06
DENFIS      0.63   0.75   0.72   0.67   0.68   0.69     0.04
GFS         0.58   0.60   0.65   0.64   0.58   0.61     0.03
SVM         0.83   0.80   0.83   0.83   0.77   0.81     0.03
DIT2FRB     0.57   0.56   0.63   0.63   0.55   0.59     0.04
T1FF        0.74   0.75   0.81   0.82   0.69   0.76     0.05
T1IFF       0.82   0.75   0.79   0.79   0.68   0.77     0.05
T2FF        0.45   0.46   0.53   0.64   0.40   0.49     0.09
T2IFF       0.64   0.56   0.54   0.48   0.44   0.53     0.08
ET1FF       0.79   0.75   0.83   0.83   0.71   0.78     0.05
ET1IFF      0.79   0.75   0.83   0.84   0.75   0.79     0.04
EDIT2FF     0.61   0.52   0.58   0.61   0.47   0.56     0.06
EDIT2IFF    0.55   0.36   0.64   0.57   0.43   0.51     0.11
Table D.40 Enbridge Cross Validation (CV) MAPE % for Validation Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.85   0.88   0.84   0.81   0.92   0.86     0.04
ANFIS       1.75   1.18   1.15   1.44   1.34   1.37     0.24
DENFIS      0.93   0.86   0.82   0.92   0.92   0.89     0.05
GFS         0.99   0.83   0.88   0.89   0.91   0.90     0.06
SVM         0.82   0.85   0.79   0.80   0.88   0.83     0.04
DIT2FRB     0.90   0.89   0.86   0.80   0.92   0.87     0.04
T1FF        0.82   0.84   0.77   0.79   0.86   0.82     0.04
T1IFF       0.82   0.83   0.78   0.78   0.90   0.82     0.05
T2FF        0.56   0.55   0.52   0.60   0.53   0.55     0.03
T2IFF       0.64   0.59   0.52   0.44   0.60   0.56     0.08
ET1FF       0.82   0.84   0.76   0.79   0.86   0.82     0.04
ET1IFF      0.84   0.86   0.76   0.78   0.91   0.83     0.06
EDIT2FF     0.65   0.61   0.57   0.59   0.65   0.61     0.03
EDIT2IFF    0.59   0.49   0.60   0.48   0.58   0.55     0.06
Table D.41 Enbridge Cross Validation (CV) MAPE % for Testing Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          1.02   1.06   1.10   1.10   1.79   1.21     0.33
ANFIS       3.18   1.71   1.90   1.80   2.63   2.24     0.64
DENFIS      1.15   1.21   1.17   1.27   1.15   1.19     0.05
GFS         1.33   1.36   1.35   1.48   1.54   1.41     0.09
SVM         1.04   1.04   1.07   1.09   1.09   1.07     0.02
DIT2FRB     1.27   1.17   1.11   1.20   1.32   1.21     0.08
T1FF        1.14   1.06   1.15   1.14   1.16   1.13     0.04
T1IFF       1.04   1.10   1.14   1.10   1.21   1.12     0.06
T2FF        1.09   1.19   1.07   1.11   1.10   1.11     0.04
T2IFF       1.08   1.04   1.02   1.07   1.07   1.05     0.03
ET1FF       1.09   1.19   1.07   1.08   1.06   1.10     0.05
ET1IFF      1.12   1.06   1.05   1.06   1.06   1.07     0.03
EDIT2FF     1.08   1.02   1.03   1.06   1.07   1.05     0.03
EDIT2IFF    1.09   1.10   1.05   1.07   1.13   1.09     0.03
Loblaw – Cross Validation Performance Measures

Table D.42 Loblaw Cross Validation (CV) RSTB values for Testing Dataset
Buy_and_Hold profit = $100.43

            CV1      CV2      CV3      CV4      CV5      Average   Std
NN          0.00     0.00     0.00     0.00     0.00     N/A       N/A
ANFIS       89.16    93.17    94.51    0.00     0.00     92.28     2.78
DENFIS      88.73    90.99    96.56    0.00     0.00     92.09     4.03
GFS         0.00     0.00     0.00     0.00     0.00     N/A       N/A
SVM         92.12    89.16    94.20    93.50    91.34    92.06     1.97
DIT2FRB     94.84    95.27    98.77    0.00     0.00     96.29     2.16
T1FF        84.12    88.50    96.65    90.00    94.59    90.77     4.98
T1IFF       93.21    99.75    96.16    88.69    91.19    93.80     4.31
T2FF        95.86    96.90    91.46    93.88    94.53    94.53     2.08
T2IFF       89.54    91.10    97.23    91.46    100.21   93.91     4.57
ET1FF       93.83    93.21    96.79    87.84    90.05    92.34     3.47
ET1IFF      93.52    90.82    98.63    88.43    92.78    92.84     3.79
EDIT2FF     92.55    92.36    88.14    92.80    92.07    91.58     1.94
EDIT2IFF    104.11   101.02   102.36   102.74   101.54   102.35    1.19
Table D.43 Loblaw Cross Validation (CV) MAPE % for Training Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.81   0.72   0.87   0.84   0.76   0.80     0.06
ANFIS       0.41   0.35   0.39   0.52   0.43   0.42     0.06
DENFIS      0.70   0.72   0.81   0.72   0.69   0.73     0.05
GFS         0.61   0.60   0.67   0.66   0.63   0.63     0.03
SVM         0.87   0.87   0.96   0.97   0.89   0.91     0.05
DIT2FRB     0.60   0.58   0.64   0.67   0.51   0.60     0.06
T1FF        0.76   0.74   0.84   0.96   0.77   0.81     0.09
T1IFF       0.75   0.75   0.84   0.91   0.88   0.83     0.07
T2FF        0.40   0.45   0.45   0.50   0.51   0.46     0.04
T2IFF       0.47   0.48   0.55   0.69   0.66   0.57     0.10
ET1FF       0.74   0.72   0.81   0.90   0.76   0.79     0.07
ET1IFF      0.83   0.75   0.81   0.88   0.88   0.83     0.06
EDIT2FF     0.76   0.50   0.53   0.58   0.49   0.57     0.11
EDIT2IFF    0.56   0.44   0.47   0.40   0.43   0.46     0.06
Table D.44 Loblaw Cross Validation (CV) MAPE % for Validation Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.89   0.93   0.77   0.80   0.95   0.87     0.08
ANFIS       1.22   2.78   1.63   1.16   1.14   1.59     0.70
DENFIS      0.91   0.85   0.79   0.76   0.87   0.84     0.06
GFS         0.82   0.99   0.78   0.78   0.83   0.84     0.09
SVM         0.91   0.91   0.86   0.77   0.88   0.87     0.06
DIT2FRB     0.89   0.95   0.80   0.82   0.88   0.87     0.06
T1FF        0.85   0.90   0.74   0.78   0.87   0.83     0.07
T1IFF       0.87   0.93   0.75   0.77   0.88   0.84     0.08
T2FF        0.52   0.60   0.40   0.40   0.57   0.50     0.09
T2IFF       0.57   0.61   0.49   0.54   0.67   0.58     0.07
ET1FF       0.86   0.89   0.81   0.77   0.86   0.84     0.05
ET1IFF      0.88   0.96   0.81   0.78   0.88   0.86     0.07
EDIT2FF     0.81   0.64   0.48   0.49   0.55   0.59     0.14
EDIT2IFF    0.66   0.58   0.41   0.36   0.52   0.51     0.12
Table D.45 Loblaw Cross Validation MAPE % for Testing Dataset

            CV1    CV2    CV3    CV4    CV5    Average  Std
NN          0.87   0.96   1.07   1.62   1.54   1.21     0.35
ANFIS       2.40   3.55   4.20   6.47   2.71   3.86     1.62
DENFIS      1.07   0.97   1.06   1.55   1.48   1.23     0.27
GFS         1.61   1.65   1.93   1.55   1.35   1.62     0.21
SVM         0.82   0.85   0.88   0.83   0.83   0.84     0.03
DIT2FRB     0.95   0.96   0.92   1.14   1.47   1.09     0.23
T1FF        0.90   0.87   0.81   0.80   0.99   0.88     0.08
T1IFF       0.91   0.88   0.97   0.85   0.83   0.89     0.06
T2FF        1.06   0.92   0.99   1.03   1.00   1.00     0.05
T2IFF       0.93   0.88   0.88   0.86   0.90   0.89     0.03
ET1FF       1.03   0.91   0.90   0.91   0.93   0.94     0.05
ET1IFF      0.83   0.91   0.99   0.89   0.83   0.89     0.07
EDIT2FF     0.83   0.85   1.65   0.86   0.81   1.00     0.37
EDIT2IFF    0.85   0.84   0.90   0.90   1.02   0.90     0.07
Fuzzy Function Parameters of the Optimum Models of Stock Price Datasets

From the experimental trials on five different stock price datasets, i.e., TD (Toronto Dominion Bank), BMO (Bank of Montreal), SLF (Sun Life Financial), LB (Loblaw) and EMB (Enbridge), the following three proposed methodologies: (i) EDIT2IFF, (ii) DIT2FF, (iii) ET1FF, are determined to be the optimum methodologies based on the Robust Simulated Trading Benchmark (RSTB) performance measure. In particular, EDIT2IFF is the optimum methodology identified by the experimental trials on the TD, EMB and LB stock price datasets; DIT2FF is the optimum methodology identified by the experimental trials on the BMO stock price dataset; and ET1FF is the optimum methodology identified by the experimental trials on the SLF stock price dataset. The common characteristic of these optimum methodologies is that each of them uses linear regression functions to identify the parameters of the interim and local fuzzy functions. In the following, we demonstrate a sample parameter set of optimum models from each of the three best methodologies using the three corresponding stock price datasets.

TD (Toronto Dominion Bank) Stock Price Dataset – Parameters of the Best Cross Validation Models of the First Optimum Methodology, EDIT2IFF
The parameters of the optimum models obtained from the application of EDIT2IFF are stored in collection tables, i.e., m-Col*, τ-Col* and Φ-Col*. One optimum collection table set, i.e., 〈m-Col*, τ-Col*, Φ-Col*〉, is identified as a result of the EDIT2IFF methodology for each cross validation iteration dataset; i.e., five different collection table sets are identified in order to do inference on the five different testing datasets of the TD stock price dataset. Here, the structure of only one of the collection table sets is demonstrated. m-Col* is a column matrix with one row per training data vector; it holds the optimum fuzziness value, m*, of the embedded model identified as optimum for each training vector. Hence the m-Col* table is a (120×1) matrix, where n=120 indicates the number of training vectors of the particular cross validation dataset of the TD stock price dataset. The τ-Col* collection table holds the interim fuzzy function structures, i.e., the optimum membership value transformations which are used as the only input variables to identify the interim fuzzy functions, h_k(τ_ik, ŵ_i,k), and their corresponding parameters, ŵ_i,k, k=1,…,n, which are identified individually for each training data vector. The same interim fuzzy function structure holds for each cluster of an embedded model; thus τ-Col* is a matrix of (120×1) dimensions. The Φ-Col* collection table, on the other hand, holds the local fuzzy function structures, i.e., the parameters, Ŵ_i,k, k=1,…,120, i=1,…,c*, and structures, Φ_i,k, of the local fuzzy functions, f(Φ_ik, Ŵ_i,k), of each embedded fuzzy function model identified for each kth training vector. Thus, if n=120 indicates the number of training vectors of the TD stock price dataset and c* is the optimum number of clusters, then Φ-Col* is a matrix of (120×7) dimensions, in which
c*=7 is the total number of clusters of the particular cross validation model presented here. The parameters of the optimum EDIT2IFF methodology from five different cross validation models of the TD stock price dataset are as follows:

Optimum Parameters of the EDIT2IFF Methodology Obtained from Cross Validation Trials on TD Stock Price Dataset

1  Model Name: EDIT2IFF
2  Fuzzy Clustering Type: Improved Fuzzy Clustering
3  Regression Type: LSE
4  # of clusters: {5, 7, 8}
5  Fuzziness degree: [1.21, 1.75]
6  Optimum list of membership value transformations to be used as additional input variables: (μ)
7  κ (number of nearest training vectors for IFC): {2}
8  Alpha-cut: [0, 0.1]
9  m-Col, τ-Col, Φ-Col tables: displayed below
The m collection table, m-Col*, and the interim and local fuzzy function structure collection tables, τ-Col* and Φ-Col*, of one of the optimum models of the cross validation iteration trials, in which the optimum number of clusters is c*=7, are as follows:

$$(mCol^{*})_{120\times1}=\begin{bmatrix} m_{1}=1.41\\ m_{2}=1.31\\ m_{3}=1.36\\ \vdots\\ m_{120}=1.31 \end{bmatrix},\qquad (\tau Col^{*})_{120\times1}=\begin{bmatrix} (\tau_{1},\hat{w}_{1})\\ (\tau_{2},\hat{w}_{2})\\ (\tau_{3},\hat{w}_{3})\\ \vdots\\ (\tau_{120},\hat{w}_{120}) \end{bmatrix}$$

$$(\Phi Col^{*})_{120\times7}=\begin{bmatrix} (\Phi_{1,1},\hat{W}_{1,1}^{*}) & \cdots & (\Phi_{7,1},\hat{W}_{7,1}^{*})\\ \vdots & \ddots & \vdots\\ (\Phi_{1,120},\hat{W}_{1,120}^{*}) & \cdots & (\Phi_{7,120},\hat{W}_{7,120}^{*}) \end{bmatrix}$$
The fuzzy function parameters of the optimum model are identified with linear least squares estimation (LSE). Each row in the collection tables above corresponds to the parameters of an optimum embedded model identified for the corresponding training vector. Hence, the parameters of the optimum embedded model of a kth training vector are as follows:

• The kth cell in the m-Col table holds the fuzziness parameter of the corresponding embedded model.

• Each cell in (τCol*)_{120×1}, i.e., (τ_k, ŵ_k), k=1,…,120, holds the following list of parameters for the ten-input, single-output training dataset (we demonstrate the parameters of the first row of the collection table). The interim fuzzy function parameters, ŵ_{k=1}, using LSE regression, are:

$$\hat{w}_{k,1}=\hat{w}_{0,k,1}+\hat{w}_{1,k,1}\,\mu_{1k}=-0.67+0.153\,\mu_{1k}\quad(\text{interim fuzzy function of cluster 1})$$
$$\hat{w}_{k,2}=\hat{w}_{0,k,2}+\hat{w}_{1,k,2}\,\mu_{2k}=-0.58-0.49\,\mu_{2k}\quad(\text{interim fuzzy function of cluster 2})$$
$$\vdots$$
$$\hat{w}_{k,7}=\hat{w}_{0,k,7}+\hat{w}_{1,k,7}\,\mu_{7k}=-0.844+1.1093\,\mu_{7k}\quad(\text{interim fuzzy function of cluster 7})$$

Each interim matrix takes the form τ_{k,i} = [1  μ_{k,i}] (1×2).

• Each row in (ΦCol*)_{120×7}, i.e., (Φ_k, Ŵ*_k), holds the following list of parameters, where the fuzzy functions of a particular cluster i are identified with the membership values as additional input variables to the original input variable set. For the first row (k=1), the local fuzzy function coefficients of each cluster, i=1,…,7, are:

$$\hat{W}_{k,i}=\hat{W}_{0,i,k}+\hat{W}_{1,i,k}\,\mu_{ik}+\hat{W}_{2,i,k}\,x_{1}+\cdots+\hat{W}_{11,i,k}\,x_{10}\quad(\text{local fuzzy function of cluster } i)$$

where

$$\hat{W}_{1,k}=0.00356+0.00228\,\mu_{1k}+0.00448\,x_{1}+0.035\,x_{2}+\cdots+0.0084\,x_{10}$$
$$\hat{W}_{2,k}=-0.02+0.034\,\mu_{2k}-0.017\,x_{1}-0.066\,x_{2}+\cdots+0.073\,x_{10}$$
$$\vdots$$
$$\hat{W}_{7,k}=-0.28-0.28\,\mu_{7k}-0.283\,x_{1}-0.2757\,x_{2}-\cdots-0.2745\,x_{10}$$

μ_{ik} indicates the membership value of the kth input vector in each cluster i, and the input matrix of the ith cluster is Φ_{i,k} = [1  μ_{i,k}  x_{k,1} … x_{k,10}] (1×12).
In this model, the local fuzzy functions take on the same structures, which may not be the case in every model. For instance, in the next demonstration, the fuzzy functions of each cluster take on different structures.
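For readers who want to trace how a stored embedded model is evaluated, the following sketch (ours; it assumes standard FCM-style memberships, whereas the book's IFC memberships additionally involve the interim functions ŵ, omitted here for brevity) builds each cluster's augmented input [1, μ_i, x_1, …, x_d] and combines the local outputs weighted by membership:

```python
import numpy as np

# Sketch (ours): evaluate one embedded fuzzy-function model. centers: (c, d)
# cluster centers; W: (c, d+2) local LSE coefficients for [1, mu_i, x_1..x_d];
# m: the fuzziness degree stored in m-Col* for this training vector.
def predict_one(x, centers, W, m):
    d2 = ((centers - x) ** 2).sum(axis=1) + 1e-12          # squared distances
    ratio = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1))
    mu = 1.0 / ratio.sum(axis=1)                           # FCM memberships
    phi = np.hstack([np.ones((len(mu), 1)), mu[:, None],
                     np.tile(x, (len(mu), 1))])            # [1, mu_i, x]
    y_local = (phi * W).sum(axis=1)                        # one output per cluster
    return float((mu * y_local).sum() / mu.sum())          # weighted combination
```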
BMO (Bank of Montreal) Stock Price Dataset – Parameters of the Best Cross Validation Models of the Second Optimum Methodology, DIT2FF

The parameters of the optimum models obtained from the application of DIT2FF are stored in collection table sets, i.e., m-Col* and Φ-Col*. Since DIT2FF applies the Fuzzy C-Means (FCM) clustering method rather than the Improved Fuzzy Clustering algorithm, the interim fuzzy function collection tables, τ-Col*, are not used. There is one collection table set for each cross validation trial, i.e., five different collection table sets in total, in order to do inference on the five different testing datasets of the BMO stock price dataset. We use the optimum models of the BMO stock price dataset for demonstration; here only one sample collection table structure is shown. The m-Col* table, which holds the level of fuzziness value of each training data vector, is a (200×1) matrix, where n=200 indicates the number of training vectors of the particular cross validation dataset of the BMO stock price dataset. The Φ-Col* collection table holds the local fuzzy function structures, i.e., the optimum transformations of the fuzzy functions, the original input variables, and the parameters, Ŵ_i,k, k=1,…,200, of the optimum embedded fuzzy function models identified individually for each training data vector in each cluster. Thus, if n=200 indicates the number of training vectors of the BMO dataset, then Φ-Col* is a matrix of (200×3) dimensions, where c*=3 is the total number of clusters of the particular cross validation model demonstrated here. The optimum parameters of the DIT2FF from five different cross validation models are summarized as follows:

Optimum Parameters of the DIT2FF Methodology Obtained from Cross Validation Trials on BMO Stock Price Dataset

1  Model Name: DIT2FF
2  Fuzzy Clustering Type: Fuzzy c-Means Clustering
3  Regression Type: LSE
4  # of clusters: {3, 5}
5  Fuzziness degree: [1.1, 2.3]
6  Optimum Fuzzy Function structures based on membership value transformations: [(μ), (e^μ)], [(μ), (log(1−μ))]
7  Alpha-cut: [0, 0.1]
8  m-Col, Φ-Col tables: explained below
The fuzzy function parameters of the optimum model are estimated with a linear regression method, i.e., LSE. The m collection table, m-Col*, and the fuzzy function structure collection table, Φ-Col*, of the particular cross validation model, which identifies c*=3, are as follows:
$$(mCol^{*})_{200\times1}=\begin{bmatrix} m_{1}=1.1\\ m_{2}=1.46\\ m_{3}=1.34\\ \vdots\\ m_{200}=2.06 \end{bmatrix},\qquad (\Phi Col^{*})_{200\times3}=\begin{bmatrix} (\Phi_{1,1},\hat{W}_{1,1}^{*}) & \cdots & (\Phi_{3,1},\hat{W}_{3,1}^{*})\\ \vdots & \ddots & \vdots\\ (\Phi_{1,200},\hat{W}_{1,200}^{*}) & \cdots & (\Phi_{3,200},\hat{W}_{3,200}^{*}) \end{bmatrix}$$

Each row of (ΦCol*)_{200×3}, i.e., (Φ_k, Ŵ*_k), k=1,…,200, i=1,2,3, holds a list of parameters similar to the following structure. The local linear fuzzy function coefficients for any row k in ΦCol_{200×3} are:

$$\hat{W}_{1,k}=\hat{W}_{0,1,k}+\hat{W}_{1,1,k}\,\mu_{1k}+\hat{W}_{2,1,k}\,\big(\log(1-\mu_{1k})/\mu_{1k}\big)+\hat{W}_{3,1,k}\,x_{1}+\cdots+\hat{W}_{12,1,k}\,x_{10}\quad(\text{fuzzy function of cluster 1})$$
$$\hat{W}_{2,k}=\hat{W}_{0,2,k}+\hat{W}_{1,2,k}\,\mu_{2k}+\hat{W}_{2,2,k}\,e^{\mu_{2k}}+\hat{W}_{3,2,k}\,x_{1}+\cdots+\hat{W}_{12,2,k}\,x_{10}\quad(\text{fuzzy function of cluster 2})$$
$$\hat{W}_{3,k}=\hat{W}_{0,3,k}+\hat{W}_{1,3,k}\,\mu_{3k}+\hat{W}_{2,3,k}\,\big(\log(1-\mu_{3k})/\mu_{3k}\big)+\hat{W}_{3,3,k}\,x_{1}+\cdots+\hat{W}_{12,3,k}\,x_{10}\quad(\text{fuzzy function of cluster 3})$$

where μ_{ik} indicates the membership value of the kth input vector in each cluster i, and

$$\Phi_{1,k}=\big[1\;\;\mu_{1,k}\;\;\log(1-\mu_{1,k})/\mu_{1,k}\;\;x_{k,1}\;\cdots\;x_{k,10}\big]$$
$$\Phi_{2,k}=\big[1\;\;\mu_{2,k}\;\;e^{\mu_{2,k}}\;\;x_{k,1}\;\cdots\;x_{k,10}\big]$$
$$\Phi_{3,k}=\big[1\;\;\mu_{3,k}\;\;\log(1-\mu_{3,k})/\mu_{3,k}\;\;x_{k,1}\;\cdots\;x_{k,10}\big]$$
It should be noted that the fuzzy function of each cluster i has a different structure, which is identified in the learning stage of the DIT2FF methodology. The ΦCol_{200×3} table holds the parameters of the optimum fuzzy function structures, similar to the sample presented above, for each training vector k, k=1,…,200.

SLF (Sun Life) Stock Price Dataset – Parameters of the Best Cross Validation Models of the Third Optimum Methodology, ET1FF
The ET1FF methodology applies the Fuzzy C-Means (FCM) clustering algorithm and identifies type-1 fuzzy functions. The parameters of ET1FF are optimized with genetic algorithms. Since type-2 fuzzy functions and Improved Fuzzy Clustering (IFC) are not utilized in this model, interim fuzzy function parameters and their collection tables are not required. The optimum parameters of the five different cross validation experiments are presented as follows:
Optimum Parameters of the ET1FF Methodology Obtained from Cross Validation Trials on SLF Stock Price Dataset

1  Model Name: ET1FF
2  Fuzzy Clustering Type: Fuzzy C-Means Clustering (FCM)
3  Regression Type: LSE
4  # of clusters: {7}
5  Optimum Fuzzy Function structures based on membership value transformations: [(μ), (e^μ)], [(μ), (log(1−μ))]
6  Alpha-cut: [0, 0.1]
7  Fuzziness degree: [1.3, 1.5]
8  Fuzzy Function Parameters: presented below
An evolutionary algorithm, i.e., a genetic algorithm, is used to identify the optimum values of the parameters. The optimum model is identified based on the average RSTB performance of the ET1FF methods on five cross validation datasets; therefore, some of the optimum parameters shown above are presented as interval values. Nevertheless, in the following we demonstrate the parameters and fuzzy function structures of one of the optimum models of the cross validation datasets. Since type-1 fuzzy functions are applied, each cluster has the same function structure, as shown below.

• c*: optimum number of clusters = 7
• m*: optimum degree of fuzziness = 1.3641
• Optimum local fuzzy function parameters:

$$\hat{W}_{i}=\hat{W}_{0,i}+\hat{W}_{1,i}\,\mu_{i}+\hat{W}_{2,i}\,e^{\mu_{i}}+\hat{W}_{3,i}\,x_{1}+\cdots+\hat{W}_{12,i}\,x_{10}\quad(\text{fuzzy function of cluster } i),\quad i=1,\ldots,7$$

e.g.,
Ŵ1 = 0.17887 + 0.59479 μ1 − 0.32983 e^μ1 + 0.88881 x1 + 0.30631 x5 + 0.2659 x6 + 0.0083306 x7 + 0.053667 x13 − 0.12532 x16
Ŵ2 = 0.25459 + 0.91971 μ2 − 0.47994 e^μ2 + 0.8591 x1 + 0.26081 x5 + 0.24019 x6 − 0.0079002 x7 + 0.060066 x13 − 0.097193 x16
Ŵ3 = 0.45596 + 1.1088 μ3 − 0.53618 e^μ3 + 0.80768 x1 + 0.24633 x5 + 0.42359 x6 + 0.028646 x7 − 0.030395 x13 − 0.33477 x16
Ŵ4 = −0.1699 − 0.30094 μ4 + 0.17659 e^μ4 + 0.93799 x1 + 0.35528 x5 + 0.24726 x6 − 0.00346 x7 + 0.056868 x13 − 0.035836 x16
Ŵ5 = −0.16343 − 0.63113 μ5 + 0.28429 e^μ5 + 1.0337 x1 + 0.37583 x5 + 0.34892 x6 − 0.042694 x7 − 0.031018 x13 − 0.099643 x16
Ŵ6 = −0.30573 − 0.45222 μ6 + 0.3101 e^μ6 + 0.94044 x1 + 0.35324 x5 + 0.24479 x6 − 0.005607 x7 + 0.053294 x13 − 0.036137 x16
Ŵ7 = 0.069713 + 0.080636 μ7 + 0.014003 e^μ7 + 1.005 x1 + 0.37315 x5 + 0.27847 x6 − 0.038268 x7 + 0.041968 x13 − 0.11763 x16,

where μi indicates the membership values of the input vectors in each cluster i. The rest of the input variables, i.e., x2–x4 and x10, have very small coefficients close to zero. It should be noted that the fuzzy function of each cluster has the same structure, which is identified in the learning stage of the ET1FF methodology.
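Since each cluster's function above is linear in the regressors [1, μ, e^μ, x_1, …], its coefficients can be estimated cluster by cluster with ordinary least squares. A minimal sketch (ours; array names are hypothetical, and the e^μ transformation shown is the one used above):

```python
import numpy as np

# Sketch (ours): LSE fit of one cluster's type-1 fuzzy-function coefficients,
# with the membership transformation e^mu as an extra regressor.
# X: (n, d) inputs, mu: (n,) memberships in this cluster, y: (n,) targets.
def fit_cluster(X, mu, y):
    phi = np.column_stack([np.ones_like(mu), mu, np.exp(mu), X])
    W, *_ = np.linalg.lstsq(phi, y, rcond=None)   # [W0, W1, W2, W3, ...]
    return W
```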
D.6 Classification Datasets: Summary of Results

Cancer Dataset – Cross Validation Performance Measures

Table D.46 Cancer Dataset AUC values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.551   0.681   0.694   0.616   0.568   0.622    0.064
ANFIS         0.525   0.712   0.727   0.556   0.587   0.621    0.093
NN            0.594   0.723   0.667   0.564   0.628   0.635    0.062
SVM           0.596   0.731   0.688   0.712   0.584   0.662    0.068
FKNN          0.288   0.458   0.326   0.357   0.329   0.352    0.064
T1FF-C        0.608   0.777   0.689   0.636   0.614   0.665    0.070
T1IFF-C       0.614   0.727   0.698   0.712   0.622   0.675    0.053
DIT2FF-C      0.629   0.792   0.736   0.730   0.671   0.711    0.063
DIT2IFF-C     0.585   0.729   0.601   0.647   0.624   0.637    0.056
ET1FF-C       0.607   0.765   0.654   0.656   0.608   0.658    0.064
ET1IFF-C      0.594   0.722   0.716   0.685   0.662   0.676    0.052
EDIT2FF-C     0.604   0.772   0.683   0.629   0.635   0.665    0.066
EDIT2IFF-C    0.594   0.716   0.706   0.655   0.643   0.663    0.050
AUC = Area under the ROC Curve.
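The AUC reported throughout this section equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal rank-based computation (ours, for illustration):

```python
# Sketch (ours): AUC via the Mann-Whitney (rank-sum) identity.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```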
Table D.47 Cancer Dataset Accuracy values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.725   0.738   0.763   0.688   0.725   0.728    0.027
ANFIS         0.588   0.725   0.738   0.600   0.688   0.668    0.070
NN            0.675   0.800   0.775   0.738   0.713   0.740    0.050
SVM           0.750   0.775   0.725   0.738   0.713   0.740    0.024
FKNN          0.275   0.275   0.275   0.625   0.275   0.345    0.157
T1FF-C        0.750   0.813   0.750   0.725   0.738   0.755    0.034
T1IFF-C       0.750   0.800   0.738   0.725   0.750   0.753    0.029
DIT2FF-C      0.613   0.738   0.773   0.763   0.575   0.692    0.092
DIT2IFF-C     0.713   0.725   0.625   0.613   0.663   0.668    0.050
ET1FF-C       0.725   0.788   0.663   0.650   0.763   0.718    0.060
ET1IFF-C      0.775   0.763   0.725   0.725   0.688   0.735    0.035
EDIT2FF-C     0.600   0.800   0.650   0.663   0.675   0.678    0.074
EDIT2IFF-C    0.613   0.575   0.588   0.688   0.600   0.788    0.788
Diabetes Dataset – Cross Validation Performance Measures

Table D.48 Diabetes Dataset AUC values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.880   0.735   0.595   0.670   0.730   0.722    0.105
ANFIS         0.707   0.553   0.603   0.707   0.623   0.639    0.067
NN            0.922   0.848   0.803   0.780   0.787   0.828    0.059
SVM           0.938   0.845   0.813   0.860   0.792   0.850    0.056
FKNN          0.630   0.513   0.433   0.467   0.513   0.511    0.074
T1FF-C        0.940   0.852   0.818   0.638   0.792   0.808    0.110
T1IFF-C       0.923   0.827   0.783   0.877   0.827   0.847    0.054
DIT2FF-C      0.935   0.848   0.758   0.833   0.825   0.840    0.063
DIT2IFF-C     0.940   0.863   0.857   0.893   0.855   0.882    0.036
ET1FF-C       0.898   0.828   0.777   0.865   0.802   0.834    0.049
ET1IFF-C      0.937   0.835   0.787   0.892   0.803   0.851    0.063
EDIT2FF-C     0.920   0.833   0.803   0.852   0.802   0.842    0.048
EDIT2IFF-C    0.925   0.813   0.813   0.888   0.805   0.849    0.054
AUC = Area under the ROC Curve.
Table D.49 Diabetes Dataset Accuracy values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.780   0.780   0.580   0.680   0.700   0.704    0.083
ANFIS         0.800   0.660   0.640   0.680   0.660   0.688    0.064
NN            0.820   0.720   0.760   0.690   0.720   0.742    0.050
SVM           0.820   0.720   0.700   0.760   0.700   0.740    0.051
FKNN          0.400   0.400   0.400   0.400   0.400   0.400    0.000
T1FF-C        0.820   0.740   0.740   0.720   0.740   0.752    0.039
T1IFF-C       0.800   0.780   0.740   0.760   0.740   0.764    0.026
DIT2FF-C      0.840   0.760   0.680   0.760   0.760   0.760    0.057
DIT2IFF-C     0.880   0.800   0.760   0.780   0.740   0.792    0.054
ET1FF-C       0.800   0.780   0.700   0.780   0.700   0.752    0.048
ET1IFF-C      0.840   0.740   0.740   0.760   0.720   0.760    0.047
EDIT2FF-C     0.820   0.760   0.740   0.780   0.740   0.768    0.033
EDIT2IFF-C    0.820   0.760   0.760   0.760   0.760   0.772    0.027
Liver Dataset – Cross Validation Performance Measures

Table D.50 Liver Dataset AUC values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.725   0.658   0.675   0.672   0.622   0.670    0.037
ANFIS         0.672   0.582   0.678   0.680   0.775   0.677    0.068
NN            0.678   0.737   0.652   0.650   0.803   0.704    0.066
SVM           0.748   0.690   0.708   0.783   0.687   0.723    0.042
FKNN          0.300   0.210   0.400   0.220   0.320   0.290    0.078
T1FF-C        0.753   0.663   0.518   0.546   0.645   0.625    0.095
T1IFF-C       0.768   0.712   0.755   0.745   0.800   0.756    0.032
DIT2FF-C      0.817   0.788   0.787   0.837   0.815   0.809    0.021
DIT2IFF-C     0.768   0.698   0.742   0.760   0.780   0.750    0.032
ET1FF-C       0.760   0.670   0.630   0.705   0.778   0.709    0.062
ET1IFF-C      0.778   0.705   0.705   0.710   0.808   0.741    0.049
EDIT2FF-C     0.808   0.708   0.752   0.753   0.790   0.762    0.039
EDIT2IFF-C    0.780   0.713   0.725   0.747   0.773   0.748    0.029
AUC = Area under the ROC Curve.
Table D.51 Liver Dataset Accuracy values from testing dataset

              CV1    CV2    CV3    CV4    CV5    Average  Std
LR            0.72   0.72   0.74   0.64   0.62   0.688    0.054
ANFIS         0.74   0.56   0.68   0.68   0.78   0.688    0.083
NN            0.70   0.70   0.70   0.54   0.74   0.676    0.078
SVM           0.74   0.64   0.72   0.66   0.68   0.688    0.041
FKNN          0.40   0.40   0.40   0.40   0.40   0.400    0.000
T1FF-C        0.76   0.66   0.70   0.70   0.76   0.716    0.043
T1IFF-C       0.74   0.68   0.70   0.76   0.68   0.712    0.036
DIT2FF-C      0.80   0.68   0.72   0.76   0.78   0.748    0.048
DIT2IFF-C     0.64   0.60   0.76   0.66   0.72   0.676    0.064
ET1FF-C       0.72   0.64   0.68   0.68   0.68   0.680    0.028
ET1IFF-C      0.74   0.66   0.74   0.70   0.74   0.716    0.036
EDIT2FF-C     0.70   0.72   0.72   0.70   0.70   0.708    0.011
EDIT2IFF-C    0.76   0.70   0.74   0.68   0.68   0.712    0.036
Ion Dataset – Cross Validation Performance Measures

Table D.52 Ion Dataset AUC values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.837   0.803   0.730   0.796   0.746   0.782    0.044
ANFIS         0.917   0.847   0.948   0.840   0.912   0.893    0.047
NN            0.871   0.812   0.924   0.891   0.891   0.878    0.041
SVM           0.861   0.926   0.929   0.931   0.933   0.916    0.031
FKNN          0.504   0.630   0.840   0.706   0.691   0.674    0.122
T1FF-C        0.847   0.904   0.919   0.912   0.898   0.896    0.029
T1IFF-C       0.890   0.959   0.954   0.926   0.926   0.931    0.028
DIT2FF-C      0.950   0.966   0.964   0.964   0.957   0.960    0.007
DIT2IFF-C     0.831   0.954   0.954   0.878   0.908   0.905    0.053
ET1FF-C       0.845   0.903   0.879   0.878   0.913   0.883    0.027
ET1IFF-C      0.838   0.905   0.898   0.910   0.902   0.891    0.030
EDIT2FF-C     0.882   0.908   0.898   0.895   0.938   0.904    0.021
EDIT2IFF-C    0.835   0.907   0.909   0.915   0.937   0.901    0.038
AUC = Area under the ROC Curve.
Table D.53 Ion Dataset Accuracy values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.823   0.797   0.810   0.772   0.797   0.800    0.019
ANFIS         0.873   0.848   0.924   0.886   0.899   0.886    0.028
NN            0.861   0.848   0.886   0.848   0.911   0.871    0.027
SVM           0.797   0.810   0.823   0.835   0.810   0.815    0.014
FKNN          0.646   0.861   0.924   0.873   0.861   0.833    0.108
T1FF-C        0.797   0.848   0.823   0.797   0.810   0.815    0.021
T1IFF-C       0.835   0.861   0.873   0.861   0.848   0.856    0.014
DIT2FF-C      0.899   0.873   0.873   0.886   0.899   0.886    0.013
DIT2IFF-C     0.797   0.873   0.835   0.823   0.848   0.835    0.028
ET1FF-C       0.747   0.861   0.759   0.823   0.823   0.803    0.048
ET1IFF-C      0.759   0.835   0.848   0.835   0.823   0.820    0.035
EDIT2FF-C     0.810   0.861   0.861   0.785   0.835   0.830    0.033
EDIT2IFF-C    0.772   0.848   0.861   0.772   0.873   0.825    0.049
Credit Dataset – Cross Validation Performance Measures

Table D.54 Credit Dataset AUC values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.911   0.903   0.893   0.844   0.773   0.865    0.058
ANFIS         0.841   0.732   0.817   0.763   0.666   0.764    0.070
NN            0.865   0.760   0.825   0.948   0.766   0.833    0.078
SVM           0.933   0.904   0.838   0.964   0.789   0.886    0.072
FKNN          0.662   0.584   0.609   0.672   0.429   0.591    0.098
T1FF-C        0.920   0.917   0.859   0.901   0.825   0.884    0.041
T1IFF-C       0.906   0.885   0.899   0.966   0.817   0.894    0.053
DIT2FF-C      0.893   0.903   0.867   0.974   0.825   0.892    0.055
DIT2IFF-C     0.945   0.927   0.873   0.982   0.830   0.911    0.060
ET1FF-C       0.929   0.903   0.886   0.946   0.804   0.894    0.055
ET1IFF-C      0.933   0.906   0.886   0.977   0.818   0.904    0.059
EDIT2FF-C     0.938   0.880   0.869   0.956   0.792   0.887    0.065
EDIT2IFF-C    0.938   0.899   0.875   0.963   0.776   0.890    0.072
AUC = Area under the ROC Curve.
Table D.55 Credit Dataset Accuracy values from testing dataset

              CV1     CV2     CV3     CV4     CV5     Average  Std
LR            0.840   0.800   0.800   0.820   0.740   0.800    0.037
ANFIS         0.840   0.680   0.780   0.800   0.720   0.764    0.064
NN            0.480   0.380   0.800   0.840   0.720   0.644    0.203
SVM           0.800   0.800   0.800   0.880   0.740   0.804    0.050
FKNN          0.820   0.440   0.440   0.820   0.680   0.640    0.191
T1FF-C        0.860   0.780   0.820   0.840   0.780   0.816    0.036
T1IFF-C       0.840   0.800   0.800   0.880   0.780   0.820    0.040
DIT2FF-C      0.860   0.780   0.840   0.920   0.760   0.832    0.064
DIT2IFF-C     0.880   0.820   0.800   0.940   0.800   0.848    0.061
ET1FF-C       0.880   0.800   0.780   0.840   0.780   0.816    0.043
ET1IFF-C      0.840   0.820   0.780   0.900   0.780   0.824    0.050
EDIT2FF-C     0.880   0.800   0.800   0.920   0.740   0.828    0.072
EDIT2IFF-C    0.860   0.800   0.780   0.900   0.760   0.820    0.058
California Dataset – Cross Validation Performance Measures

Table D.56 California Dataset AUC values from testing dataset

              CV1    CV2    CV3    CV4    CV5    Average  Std
LR            0.85   0.81   0.85   0.88   0.86   0.848    0.025
ANFIS         0.74   0.71   0.76   0.73   0.78   0.746    0.026
NN            0.86   0.88   0.89   0.87   0.85   0.871    0.016
SVM           0.86   0.87   0.88   0.88   0.88   0.875    0.009
FKNN          0.69   0.70   0.69   0.70   0.71   0.698    0.009
T1FF-C        0.87   0.88   0.87   0.89   0.88   0.878    0.009
T1IFF-C       0.87   0.88   0.88   0.89   0.88   0.878    0.009
DIT2FF-C      0.89   0.90   0.90   0.90   0.90   0.898    0.003
DIT2IFF-C     0.86   0.87   0.88   0.88   0.88   0.874    0.008
ET1FF-C       0.89   0.89   0.87   0.89   0.89   0.885    0.010
ET1IFF-C      0.88   0.89   0.89   0.89   0.89   0.890    0.005
EDIT2FF-C     0.90   0.90   0.90   0.90   0.90   0.900    0.003
EDIT2IFF-C    0.88   0.88   0.88   0.89   0.89   0.883    0.003
AUC = Area under the ROC Curve.
Table D.57 California Dataset Accuracy values from testing dataset

              CV1    CV2    CV3    CV4    CV5    Average  Std
LR            0.77   0.73   0.77   0.80   0.78   0.769    0.025
ANFIS         0.70   0.68   0.70   0.69   0.73   0.701    0.019
NN            0.79   0.80   0.81   0.80   0.78   0.797    0.013
SVM           0.78   0.78   0.80   0.80   0.80   0.794    0.011
FKNN          0.87   0.87   0.89   0.86   0.91   0.879    0.018
T1FF-C        0.78   0.80   0.81   0.80   0.81   0.799    0.010
T1IFF-C       0.79   0.80   0.81   0.81   0.80   0.800    0.010
DIT2FF-C      0.82   0.82   0.82   0.82   0.82   0.820    0.002
DIT2IFF-C     0.78   0.79   0.81   0.80   0.80   0.795    0.011
ET1FF-C       0.81   0.81   0.80   0.81   0.82   0.810    0.006
ET1IFF-C      0.80   0.81   0.82   0.82   0.81   0.811    0.007
EDIT2FF-C     0.82   0.82   0.82   0.83   0.83   0.823    0.002
EDIT2IFF-C    0.80   0.80   0.81   0.81   0.81   0.803    0.004
Fuzzy Function Parameters of the Optimum Models of Classification Datasets

From the experimental trials on six different classification datasets, i.e., Ion (Ionosphere Dataset), Credit (Credit Scoring Dataset), Cancer (Breast Cancer Dataset), Liver (Liver Diseases Dataset), Diabetes, and California (California Housing Information Dataset), the following three proposed methodologies: (i) DIT2IFF, (ii) DIT2FF, (iii) EDIT2FF, are identified as the optimum methodologies based on the Area Under the ROC (Receiver Operating Characteristic) Curve (AUC) performance measure. In particular, DIT2IFF is the optimum methodology identified by the experimental trials on the Diabetes and Credit datasets; DIT2FF is the optimum methodology identified by the experimental trials on the Liver, Ion and Cancer datasets; and EDIT2FF is the optimum methodology identified by the experimental trials on the California dataset. Five different cross validation trials are executed on each dataset; therefore, five different optimum models are obtained from these optimum methodologies. The optimum models are identified based on the average AUC values over the five cross validation model performances. In these trials, some of the optimum models identify linear and some identify non-linear fuzzy classification functions, using logistic regression and support vector classification methods respectively. In the following, we demonstrate a sample optimum parameter set for the DIT2IFF methodology on the Diabetes dataset.

Pima Diabetes Dataset – Parameters of the Best Cross Validation Models of the Optimum Methodology, DIT2IFF
The parameters of the optimum models obtained from the application of DIT2IFF are stored in collection tables, i.e., m-Col*, τ-Col* and Φ-Col*. One optimum collection table set, i.e., 〈m-Col*, τ-Col*, Φ-Col*〉, is identified from the application of the DIT2IFF methodology on each cross validation iteration dataset; i.e., five different collection table sets are obtained from the five different cross validation models in order to do inference on the five different testing datasets of the Pima Diabetes dataset. Here, only one sample collection table set structure is demonstrated. m-Col* is the collection table that holds the m* value of the optimum embedded T1IFF model identified for each training vector; hence the m-Col* table is a (125×1) matrix, where n=125 indicates the number of training vectors in the corresponding cross validation dataset. The τ-Col* collection table holds the interim fuzzy classifier function parameters, i.e., the optimum membership value transformations used as input variables to identify the interim fuzzy classifier function parameters, ŵ_k, k=1,…,n, of the optimum embedded IFC model, which are identified individually for each training data vector from among the embedded models; each cluster uses the same interim fuzzy function in the IFC algorithm. Thus, τ-Col* is a matrix of (125×1) dimensions. The Φ-Col* collection table, on the other hand, holds the local fuzzy classifier function structures, i.e., the parameters, Ŵ_ik, k=1,…,125, i=1,…,c*, and structures, Φ_i,k, of the local fuzzy classifier functions used to obtain the class label probabilities, p(y_k=1|Φ_ik, Ŵ_ik), of each embedded fuzzy classifier function model identified for each kth training vector. Thus, if n=125 indicates the number of training vectors of the Pima Diabetes dataset, then Φ-Col* is a matrix of (125×2) dimensions, in which c*=2 is the total number of clusters of the presented optimum model among the five optimum models of the cross validation trials. The optimum parameters of the DIT2IFF obtained from these five different cross validation models are as follows:

Optimum Parameters of the DIT2IFF Methodology Obtained from Cross Validation Trials on Diabetes Dataset
1  Model Name: DIT2IFF
2  Fuzzy Clustering Type: Improved Fuzzy Clustering
3  Regression Type: LSE and SVM
4  # of clusters: {2}
5  Fuzziness degree: [1.1, 1.8]
6  Optimum list of membership value transformations to be used as additional input variables: [(μ), (e^μ)], [(μ), (log(1−μ))]
7  κ (number of nearest training vectors for IFC): {2}
8  Alpha-cut: [0, 0.1]
9  m-Col, τ-Col, Φ-Col tables: displayed below
The m collection table, m-Col, and the interim and local fuzzy function structure collection tables, τ-Col* and Φ-Col*, of one of the optimum models of the cross validation iteration trials, in which the optimum number of clusters is c*=2 and the alpha-cut is 0.04, are as follows:
$$(mCol^{*})_{125\times1}=\begin{bmatrix} m_{1}=1.1\\ m_{2}=1.52\\ m_{3}=1.8\\ \vdots\\ m_{125}=1.1 \end{bmatrix},\qquad (\tau Col^{*})_{125\times1}=\begin{bmatrix} (\tau_{1},\hat{w}_{1})\\ (\tau_{2},\hat{w}_{2})\\ (\tau_{3},\hat{w}_{3})\\ \vdots\\ (\tau_{125},\hat{w}_{125}) \end{bmatrix}$$

$$(\Phi Col^{*})_{125\times2}=\begin{bmatrix} (\Phi_{1,1},\hat{W}_{1,1}^{*}) & (\Phi_{2,1},\hat{W}_{2,1}^{*})\\ \vdots & \vdots\\ (\Phi_{1,125},\hat{W}_{1,125}^{*}) & (\Phi_{2,125},\hat{W}_{2,125}^{*}) \end{bmatrix}$$
The fuzzy function parameters of the optimum model are identified with logistic regression (LR). Each cell in (τCol*)_{125×1}, i.e., (τ_k, ŵ_k), k=1,…,125, holds the following list of parameters for the eight-input, single-output training dataset, where the output is the class label indicating the disease (we demonstrate the parameters of the first row of the τCol_{125×1} collection table):
• Interim fuzzy classifier function parameters, ŵ_{k=1}, m*=1.1, using LR regression:

$$\hat{p}_{1,k}=1\Big/\Big(1+e^{-(-0.67+0.153\,\mu_{1k})}\Big)\quad(\text{interim fuzzy classifier function of cluster 1})$$
$$\hat{p}_{2,k}=1\Big/\Big(1+e^{-(-0.58-0.49\,\mu_{2k})}\Big)\quad(\text{interim fuzzy classifier function of cluster 2})$$

• Each interim matrix takes the form τ_{i,k} = [1  μ_{i,k}] (1×2).

• Each row in (ΦCol*)_{125×2}, i.e., (Φ_k, Ŵ*_k), holds the following list of parameters, where the fuzzy classifier functions of a particular cluster i are identified with the membership values as additional input variables onto the original input variable set. In the following, we show the parameters of the local fuzzy classifier functions of the first row (k=1) in (ΦCol*)_{125×2}, as identified with LR. The local fuzzy classifier function coefficients of each cluster, i=1,2, k=1, are:

$$P_{i=1,k}=1\Big/\Big(1+e^{-\left(\hat{W}_{0,1,k}+\hat{W}_{1,1,k}\mu_{1k}+\hat{W}_{2,1,k}e^{\mu_{1k}}+\hat{W}_{3,1,k}x_{1}+\cdots+\hat{W}_{11,1,k}x_{8}\right)}\Big)\quad(\text{fuzzy classifier function of cluster 1})$$
$$P_{i=2,k}=1\Big/\Big(1+e^{-\left(\hat{W}_{0,2,k}+\hat{W}_{1,2,k}\mu_{2k}+\hat{W}_{2,2,k}\left(\ln(1-\mu_{2k})/\mu_{2k}\right)+\hat{W}_{3,2,k}x_{1}+\cdots+\hat{W}_{11,2,k}x_{8}\right)}\Big)\quad(\text{fuzzy classifier function of cluster 2})$$

where

$$P_{1,k}=1\Big/\Big(1+e^{-\left(-4.8607-4.9068\,\mu_{1k}+1.945\,e^{\mu_{1k}}+0.522\,x_{1}+2.34\,x_{2}+0.21\,x_{3}+0.31\,x_{4}+2.17\,x_{5}+3.24\,x_{6}+1.033\,x_{7}+0.306\,x_{8}\right)}\Big)$$
$$P_{2,k}=1\Big/\Big(1+e^{-\left(-4.184+0.82\,\mu_{2k}-0.213\left(\ln(1-\mu_{2k})/\mu_{2k}\right)+0.363\,x_{1}+0.21\,x_{3}+0.536\,x_{4}-1.8984\,x_{5}+2.796\,x_{6}+0.81\,x_{7}+0.2436\,x_{8}\right)}\Big)$$

μ_{ik} indicates the membership value of the kth input vector in each cluster i, and

$$\Phi_{i=1,k}=\big[1\;\;\mu_{1,k}\;\;e^{\mu_{1,k}}\;\;x_{k,1}\;\;x_{k,2}\;\cdots\;x_{k,8}\big]$$
$$\Phi_{i=2,k}=\big[1\;\;\mu_{2,k}\;\;\ln(1-\mu_{2,k})/\mu_{2,k}\;\;x_{k,1}\;\;x_{k,2}\;\cdots\;x_{k,8}\big]$$
In this model, the local fuzzy classifier functions take on different structures in different clusters.
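Using the coefficients listed above, the class-1 probability for a test vector can be traced by hand. The sketch below (ours) evaluates the two local fuzzy classifier functions; combining them by membership-weighted averaging is an assumption made for illustration, as the exact aggregation step is defined in the main chapters:

```python
import math

# Sketch (ours): class-1 probability from the two local fuzzy classifier
# functions above; the weighted-average aggregation is an assumption.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class1_probability(x, mu):          # x = (x1..x8), mu = (mu1, mu2)
    z1 = (-4.8607 - 4.9068*mu[0] + 1.945*math.exp(mu[0]) + 0.522*x[0]
          + 2.34*x[1] + 0.21*x[2] + 0.31*x[3] + 2.17*x[4] + 3.24*x[5]
          + 1.033*x[6] + 0.306*x[7])
    z2 = (-4.184 + 0.82*mu[1] - 0.213*math.log((1 - mu[1])/mu[1])
          + 0.363*x[0] + 0.21*x[2] + 0.536*x[3] - 1.8984*x[4]
          + 2.796*x[5] + 0.81*x[6] + 0.2436*x[7])
    p1, p2 = sigmoid(z1), sigmoid(z2)   # local class-1 probabilities
    return (mu[0]*p1 + mu[1]*p2) / (mu[0] + mu[1])
```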
D.7 Cluster Validity Index Graphs

Fig. D.2 Diabetes Dataset CVI graphs of the IFC-C method for two different fuzziness values, m = {1.3, 2.0} (x-axis: number of clusters; y-axis: CVI)
Fig. D.3 Ion Dataset CVI graphs of the IFC-C method for two different fuzziness values, m = {1.3, 2.0}
Fig. D.4 Liver Dataset CVI graphs of the IFC-C method for two different fuzziness values, m = {1.3, 2.0}
Fig. D.5 Credit Dataset CVI graphs of the IFC-C method for two different fuzziness values, m = {1.3, 2.0}
Fig. D.6 Cancer Dataset CVI graphs of the IFC-C method for two different fuzziness values, m = {1.3, 2.0}
D.8 Classification Datasets – ROC Graphs

Fig. D.7 Ionosphere Dataset – Receiver Operating Characteristic (ROC) graphs (Sensitivity vs. 1−Specificity) of the optimum model DIT2FF (AUC=0.96) versus (a) Logistic Regression (AUC=0.78), (b) ANFIS (AUC=0.89), (c) Neural Networks (AUC=0.88), (d) Support Vector Machines (AUC=0.91)
Fig. D.8 Credit Dataset – Receiver Operating Characteristic (ROC) graphs of the optimum model DIT2IFF (AUC=0.911) versus (a) Logistic Regression (AUC=0.86), (b) ANFIS (AUC=0.77), (c) Neural Networks (AUC=0.83), (d) Support Vector Machines (AUC=0.89)
Fig. D.9 Cancer Dataset – Receiver Operating Characteristic (ROC) graphs of the optimum model DIT2FF (AUC=0.73) versus (a) Logistic Regression (AUC=0.62), (b) ANFIS (AUC=0.62), (c) Neural Networks (AUC=0.64), (d) Support Vector Machines (AUC=0.66)
Fig. D.10 Liver Dataset – Receiver Operating Characteristic (ROC) graphs of the proposed DIT2FF (AUC=0.81) versus (a) Logistic Regression (AUC=0.73), (b) ANFIS (AUC=0.68), (c) Neural Networks (AUC=0.70), (d) Support Vector Machines (AUC=0.72)
0.8
0.8
0.6 0.4 LR AUC=0.72
0.2 0
Sensitivity
Sensitivity
1
0
0.5 1-Specificity
0.6 0.4 ANFIS AUC=0.64
0.2
DIT2IFF AUC=0.88
DIT2IFF AUC=0.88
0
1
0
1
1
0.8
0.8 Sensitivity
Sensitivity
Fig. D.10 Liver Dataset – Receiver Operating Characteristic (ROC) Graphs of Proposed DIT2FF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines
0.6 0.4 NN AUC=0.83
0.2 0
0
0.5 1-Specificity
1
1
0.6 0.4 SVM AUC=0.85
0.2
DIT2IFF AUC=0.88
0.5 1-Specificity
0
DIT2IFF AUC=0.88
0
0.5 1-Specificity
1
Fig. D.11 Diabetes Dataset- Receiver Operating Characteristic (ROC) Graphs of Proposed DIT2IF versus (a) Logistic Regression, (b) ANFIS, (c) Neural Networks, (d) Support Vector Machines