Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
5632
Petra Perner (Ed.)
Machine Learning and Data Mining in Pattern Recognition

6th International Conference, MLDM 2009
Leipzig, Germany, July 23-25, 2009
Proceedings
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editor
Petra Perner
Institute of Computer Vision and Applied Computer Sciences (IBaI)
Kohlenstr. 2, 04107 Leipzig, Germany
E-mail:
[email protected]
Library of Congress Control Number: 2009930146
CR Subject Classification (1998): I.2, F.4, I.4, I.5, H.3
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-03069-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03069-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12714995 06/3180 543210
Preface
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.

Karl Marx, A Universal Genius of the 19th Century
During the two years since MLDM 2007, many scientists from all over the world have come along on the stony way to the sunny summit of science and have worked hard on new ideas and applications in the area of data mining in pattern recognition. Our thanks go to all those who took part in this year's MLDM. We appreciate their submissions and the ideas shared with the Program Committee.

We received over 205 submissions from all over the world to the International Conference on Machine Learning and Data Mining, MLDM 2009. The Program Committee carefully selected the best papers for this year's program and gave detailed comments on each submitted paper. There were 63 papers selected for oral presentation and 17 papers for poster presentation.

The topics range from theoretical topics for classification, clustering, association rule and pattern mining to specific data-mining methods for the different multimedia data types such as image mining, text mining, video mining and Web mining. Among these topics this year were special contributions to subtopics such as attribute discretization and data preparation, novelty and outlier detection, and distances and similarities. The section "Aspects of Data Mining" presents work that considers automating the whole data-mining process, and a further section covers data mining in medicine.

Four papers were selected for the Best Paper Award. The final decision on the Best Paper was made during the conference and was based on the review results, the presentation style and the discussion. Please check www.mldm.de for information about the nominated papers and the final decision.

We thank all members of the Institute of Applied Computer Sciences, Leipzig, Germany (www.ibai-institut.de) who handled the conference secretariat. We appreciate the help and understanding of the editorial staff at Springer, and in particular Alfred Hofmann, who supported the publication of these proceedings in the LNAI series.

Last, but not least, we wish to thank all the speakers and participants who contributed to the success of the conference. We also encourage those whose papers could not be accepted for MLDM 2009 to address the reviewers' comments and to resubmit to our next conference. See you at MLDM 2011 (www.mldm.de).
July 2009
Petra Perner
International Conference on Machine Learning and Data Mining in Pattern Recognition MLDM 2009
Chair Petra Perner
Institute of Computer Vision and Applied Computer Sciences IBaI Leipzig, Germany
Program Committee

Agnar Aamodt (NTNU, Norway)
Jacky Baltes (University of Manitoba, Canada)
Max Bramer (University of Portsmouth, UK)
Horst Bunke (University of Bern, Switzerland)
Krzysztof Cios (University of Colorado, USA)
Christoph F. Eick (University of Houston, USA)
Ana Fred (Technical University of Lisbon, Portugal)
Giorgio Giacinto (University of Cagliari, Italy)
Makoto Haraguchi (Hokkaido University Sapporo, Japan)
Robert J. Hilderman (University of Regina, Canada)
Tin Kam Ho (Bell Laboratories, USA)
Atsushi Imiya (Chiba University, Japan)
Horace Ip (City University, Hong Kong)
Abraham Kandel (University of South Florida, USA)
Dimitrios A. Karras (Chalkis Institute of Technology, Greece)
Adam Krzyzak (Concordia University, Montreal, Canada)
Longin Jan Latecki (Temple University Philadelphia, USA)
Tao Li (Florida International University, USA)
Brian Lovell (University of Queensland, Australia)
Mariofanna Milanova (University of Arkansas at Little Rock, USA)
Thang V. Pham (University of Amsterdam, The Netherlands)
Maria da Graca Pimentel (Universidade de São Paulo, Brazil)
Petia Radeva (Universitat Autonoma de Barcelona, Spain)
Michael Richter (University of Calgary, Canada)
Fabio Roli (University of Cagliari, Italy)
Linda Shapiro (University of Washington, USA)
Sameer Singh (Loughborough University, UK)
David Steinberg (Tel Aviv University, Israel)
Francesco Tortorella (Università degli Studi di Cassino, Italy)
Patrick Wang (Northeastern University, USA)
Additional Reviewers

Indriyati Atmosukarto (University of Washington, USA)
Jiun-Hung Chen (University of Washington, USA)
Natalia Larios (University of Washington, USA)
Dingding Liu (University of Washington, USA)
Marta Penas Centeno (University of Washington, USA)
Jia Wu (University of Washington, USA)
Sara Rolfe (University of Washington, USA)
Shulin Yang (University of Washington, USA)
Mabel Raza (University of Washington, USA)
Rosalia Tungaraza (University of Washington, USA)
Katarzyna Wilamowska (University of Washington, USA)
Tore Amble (NTNU, Norway)
Richard Blake (NTNU, Norway)
Tore Bruland (NTNU, Norway)
Tor Gunnar Houeland (NTNU, Norway)
Dung Manh Chu (University of Amsterdam, The Netherlands)
Francesco Fontanella (Università degli Studi di Cassino, Italy)
Claudio Marrocco (Università degli Studi di Cassino, Italy)
Mario Molinara (Università degli Studi di Cassino, Italy)
Maria Teresa Ricamato (Università degli Studi di Cassino, Italy)
Alessandra Scotto di Freca (Università degli Studi di Cassino, Italy)
Paolo Simeone (Università degli Studi di Cassino, Italy)
Davide Ariu (University of Cagliari, Italy)
Battista Biggio (University of Cagliari, Italy)
Luca Didaci (University of Cagliari, Italy)
Giorgio Fumera (University of Cagliari, Italy)
Gian Luca Marcialis (University of Cagliari, Italy)
Daniele Muntoni (University of Cagliari, Italy)
Ignazio Pillai (University of Cagliari, Italy)
Luca Piras (University of Cagliari, Italy)
Ajita Rattani (University of Cagliari, Italy)
Roberto Tronci (University of Cagliari, Italy)
Table of Contents
Attribute Discretization and Data Preparation

Improved Comprehensibility and Reliability of Explanations via Restricted Halfspace Discretization (Klaus Truemper) ..... 1
Selection of Subsets of Ordered Features in Machine Learning (O. Seredin, A. Kopylov, and V. Mottl) ..... 16
Combination of Vector Quantization and Visualization (Olga Kurasova and Alma Molytė) ..... 29
Discretization of Target Attributes for Subgroup Discovery (Katherine Moreland and Klaus Truemper) ..... 44
Preserving Privacy in Time Series Data Classification by Discretization (Ye Zhu, Yongjian Fu, and Huirong Fu) ..... 53
Using Resampling Techniques for Better Quality Discretization (Taimur Qureshi and Djamel A. Zighed) ..... 68

Classification

A Large Margin Classifier with Additional Features (Xinwang Liu, Jianping Yin, En Zhu, Guomin Zhang, Yubin Zhan, and Miaomiao Li) ..... 82
Sequential EM for Unsupervised Adaptive Gaussian Mixture Model Based Classifier (Bashar Awwad Shiekh Hasan and John Q. Gan) ..... 96
Optimal Double-Kernel Combination for Classification (Feng Wang and Hongbin Zhang) ..... 107
Efficient AdaBoost Region Classification (M. Moed and E.N. Smirnov) ..... 123
A Linear Classification Method in a Very High Dimensional Space Using Distributed Representation (Takao Kobayashi and Ikuko Shimizu) ..... 137
PMCRI: A Parallel Modular Classification Rule Induction Framework (Frederic Stahl, Max Bramer, and Mo Adda) ..... 148
Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method (Roberto Tronci, Giorgio Giacinto, and Fabio Roli) ..... 163
ODDboost: Incorporating Posterior Estimates into AdaBoost (Olga Barinova and Dmitry Vetrov) ..... 178

Ensemble Classifier Learning

Ensemble Learning: A Study on Different Variants of the Dynamic Selection Approach (João Mendes-Moreira, Alipio Mario Jorge, Carlos Soares, and Jorge Freire de Sousa) ..... 191
Relevance and Redundancy Analysis for Ensemble Classifiers (Rakkrit Duangsoithong and Terry Windeatt) ..... 206
Drift-Aware Ensemble Regression (Frank Rosenthal, Peter Benjamin Volk, Martin Hahmann, Dirk Habich, and Wolfgang Lehner) ..... 221
Concept Drifting Detection on Noisy Streaming Data in Random Ensemble Decision Trees (Peipei Li, Xuegang Hu, Qianhui Liang, and Yunjun Gao) ..... 236
Association Rules and Pattern Mining

Mining Multiple Level Non-redundant Association Rules through Two-Fold Pruning of Redundancies (Corrado Loglisci and Donato Malerba) ..... 251
Pattern Mining with Natural Language Processing: An Exploratory Approach (Ana Cristina Mendes and Cláudia Antunes) ..... 266
Is the Distance Compression Effect Overstated? Some Theory and Experimentation (Stephen France and Douglas Carroll) ..... 280

Support Vector Machines

Fast Local Support Vector Machines for Large Datasets (Nicola Segata and Enrico Blanzieri) ..... 295
The Effect of Domain Knowledge on Rule Extraction from Support Vector Machines (Nahla Barakat and Andrew P. Bradley) ..... 311
Towards B-Coloring of SOM (Haytham Elghazel and Khalid Benabdeslem) ..... 322
Clustering

CSBIterKmeans: A New Clustering Algorithm Based on Quantitative Assessment of the Clustering Quality (Tarek Smaoui, Sascha Müller, and Christian Müller-Schloer) ..... 337
Agent-Based Non-distributed and Distributed Clustering (Ireneusz Czarnowski and Piotr Jędrzejowicz) ..... 347
An Evidence Accumulation Approach to Constrained Clustering Combination (Daniel Duarte Abdala and Xiaoyi Jiang) ..... 361
Fast Spectral Clustering with Random Projection and Sampling (Tomoya Sakai and Atsushi Imiya) ..... 372
How Much True Structure Has Been Discovered?: Validating Explorative Clustering on a Hold-Out Test Set (F. Höppner) ..... 385
Efficient Clustering of Web-Derived Data Sets (Luís Sarmento, Alexander Kehlenbeck, Eugénio Oliveira, and Lyle Ungar) ..... 398
A Probabilistic Approach for Constrained Clustering with Topological Map (Khalid Benabdeslem and Jihene Snoussi) ..... 413

Novelty and Outlier Detection

Relational Frequent Patterns Mining for Novelty Detection from Data Streams (Michelangelo Ceci, Annalisa Appice, Corrado Loglisci, Costantina Caruso, Fabio Fumarola, Carmine Valente, and Donato Malerba) ..... 427
A Comparative Study of Outlier Detection Algorithms (Charlie Isaksson and Margaret H. Dunham) ..... 440
Outlier Detection with Explanation Facility (Manuel Mejía-Lavalle and Atlántida Sánchez Vivar) ..... 454
Learning

Concept Learning from (Very) Ambiguous Examples (Dominique Bouthinon, Henry Soldano, and Véronique Ventos) ..... 465
Finding Top-N Pseudo Formal Concepts with Core Intents (Yoshiaki Okubo and Makoto Haraguchi) ..... 479
On Fixed Convex Combinations of No-Regret Learners (Jan-P. Calliess) ..... 494
An Improved Tabu Search (ITS) Algorithm Based on Open Cover Theory for Global Extremums (Kemal Yüksek and Serhat Cakaloglu) ..... 505
The Needles-in-Haystack Problem (Katherine Moreland and Klaus Truemper) ..... 516

Data Mining on Multimedia Data

An Evidence-Driven Probabilistic Inference Framework for Semantic Image Understanding (Spiros Nikolopoulos, Georgios Th. Papadopoulos, Ioannis Kompatsiaris, and Ioannis Patras) ..... 525
Detection of Masses in Mammographic Images Using Simpson's Diversity Index in Circular Regions and SVM (André Pereira Nunes, Aristófanes Corrêa Silva, and Anselmo Cardoso de Paiva) ..... 540
Mining Lung Shape from X-Ray Images (Vassili Kovalev, Aliaksandr Prus, and Pavel Vankevich) ..... 554
A Wavelet-Based Method for Detecting Seismic Anomalies in Remote Sensing Satellite Data (Pan Xiong, Yaxin Bi, and Xuhui Shen) ..... 569
Spectrum Steganalysis of WAV Audio Streams (Qingzhong Liu, Andrew H. Sung, and Mengyu Qiao) ..... 582
Audio-Based Emotion Recognition in Judicial Domain: A Multilayer Support Vector Machines Approach (E. Fersini, E. Messina, G. Arosio, and F. Archetti) ..... 594
Learning with a Quadruped Chopstick Robot (Wei-Chung Lee, Jong-Chen Chen, Shou-zhe Wu, and Kuo-Ming Lin) ..... 603
Dissimilarity Based Vector Space Embedding of Graphs Using Prototype Reduction Schemes (Kaspar Riesen and Horst Bunke) ..... 617
Text Mining

Using Graph-Kernels to Represent Semantic Information in Text Classification (Teresa Gonçalves and Paulo Quaresma) ..... 632
A General Framework of Feature Selection for Text Categorization (Hongfang Jing, Bin Wang, Yahui Yang, and Yan Xu) ..... 647
New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps (Walaa K. Gad and Mohamed S. Kamel) ..... 663

Aspects of Data Mining

Learning Betting Tips from Users' Bet Selections (Erik Štrumbelj, Marko Robnik Šikonja, and Igor Kononenko) ..... 678
An Approach to Web-Scale Named-Entity Disambiguation (Luís Sarmento, Alexander Kehlenbeck, Eugénio Oliveira, and Lyle Ungar) ..... 689
A General Learning Method for Automatic Title Extraction from HTML Pages (Sahar Changuel, Nicolas Labroche, and Bernadette Bouchon-Meunier) ..... 704
Regional Pattern Discovery in Geo-referenced Datasets Using PCA (Oner Ulvi Celepcikay, Christoph F. Eick, and Carlos Ordonez) ..... 719
Memory-Based Modeling of Seasonality for Prediction of Climatic Time Series (Daniel Nikovski and Ganesan Ramachandran) ..... 734
A Neural Approach for SME's Credit Risk Analysis in Turkey (Gülnur Derelioğlu, Fikret Gürgen, and Nesrin Okay) ..... 749
Assisting Data Mining through Automated Planning (Fernando Fernández, Daniel Borrajo, Susana Fernández, and David Manzano) ..... 760
Predictions with Confidence in Applications (Mikhail Dashevskiy and Zhiyuan Luo) ..... 775
Data Mining in Medicine

Aligning Bayesian Network Classifiers with Medical Contexts (Linda C. van der Gaag, Silja Renooij, Ad Feelders, Arend de Groote, Marinus J.C. Eijkemans, Frank J. Broekmans, and Bart C.J.M. Fauser) ..... 787
Assessing the Eligibility of Kidney Transplant Donors (Francisco Reinaldo, Carlos Fernandes, Md. Anishur Rahman, Andreia Malucelli, and Rui Camacho) ..... 802
Lung Nodules Classification in CT Images Using Simpson's Index, Geometrical Measures and One-Class SVM (Cleriston Araujo da Silva, Aristófanes Corrêa Silva, Stelmo Magalhães Barros Netto, Anselmo Cardoso de Paiva, Geraldo Braz Junior, and Rodolfo Acatauassú Nunes) ..... 810

Author Index ..... 823
Improved Comprehensibility and Reliability of Explanations via Restricted Halfspace Discretization Klaus Truemper Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A. Abstract. A number of two-class classification methods first discretize each attribute of two given training sets and then construct a propositional DNF formula that evaluates to True for one of the two discretized training sets and to False for the other one. The formula is not just a classification tool but constitutes a useful explanation for the differences between the two underlying populations if it can be comprehended by humans and is reliable. This paper shows that comprehensibility as well as reliability of the formulas can sometimes be improved using a discretization scheme where linear combinations of a small number of attributes are discretized. Keywords: Discretization, Logic, Explanation, Comprehensibility, Reliability.
1
Introduction
Suppose we are to explain the differences between two populations A and B. In the well-known approach assumed here, we take two training sets A and B from the populations A and B, discretize the attributes of the training sets, and construct a propositional disjunctive normal form (DNF) formula that evaluates to True for one of the two discretized training sets and to False for the other one. Example methods carrying out these steps in various ways are described in [1,2,4,8,10,11,17,18,19,24,25,47].

The formula can be viewed as a useful explanation of the differences between the populations if (1) the formula can be comprehended and interpreted by humans, and (2) it reliably predicts membership in the two populations. This paper shows that a certain discretization, in which linear combinations of a small number of attributes are discretized, may help in the construction of comprehensible and reliable explanations. We first discuss the size and comprehensibility of formulas.
2
Size and Comprehensibility of Formulas
Human comprehension of data or statements is an extensively covered topic of Neurology and Psychology. One of the key concepts is chunk, defined in [20] as
a collection of concepts that are closely related and that have much weaker connections with other concurrently used concepts. The seminal paper [40] defines a “magical number seven, plus or minus two” of chunks as limit of short-term memory storage capacity. Subsequent work refined the main claim of [40]. For a detailed review, see [20], which argues for a “magical number 4” of chunks. In related research, [31] establishes a limit of 4 for the number of variables humans can process. In [32] an integrated treatment of working memory capacity and relational capacity is proposed that generalizes the above results. The reference concludes that working memory is limited to approximately 3-4 chunks, and that the number of variables involved in reasoning is limited to 4. We rely on these bounds in our quest for comprehensible explanations. 2.1
Formula Size
Let us define the formula size FS of the logic formulas of interest here. The formulas are always in disjunctive normal form (DNF). Such a formula is a disjunction of DNF clauses, which in turn are conjunctions of literals. In this paper, each literal is a linear inequality of the form (b^t · z > α) or (b^t · z < α), where z is a vector of attribute variables, b is a vector of constants, and α is a scalar. An example of a DNF formula is [(x < 5) ∧ (y > 7)] ∨ (z > 1) ∨ [(y < 1) ∧ (z < 0)], with DNF clauses [(x < 5) ∧ (y > 7)], (z > 1), and [(y < 1) ∧ (z < 0)].

Consider a literal (b^t · z > α) or (b^t · z < α) with k ≤ 4 nonzero entries bj in the vector b. In agreement with the conclusions of [32], we assume that humans can readily understand the interaction of the items represented by the k terms bj · zj with nonzero bj and convert that information into one chunk that allows further reasoning. For the case k ≤ 3, graphical representation of the corresponding halfspace in k-dimensional Euclidean space can assist in that task. We call the translation of (b^t · z > α) or (b^t · z < α) with k ≤ 4 nonzero bj into just one chunk elementary chunking. Any other chunking is considered non-elementary.

The formula size FS of a DNF formula where each literal is of the form (b^t · z > α) or (b^t · z < α) with k ≤ 4 nonzero bj is defined to be equal to the number of literals in the formula. Thus, it is equal to the number of chunks that can be derived by elementary chunking. We handle Boolean variables w in the above format by encoding the literals w and ¬w by the literals (x > 0.5) and (x < 0.5), respectively, using a {0, 1} variable x. Each such term contributes 1 to the FS value of the formula.

There are other ways to define formula size, using different notions of chunking. We have chosen the above definition mainly because elementary chunking is consistent with [32]. In the next section, we encounter example logic formulas that only contain literals of Boolean variables. As a matter of brevity and convenience, we skip translation to the format where literals are inequalities and use the formula directly. FS of such a formula is defined to be equal to the number of literals of the formula, which would be the FS value of the appropriately translated formula version.
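To make the definition concrete, the following sketch represents a DNF formula as nested Python lists and counts FS by counting literals. The representation and the function names are illustrative assumptions, not part of the method described in this paper.

```python
# Hypothetical representation: a DNF formula is a list of clauses; each clause is a
# list of literals; each literal is (b, alpha, sense), meaning b.z > alpha for
# sense '>' and b.z < alpha for sense '<'.

def formula_size(formula):
    """FS = number of literals; every literal must allow elementary chunking (<= 4 nonzero b_j)."""
    size = 0
    for clause in formula:
        for b, alpha, sense in clause:
            assert sum(1 for bj in b if bj != 0) <= 4, "literal is not elementarily chunkable"
            size += 1
    return size

def evaluate(formula, z):
    """A DNF formula is True if all literals of at least one clause hold at the point z."""
    def literal_holds(b, alpha, sense):
        value = sum(bj * zj for bj, zj in zip(b, z))
        return value > alpha if sense == '>' else value < alpha
    return any(all(literal_holds(*lit) for lit in clause) for clause in formula)

# The example formula from the text, with attribute order (x, y, z):
# [(x < 5) and (y > 7)] or (z > 1) or [(y < 1) and (z < 0)]
f = [
    [([1, 0, 0], 5, '<'), ([0, 1, 0], 7, '>')],
    [([0, 0, 1], 1, '>')],
    [([0, 1, 0], 1, '<'), ([0, 0, 1], 0, '<')],
]
print(formula_size(f))           # 5
print(evaluate(f, [4, 8, 0]))    # True (the first clause is satisfied)
```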
2.2
Comprehension Difficulty
The approach of this section is motivated by the experiments reported in [31]. We want a measure of difficulty of comprehension of DNF formulas when such formulas serve as explanations. We must first agree on what comprehension means. For our purposes, a reasonable definition is that the person who supposedly has comprehended a formula has indeed integrated the logic information implied by the formula into his/her relevant knowledge. For example, suppose a formula has the value True if a patient benefits from a certain medication, and has the value False otherwise. A physician has comprehended the formula if he/she has integrated the logic information into the treatment decision process.

We determine the comprehension difficulty CD of a formula S experimentally as follows. Suppose n persons are proficient in a knowledge domain X. Let S be a DNF formula that applies to X. We give each person the formula S and ask whether S is compatible with X. Mathematically, answering the question is equivalent to deciding if S ∧ X has a satisfying solution. Suppose that k persons answer the question correctly. If k ≤ n/2, then human decision making is essentially no better than a random choice of answers. Accordingly, we estimate that comprehension of S by humans is not possible, and define CD to be ∞. Assume that k > n/2, and let t be the average time required to find an answer to the question. Reasonably, we want CD to be proportional to t and inversely proportional to the fraction k/n − 0.5, which measures the improvement of accuracy over random choice. We also want CD to have the value t if the n answers are 100% accurate. The following definition of CD meets these goals.

    CD = t/(2k/n − 1)   if k/n > 0.5
    CD = ∞              otherwise                                   (1)

2.3
Prediction of Comprehension Difficulty
We may be tempted to predict CD using FS. The main result of this section is that such a prediction generally is not possible. Two examples below prove that claim. But later we also show that, under certain conditions, a reasonable prediction of likely comprehensibility can be made. In the first example, a formula S with large FS applies to a knowledge domain X such that CD is small. This result is not unexpected, since non-elementary chunking can have that effect. In the second example, a formula S with small FS applies to a knowledge domain X such that CD is large. This result can be expected for certain pathological situations. But the example shows that this situation can also occur in rather straightforward situations of S and X. Example 1. The domain X covers grading of one exam. For j = 1, 2, . . . , n, let the Boolean variable xj have the value True if question j of the exam is answered correctly, and the value False otherwise. The formula S is to have the value True if the entire exam has at least two questions answered incorrectly. Regardless of the specific encoding of this condition, the size FS of S must
grow at least linearly in n. But it is easy to check if S ∧ X is satisfiable, since non-elementary chunking reduces that check to trivial counting. Thus, CD is small even when FS is large.

Example 2. In [31] it is shown that understanding food preference questions can be very difficult for humans. The following rephrasing of the results of [31] produces an example of low FS and high CD. The questioning process of [31] concerns cakes in various forms and the related preferences by humans. We sketch the setting. There are four variables: process, flavor, type, and richness. Each variable can take on two values, as follows: process = fresh or frozen, flavor = chocolate or carrot, type = iced or plain, and richness = rich or lowfat. The knowledge domain X consists of bar graphs depicting the desirability of cakes with various features. For example, a value may be given for fresh cakes containing chocolate and covered with icing.

In each experiment, a person must evaluate a set of statements. The person is told that all statements of the set are correct except for the last one, which allows for two choices. The person must decide which of the two choices is the correct one, using the bar graphs of the knowledge domain X. There are four types of sets of statements, called 2×2-way, 3-way, 2×3-way, and 4-way. The simplest case is of type 2×2-way. The most difficult one is of type 4-way. For the sake of brevity, we discuss only those two cases and omit the intermediate 3-way and 2×3-way cases.

Here is an example set of statements for the 2×2-way case. "People prefer fresh to frozen cakes. The difference [in the rating] depends on the flavor (chocolate vs carrot). The difference between fresh and frozen is (greater/smaller) for chocolate cakes than for carrot cakes." The person must decide which of the two options "greater" and "smaller" in the last statement is consistent with the knowledge domain X.

In each set of statements of the 4-way cases, differences of rating differences are mentioned and must be evaluated. An example set of statements is as follows. "People prefer fresh to frozen cakes. The difference [in the rating] depends on the flavor (chocolate vs carrot), the type (iced vs plain), and the richness (rich vs lowfat). The difference between fresh and frozen increases from chocolate cakes to carrot cakes. This increase is greater for iced cakes than for plain cakes. There is a (greater/smaller) change in the size of the increase for rich cakes than for lowfat cakes."

The number of cakes with various features listed in the bar graphs depends on the case. Specifically, for the 2×2-way case, each instance of the knowledge domain X lists 4 different cakes and their ratings. For the 4-way case, that number is 16.

Let us rephrase the setting and task using formulas that represent the statements in a naive way. For example, let (difference 1(fresh, frozen) > 0) be the single literal of a formula that evaluates to True if for all cakes with identical features except for process, the rating for fresh is higher than for frozen. This formula encodes the first statement of both examples for the 2×2-way and 4-way
cases. In a similar fashion, each remaining statement can be encoded by just one literal that uses a new difference term. Exceptions are statements such as "The difference [in the rating] depends on the flavor (chocolate vs carrot)", which state a restriction that is implied by the bar graphs of the knowledge domain X and thus need not be translated. For the encoding of the last statement, one case of the two options is arbitrarily selected. An entire set of statements can thus be represented by a DNF formula S that consists of just one DNF clause whose literals encode the statements. FS is equal to 2 for the 2×2-way case and equal to 4 for the 4-way case. Thus, FS is small in both cases. Consistency of S with the knowledge domain X holds if and only if the encoding of the last statement uses the correct option.

Reference [31] includes results for a number of experiments involving various situations. The average solution times are displayed in a graph. Visual inspection determines the average solution time for the 2×2-way case to be 25 sec and for the 4-way case to be 74 sec. Correctness counts of answers are provided for pairs of experiments. The average correctness rate implied by the counts is 1.00 for the 2×2-way case and 0.65 for the 4-way case. Using (1), the corresponding CD values are 25 and 247, respectively. The table below shows FS and CD for the two cases.

    Formula Size and Comprehension Difficulty

    Case        FS    CD
    2×2-way      2     25
    4-way        4    247
Evidently, the increase of FS from 2 to 4 causes an almost 10-fold increase of CD from 25 to 247. Reference [31] sketches results of experiments for a 5-way case built from two 4-way cases. Since [31] does not provide a detailed description of the set of statements of the case, FS of the corresponding DNF formula cannot be determined. But extrapolating from the structure of the previous cases, we conclude that FS must still be small. Unfortunately, the average solution time is not provided, so CD cannot be computed. But an indication of the comprehension difficulty is the fact that the correctness rate k/n is 0.55, which is barely above the average performance of random choice. Thus, it seems fair to conclude that the 5-way case is close to or at the limit of human comprehension. A potential criticism of the above encoding of sets of statements in formulas S is that it only implicitly accounts for the crucial concept of rating differences. But that objection would not be raised if the knowledge domain X could be easily restated using the same literals. Indeed, the main point of the example is the following. If the knowledge domain X isn’t already expressed or cannot be easily restated in the terminology of the sets of statements or, equivalently, in terms of the literals of S, then chunking is difficult or impossible. Furthermore, in such cases the human brain has great difficulty processing more than just a few statements or formulas.
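The CD values cited above follow directly from definition (1); the short check below (plain Python, with the times and correctness rates taken from the measurements just reported) reproduces them.

```python
import math

def comprehension_difficulty(t, k, n):
    """Definition (1): average solution time t scaled by the inverse accuracy gain over chance."""
    if k / n <= 0.5:
        return math.inf
    return t / (2 * k / n - 1)

# 2x2-way case: t = 25 sec, correctness rate k/n = 1.00
print(round(comprehension_difficulty(25, 100, 100)))   # 25
# 4-way case:   t = 74 sec, correctness rate k/n = 0.65
print(round(comprehension_difficulty(74, 65, 100)))    # 247
```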
Despite the above examples, we would like to make some use of FS when estimating CD. We do this next. 2.4
Comprehensibility Condition
We say that a formula directly uses concepts of knowledge domain X if each literal directly corresponds to some concept of X. Suppose two formulas directly use concepts of X. In light of the preceding discussion and results, it is reasonable to assume that, in the absence of non-elementary chunking, the formula with smaller size FS is easier to understand. We summarize that assumption. Monotonicity Assumption. Let knowledge domain X and formulas that directly use concepts of X be given, and suppose that non-elementary chunking is not possible. Then for any two formulas, the one with smaller FS value is easier to understand. During the construction of a formula S, we do not know whether non-elementary chunking is possible when S is evaluated in light of a knowledge domain X. It seems prudent that we conservatively assume that such chunking is not possible. Under the Monotonicity Assumption, we then should strive for a formula S that directly uses concepts of X and has smallest size among the formulas with that feature. Indeed, in agreement with [32], we postulate the following condition for comprehensibility. Comprehensibility Condition. Let knowledge domain X and a formula S that directly uses concepts of X be given. If FS is at most 4, then the formula S likely is comprehensible. On the other hand, if FS is larger than 4 and nonelementary chunking is not possible, then comprehensibility of the formula S is doubtful.
3
Construction of Comprehensible Formulas
This section summarizes a method for the construction of formulas that likely are comprehensible according to Section 2.4. The method uses SV (single-variable) and RHS (restricted-half-space) discretization. SV discretization has the customary form; one or more cutpoints are defined for each variable, and logic variables represent the occurrence of values in the various intervals. Specifically, for each cutpoint α of a variable xj, we define a logic variable w_{j,α} that takes on the value True if xj > α, and the value False if xj < α. RHS discretization is an extension of SV discretization. Here, for a vector z of original variables, we select a cutpoint α and a vector b containing a few nonzero entries (we make this notion precise shortly), and define a logic variable w_{b,α} to take on the value True if b^t · z > α, and the value False if b^t · z < α.

We assume that a learning machine M is given that derives classifying formulas from discretized data. The construction process uses SV and RHS discretization and learning machine M in a stagewise process. In stage 0, the learning
machine M receives training data discretized by SV discretization and computes a formula as output. Stages 1, 2, . . . are like stage 0, except that SV discretization is replaced by an RHS discretization procedure that increases in complexity with the stage index. Thus, each stage produces an output formula. For stage p, p = 0, 1, . . . , denote by M_p the machine consisting of the relevant discretization procedure and the subsequently executed learning machine M. As we shall see in Section 5, M_p may produce formulas whose literals (b^t · z > α) or (b^t · z < α) have up to 2^p nonzero bj. Since we want comprehensible formulas, we limit 2^p to 4 in agreement with Section 2.1, and thus enforce p ≤ 2.

From the formulas obtained by the machines M_p, p ≤ 2, we want to select one that likely is best according to comprehensibility and reliability. In the long version of this paper, we describe how this can be done using the Monotonicity Assumption of Section 2.4 and the theory of VC dimension [48]. Here, we can only list the main conclusion: the selected formula should have high accuracy on the training data and small formula size. The next section reviews SV discretization.
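A minimal sketch of this stagewise driver is given below. The discretizers, the pairing step, and the learning machine M are treated as black-box callables; their names and signatures are assumptions for illustration, not the interfaces of the actual method.

```python
# Hypothetical driver for the stagewise construction of Section 3.
# sv_discretize, rhs_expand, and learner are placeholder callables.

def build_candidate_formulas(train_data, learner, sv_discretize, rhs_expand, max_stage=2):
    """Return one candidate formula per stage p = 0, 1, ..., max_stage (2^p <= 4 for p <= 2)."""
    candidates = []
    data = train_data
    for p in range(max_stage + 1):
        if p > 0:
            data = rhs_expand(data)        # add expansion variables from selected pairs
        discretized = sv_discretize(data)  # SV discretization of all current attributes
        formula = learner(discretized)     # learning machine M produces a DNF formula
        candidates.append((p, formula))
    return candidates
```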
4
Review of SV Discretization
In the early developments of discretization methods, entropy combined with the minimum description length principle ([22,23,45]) and other schemes using information gain were the most widely used methods, with strong performance with regard to prediction accuracy; see for example [3,4,21,35]. Recent research has produced a number of new discretization algorithms ([12,13,37,50,44]). Using a generalized entropy function, [33] unifies and generalizes results. The above references cover research on univariate discretization, where each variable is treated by itself. In multivariate discretization, several variables are considered simultaneously, and a discretization of each variable is determined that takes into account the interaction of the variable with other variables. This more elaborate effort is undertaken with the expectation that the subsequently applied learning method will discover classifiers that in some sense are more meaningful. Representative references are [9,14,15,26,41,43]. Formally, the output of multivariate discretization has the same structure as that of univariate discretization. That is, the output consists of cutpoints for the variables, and the variables are discretized according to these cutpoints. A fundamental problem induced by discretization is the fact that near any cutpoint of a variable xj a small random change of the xj value may change the encoding. There are several ways to deal with this difficulty. In the first approach, the discretization assigns several cutpoints to a given variable for a fine-grained representation. This solution has a hidden cost, in the form of more complex and thus less comprehensible formulas. In a second, fuzzy approach, membership functions are introduced that permit membership in several intervals; see for example [6]. Here, too, formulas can become complex and incomprehensible. A third approach introduces an uncertainty interval around each cutpoint; see for example [11], where the intervals are dynamically decided
during the learning process. The disadvantage of the third approach is that the learning method must be able to process discretized data where some entries are marked "unknown." RHS discretization relies on the third approach. Specifically, the width of the uncertainty interval enclosing a given cutpoint for a variable xj should be large enough that random changes of the xj value only rarely turn a value below the uncertainty interval into one above that interval, and vice versa. We define the average uncertainty width uj of variable xj to be the average of the widths of the uncertainty intervals enclosing the cutpoints of xj.
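As a concrete, if simplified, illustration of the ideas reviewed in this section, the toy function below picks a single cutpoint by minimizing class entropy and uses the gap between the two neighboring values as a crude uncertainty width. It is only a sketch; it is not the discretization algorithm of [7,42] or the dynamic uncertainty intervals of [11].

```python
from math import log2

def entropy(labels):
    """Class entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in (labels.count(l) for l in set(labels)))

def single_cutpoint(values, labels):
    """Return (cutpoint, uncertainty width): the midpoint between consecutive distinct
    values that minimizes the weighted class entropy of the induced two-way split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue
        cut = (lo + hi) / 2
        left = [l for v, l in pairs if v < cut]
        right = [l for v, l in pairs if v > cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, cut, hi - lo)   # the gap width serves as the uncertainty width
    if best is None:
        raise ValueError("need at least two distinct values")
    return best[1], best[2]
```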
5
RHS Discretization
We assume that we have (1) an SV discretization method that outputs, for each variable xj, an average uncertainty width uj in addition to the cutpoints, and (2) a feature selection method (see [30,36,39]) that outputs the value of a reasonable importance measure. The RHS discretization process is as follows. First, the scheme applies the assumed SV discretization to get importance values for the variables. Second, the scheme creates new variables that are linear combinations of variables. Third, the method uses SV discretization to obtain cutpoints for the new variables. Effectively, the cutpoints constitute thresholds of linear combinations of original variables.

We pause for a moment to point out that the use of linear combinations of variables followed by application of thresholds is central to well-known techniques of Machine Learning. For example, artificial neural nets (ANNs) use the two steps as a fundamental building block. As a second example, support vector machines (SVMs) use the two steps in a typically high-dimensional space into which the training data have been mapped. While ANNs and SVMs in general allow linear combinations of any number of variables, here we use just pairs of variables.

The rule for selecting the pairs depends on the stage of the overall process. For the moment, we skip discussion of that aspect and focus on the definition of new variables from one pair of variables, say involving xk and xl. For j = k, l, suppose SV discretization has created just one cutpoint cj for xj, and let uj denote the associated uncertainty width. Let R be the rectangle in the Euclidean plane that has the four points given by (ck ± uk/2, cl ± ul/2) as corner points. The discretization rules implied by the cutpoints and their uncertainty intervals assign for any point in R the value "unknown" to both coordinates. By the derivation of the cutpoints and uncertainty intervals, that assignment is based on the behavior of xk by itself and of xl by itself. Maybe analysis of the interaction of xk and xl would support a more precise discretization where R is replaced by a smaller region. There are several ways to achieve this goal. Here, we opt for the subdivision of the rectangle R induced by its two diagonals. We explore the effect of that subdivision rather indirectly, as follows.
For some constants α1 and α2, the lines passing through the two diagonals of R are given by

    xk/uk + xl/ul = α1
    xk/uk − xl/ul = α2                                              (2)
We could revise the discretization of xk and xl directly using these two lines. But more effective is the following approach, which momentarily considers α1 and α2 to be undecided, and which defines two variables y+ and y− by

    y+ = xk/uk + xl/ul
    y− = xk/uk − xl/ul                                              (3)
We add these two variables as new attributes to the training data and compute training values for them by inserting the training values for xk and xl into (3). Due to this expansion of the space of attributes, we call the variables y+ and y− expansion variables and refer to the enlarged training data as expanded training data. The subsequent use of the expansion variables is decided by SV discretization applied to the expanded training data. That step decides cutpoints for the expansion variables that generally may make better use, so to speak, of the expansion variables than selection of α1 and α2 of (2) as cutpoints. In the general case of several cutpoints for each of the variables xj, we use the average uncertainty width as uj in the definition (3) of y+ and y−.

In the construction process of Section 3, all variables on hand at the end of a stage, including all expansion variables created so far, are candidates for pairs producing additional expansion variables for the next stage. Since stage 0 carries out SV discretization, the variables available at the end of stage 0 are just the original variables. Thus, all expansion variables created in stage 1 are derived from original variables and, by (3), are defined by equations with two nonzero coefficients. Inductively, if the defining equations of the expansion variables created in stage p ≥ 1 are rewritten in terms of the original variables, then the resulting equations have at most 2^p nonzero coefficients. We restate this as follows. Let y be a new expansion variable of stage p, p ≥ 1. Suppose the literal (y < α) or (y > α) occurs in a formula learned from the training data on hand in stage p. In terms of the vector x of original variables, the variable y is defined by y = b^t · x, where b has at most 2^p nonzero entries, and any literal (y > α) or (y < α) can be rewritten as (b^t · x > α) or (b^t · x < α), respectively.

In Section 2.1, we constrained the literals (b^t · z > α) or (b^t · z < α) to b vectors with k ≤ 4 nonzero bj so that elementary chunking of the literal is possible. In the present setting, literals produced via RHS discretization observe that constraint if the number p of stages satisfies 2^p ≤ 4 and thus p ≤ 2. From now on, we impose that limit on p. Finally, we note that nominal attributes, where values are members of a nonnumerical set, do not participate in RHS discretization. The next section discusses an implementation.
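A small sketch of definition (3) follows: given a chosen pair of attributes and their average uncertainty widths, it appends the two expansion variables to a table of training data. The column-naming scheme and the dict-of-lists data layout are illustrative assumptions, not the program's actual data structures.

```python
# Sketch of definition (3). data maps attribute names to equal-length value lists;
# widths maps attribute names to their average uncertainty widths u_j.

def add_expansion_pair(data, widths, k, l):
    """Append the expansion variables y+ = x_k/u_k + x_l/u_l and y- = x_k/u_k - x_l/u_l."""
    uk, ul = widths[k], widths[l]
    plus = [xk / uk + xl / ul for xk, xl in zip(data[k], data[l])]
    minus = [xk / uk - xl / ul for xk, xl in zip(data[k], data[l])]
    data[f"y+({k},{l})"] = plus
    data[f"y-({k},{l})"] = minus
    return data
```

Any cutpoint α later chosen for such a column yields a literal that, rewritten in the original variables, is a halfspace of the form (b^t · x > α) or (b^t · x < α) with two nonzero coefficients.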
6
Implementation and Computational Results
We have added RHS discretization to an extension of the method of [46], which in turn is based on the discretization method of [7,42] and the Lsquare algorithm of [24,25]. We call the resulting method EXRHS (= EXplanations via RHS discretization). For present purposes, the details of EXRHS and the prior scheme of [46] it is based on are not important. Suffice it to say that the prior scheme computes uncertainty intervals as part of SV discretization and establishes importance values of variables in a feature selection subroutine. As a result, the addition of RHS discretization is rather straightforward.

We use data sets of the UC Irvine Machine Learning Repository for tests of EXRHS. The repository has a list of the 11 most popular data sets. Of these, 6 data sets are well suited for tests of RHS discretization since they involve no or almost no nominal attributes and allow easy formulation of two-class classification problems. In order of popularity, the selected 6 data sets are Iris, Wine, Breast Cancer Wisconsin, Abalone, Yeast, and SPECT Heart. For each test, we split the given data set in a 50/50 ratio into training and testing data and then apply EXRHS. The 50/50 choice provides in all cases reasonable training and testing sets for evaluation of the effect of RHS discretization.

The Iris data set has a total of 150 records of three classes represented by 50 records each. The goal is to explain, for each class, the difference between that class and the remaining two classes. Thus, there are three cases: Iris-1 of class 1 versus classes 2, 3; Iris-2 of class 2 versus classes 1, 3; and Iris-3 of class 3 versus classes 1, 2.

The Wine data set has a total of 178 records covering three classes of wines with 59, 71, and 48 records, respectively. The goal is to explain, for each class, the difference between that class and the remaining two classes. Thus, there are three cases Wine-1, Wine-2, and Wine-3, analogous to the Iris data set.

The Breast Cancer Wisconsin data set has 569 records covering benign and malignant breast tumor cases. The goal is to explain the difference between the two diagnoses.

The Abalone data set has a total of 4,177 records. Each record represents an abalone instance. The goal is to decide the number of rings of the instance using various measurements. The number of rings ranges from 1 to 29. In [16], a 3-class classification problem is defined for the Abalone data set analogously to the Iris data set, except that here class 1 has the records with 1-8 rings, class 2 has those with 9 or 10 rings, and class 3 contains those with 11-29 rings. The three test cases are called Abalone-1, Abalone-2, and Abalone-3.

The Yeast data set contains 1,484 records. The cellular localization sites of proteins are to be determined. There are ten classes. For the tests, we select the three most frequently occurring classes, called CYT, NUC, and MIT. For each case, the class is to be separated from the remaining cases. This produces three test cases Yeast-1, Yeast-2, and Yeast-3.

The SPECT Heart data set has 267 records. In contrast to the above data sets, all attributes are binary. The goal is prediction of a {0, 1} DIAGNOSIS variable. The original data set is given by training and testing sets of somewhat odd proportions. That is, the training data have 40 records each for the two DIAGNOSIS cases, but the testing data have just 15 cases for DIAGNOSIS = 0 and 172 cases for DIAGNOSIS = 1. For consistency
with the above tests, we concatenate the original training and testing sets and then split the resulting set 50/50 into training and testing sets for use by EXRHS.

Looking over the data sets, it seems reasonable to claim that the attributes directly represent concepts of the underlying knowledge domain. Thus, it is also reasonable to invoke the Monotonicity Assumption and Comprehensibility Condition, so that formulas with size FS ≤ 4 likely are comprehensible. On the other hand, if non-elementary chunking cannot be done, then comprehensibility of formulas with size FS > 4 is doubtful.

In all tests, EXRHS runs with the default rules and settings, and no selection or tuning of parameters is used. Suppose the output formulas of stages 0, 1, and 2 have been computed. We do not use application-dependent rules to select the final formula from the output formulas, since we have no guidance for selection of such rules. In agreement with the earlier discussion, we aim for a final formula that likely is comprehensible by eliminating all output formulas whose size exceeds 4, then selecting the formula with highest training accuracy from the remaining formulas. In case of a tie, the formula with lower stage index is chosen; a small sketch of this selection rule is given after Table 1.

In Table 1, the results for the final formulas are summarized in the three columns under the heading Best of Size ≤ 4; the column FS has the formula size, and the remaining two columns have the training and testing accuracy. We contrast the final formulas with the output formulas of Stage 0 in three analogous columns under the heading Stage 0 Output. These output formulas are constructed using SV discretization, and no restriction is imposed on their size. The columns under Best of Size ≤ 4 in Table 1 show that EXRHS finds likely-to-be-comprehensible explanations in 12 out of 14 (= 86%) of the cases.

    Table 1. Summary for Formulas

                 Best of Size ≤ 4        Stage 0 Output
                      Accuracy (%)            Accuracy (%)
    Case        FS    Train   Test      FS    Train   Test
    Iris-1       1      98     97        1      98     97
    Iris-2       3      95     91        3      90     85
    Iris-3       1      95     98        1      92     98
    Wine-1       1      97     95        1      93     88
    Wine-2       2      94     92        1      92     89
    Wine-3       2     100     98        2     100     98
    Breast       2      95     93       10      97     92
    Abal.-1      2      80     79        5      81     79
    Abal.-2      4      64     63       13      64     63
    Abal.-3      1      70     71        6      72     72
    Yeast-1     no size ≤ 4             17      61     60
    Yeast-2     no size ≤ 4             10      68     64
    Yeast-3      1      73     75       12      76     78
    SPECT        2      81     72        2      81     72
    Average    1.8    86.8   85.3      4.8    86.3   84.3

    The results for Yeast-1 and Yeast-2 are not included in the averages.
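The selection rule stated above can be written down in a few lines; the sketch below assumes each candidate is summarized by its stage, formula size, and training accuracy, which is an illustrative data layout rather than the program's actual one.

```python
def select_final_formula(candidates):
    """candidates: list of (stage, formula_size, train_accuracy, formula).
    Drop formulas of size > 4, pick the highest training accuracy,
    and break ties by the lower stage index. Returns None if nothing qualifies."""
    small = [c for c in candidates if c[1] <= 4]
    if not small:
        return None          # e.g., the Yeast-1 and Yeast-2 cases in Table 1
    return max(small, key=lambda c: (c[2], -c[0]))
```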
The formula size FS is equal to 1 or 2 with two exceptions, where the size is 3 or 4. The average formula size is 1.8. It turns out that RHS discretization is used for 9 out of the 12 (= 75%) likely-to-be-comprehensible explanations. The 9 cases are split into 3 (= 33%) cases computed in stage 1, while the remaining 6 cases (= 67%) are determined in stage 2.

Of the 14 formulas listed under Stage 0 Output, only 7 cases (= 50%) have formula size ≤ 4 and are deemed likely comprehensible. Unless non-elementary chunking is possible, comprehensibility of the remaining 7 formulas is doubtful. The formula size FS ranges from 1 to 17. The average formula size is 4.8 if the formulas produced by Yeast-1 and Yeast-2, for which Best of Size ≤ 4 has no counterpart, are ignored.

The accuracy averages at the bottom of Table 1 show that the improvement in comprehensibility due to RHS discretization is accompanied by a small average training accuracy gain of 86.8 − 86.3 = 0.5% and a small average testing accuracy gain of 85.3 − 84.3 = 1.0%. Thus, RHS discretization produces not only likely-to-be-comprehensible formulas for 86% of the cases, but does so with a small training and testing accuracy gain rather than a loss.

Test results reported for prior schemes that compute logic formulas often cite the number of computed rules. Since the formula size of a rule is at least 1, the number of rules of an explanation is a lower bound on the formula size of the explanation. That lower bound can be quite large. For example, some papers list 12, 26, 19, 42, 77, 27-109, average 18, average 6, average 16, ≥ 18 rules for two-class classification problems handled by various methods. Maybe RHS discretization can help to reduce those numbers.

Table 2 provides total execution times for EXRHS in min:sec, based on runs on a 3.0GHz PC. When three cases are listed together on a line, the stated execution time is the total run time for those cases. In addition, column Attr. gives the number of attributes, not counting the class attribute, and column Train Recds has the number of training records.

    Table 2. Execution Times of EXRHS

    Case          Attr.   Train Recds   Exec. Time (min:sec)
    Iris(1-3)       4          75              0:01
    Wine(1-3)      13          89              0:02
    Breast         30         285              0:03
    Abal.(1-3)      8       2,089              1:57
    Yeast(1-3)      8         742              0:18
    SPECT          22         134              0:01

7
Extension
EXRHS has been extended to carry out Subgroup Discovery. The scheme is somewhat different from typical current Subgroup Discovery approaches [34,49],
which often use an iterative investigation involving the steps of data mining, interpretation of results, and evaluation by experts; see for example [5,27,28,29,38]. In contrast, EXRHS typically determines a few highly significant subgroups without manual guidance. Each subgroup is characterized by a convex polyhedron that generally isn’t just an axis-parallel rectangle.
8
Summary
Using prior studies in Neurology and Psychology, the paper links comprehensibility of explanations given by logic formulas to the size of the formulas. That is, if a formula is not too large in a certain sense, then under certain assumptions we can expect it to be understood by humans. On the other hand, if formula size exceeds a certain bound, then under certain assumptions it may well be the case that the formula cannot be comprehended by humans. The paper introduces a restricted form of discretization called RHS discretization that can be computed with the aid of any number of prior discretization and feature selection methods. In computational tests using well-known data sets, it is shown that RHS discretization produces likely-to-be-comprehensible explanations for 86% of the cases. In contrast, when the traditional discretization approach via cutpoints is used, only 50% of the explanations have that feature. The improvement of comprehensibility is accompanied by a small average gain in accuracy instead of a loss. The methodology has been extended to cover Subgroup Discovery.
References 1. Abidi, S., Hoe, K.: Symbolic exposition of medical data-sets: A data mining workbench to inductively derive data-defining symbolic rules. In: Proceedings of the 15th IEEE Symposium on Computer-based Medical Systems (CBMS 2002) (2002) 2. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (1993) 3. An, A.: Learning classification rules from data. Computers and Mathematics with Applications 45, 737–748 (2003) 4. An, A., Cercone, N.: Discretization of continuous attributes for learning classification rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS, vol. 1574, pp. 509–514. Springer, Heidelberg (1999) 5. Atzmueller, M., Puppe, F., Buscher, H.-P.: Subgroup mining for interactive knowledge refinement. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds.) AIME 2005. LNCS, vol. 3581, pp. 453–462. Springer, Heidelberg (2005) 6. Au, W.-H., Chan, K.C.C., Wong, A.K.C.: A fuzzy approach to partitioning continuous attributes for classification. IEEE Transactions on Knowledge and Data Engineering 18, 715–719 (2006) 7. Bartnikowski, S., Granberry, M., Mugan, J., Truemper, K.: Transformation of rational and set data to logic data. In: Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, Heidelberg (2006)
8. Bay, S., Pazzani, M.: Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery 5, 213–246 (2001) 9. Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (2000) 10. Boros, E., Hammer, P., Ibaraki, T., Kogan, A.: A logical analysis of numerical data. Mathematical Programming 79, 163–190 (1997) 11. Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering 12, 292–306 (2000) 12. Boull´e, M.: Khiops: A statistical discretization method of continuous attributes. Machine Learning 55, 53–69 (2004) 13. Boull´e, M.: MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning 65, 131–165 (2006) 14. Chao, S., Li, Y.: Multivariate interdependent discretization for continuous attribute. In: Proceedings of the Third International Conference on Information Technology and Applications (ICITA 2005)(2005) 15. Chmielewski, M.R., Grzymala-Busse, J.W.: Global discretization of continuous attributes as preprocessing for machine learning. International Journal of Approximate Reasoning 15, 319–331 (1996) 16. Clark, D., Schreter, Z., Adams, A.: A quantitative comparison of dystal and backpropagation. In: Proceedings of Seventh Australian Conference on Neural Networks (ACNN 1996) (1996) 17. Clark, P., Boswell, R.: Rule induction with CN2: Some recent improvements. In: Proceedings Fifth European Working Session on Learning (1991) 18. Cohen, W.W.: Fast effective rule induction. In: Machine Learning: Proceedings of the Twelfth International Conference (1995) 19. Cohen, W.W., Singer, Y.: A simple, fast, and effective rule learner. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (1999) 20. Cowan, N.: The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences 24, 87–185 (2001) 21. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Machine Learning: Proceedings of the Twelfth International Conference (1995) 22. Fayyad, U., Irani, K.: On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8, 87–102 (1992) 23. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (1993) 24. Felici, G., Sun, F., Truemper, K.: Learning logic formulas and related error distributions. In: Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, Heidelberg (2006) 25. Felici, G., Truemper, K.: A MINSAT approach for learning in logic domain. INFORMS Journal of Computing 14, 20–36 (2002) 26. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning Bayesian networks. In: International Conference on Machine Learning (1996) 27. Gamberger, D., Lavraˇc, N.: Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research 17, 501–527 (2002) 28. Gamberger, D., Lavraˇc, N., Krstaˇcic, G.: Active subgroup mining: a case study in coronary heart disease risk group detection. Artificial Intelligence in Medicine 28 (2003)
Improved Comprehensibility and Reliability of Explanations
15
ˇ 29. Gamberger, D., Lavraˇc, N., Zelezn´ y, F., Tolar, J.: Induction of comprehensible models for gene expression datasets by subgroup discovery methodology. Journal of Biomedical Informatics 37 (2004) 30. Guyon, I., Elisseef, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003) 31. Halford, G.S., Baker, R., McCredden, J.E., Bain, J.D.: How many variables can humans process? Psychological Science 16, 70–76 (2005) 32. Halford, G.S., Cowan, N., Andrews, G.: Separating cognitive capacity from knowledge: a new hypothesis. Trends in Cognitive Sciences 11, 236–242 (2007) 33. Jin, R., Breitbart, Y., Muoh, C.: Data discretization unification. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2007) (2007) 34. Kl¨ osgen, W.: EXPLORA: A multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park (1996) 35. Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous features. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996) 36. Koller, D., Sahami, M.: Toward optimal feature selection. In: International Conference on Machine Learning (1996) 37. Kurgan, L.A., Cios, K.J.: CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering 16, 145–153 (2004) 38. Lavraˇc, N., Cestnik, B., Gamberger, D., Flach, P.: Decision support through subgroup discovery: Three case studies and the lessons learned. Machine Learning 57, 115–143 (2004) 39. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17, 491– 502 (2005) 40. Miller, G.A.: The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 81–97 (1956) 41. Monti, S., Cooper, G.F.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Fourteenth Conference of Uncertainty in AI (1998) 42. Mugan, J., Truemper, K.: Discretization of rational data. In: Proceedings of MML 2004 (Mathematical Methods for Learning). IGI Publishing Group (2007) 43. Muhlenbach, F., Rakotomalala, R.: Multivariate supervised discretization, a neighborhood graph approach. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2002) (2002) 44. Perner, P., Trautzsch, S.: Multi-interval discretization for decision tree learning. In: Advances in Pattern Recognition. Springer, Heidelberg (2004) 45. Quinlan, J.: Induction of decision trees. Machine Learning 1, 81–106 (1986) 46. Riehl, K.: Data Mining Logic Explanations from Numerical Data. PhD thesis, Department of Computer Science, University of Texas at Dallas (2006) 47. Triantaphyllou, E.: Data Mining and Knowledge Discovery via a Novel Logic-based Approach. Springer, Heidelberg (2008) 48. Vapnik, V., Levin, E., Cun, Y.L.: Measuring the VC-dimension of a learning machine. International Journal of Human Computer Systems 6, 851–876 (2008) 49. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Proceedings of First European Conference on Principles of Data Mining and Knowledge Discovery (1997) 50. Yang, Y., Webb, G.I.: Weighted proportional k-interval discretization for NaiveBayes classifiers. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD 2003. LNCS, vol. 2637. Springer, Heidelberg (2003)
Selection of Subsets of Ordered Features in Machine Learning

O. Seredin 1, A. Kopylov 1, and V. Mottl 2

1 Tula State University, 300600, Tula, pr. Lenina, 92, Russia
[email protected], [email protected]
2 Computing Centre of the Russian Academy of Science, 117967, Moscow, Vavilova str., 40, Russia
[email protected]
Abstract. A new approach to relevant feature selection in machine learning is proposed for the case of ordered features. Feature selection and regularization of the decision rule are combined in a single procedure. The selection of features is realized by introducing weight coefficients that characterize the degree of relevance of the respective features. A priori information about the feature ordering is taken into account in the form of a quadratic penalty or an absolute-value penalty on the differences of the weight coefficients of neighboring features. The study of the absolute-value penalty reveals the computational complexity of this formulation, and an effective solution method is proposed. A brief survey of the authors' earlier papers, the mathematical framework, and experimental results are provided. Keywords: machine learning, feature selection, ordered features, regularization of training, support vector machines, parametric dynamic programming.
1 Introduction

The pattern recognition problem in the presence of a large number of features (compared with the training set size) is known as the "curse of dimensionality". There are two standard approaches to tackling this case: imposing a priori restrictions (regularization of the decision rule), or reducing the dimensionality by selecting the most informative features. This paper proposes an approach that joins these two techniques. We consider the selection of informative features in the pattern recognition problem in the case where the features are ordered. Feature ordering is typical for tasks of learning from signals and images; only one-dimensional ordering is considered in this work. Indeed, most feature selection techniques treat the feature vector as an unordered set of numbers, and many methods even assume that the features are independent. However, there exist a number of tasks where the features are consecutive measurements along the axis of some argument, for example, observations of a signal along the time axis, components of a spectrum, etc. In previous articles the authors already proposed methods of decision rule regularization and methods of feature selection. In the early papers [1,2], a method of
regularization that takes into account a priori information about feature interrelation was described. At the same time, research on combining modalities in data mining was being developed, which in fact yielded an effective technique for informative feature selection [3-5]. Taking a priori information about the one-dimensional ordering of features into account directly in a selection method requires modified procedures. Such an attempt was made in [6], where the model of feature interrelation was represented as a quadratic penalty on the difference between the informativeness weights of neighboring features. In this paper we investigate a new penalty criterion in the form of the absolute value function. It should be noted that the method of potential functions is chosen as the theoretical background for the suggested algorithms. The reason for this choice is the high popularity of the method as the basis for support vector machine learning [7]. The paper has the following structure: the second section briefly recalls the idea of support vector machine learning; the third section describes an effective feature selection procedure that disregards feature relationships; the fourth section focuses on the idea of regularizing learning for the case of ordered features; the next three sections are devoted to algorithms of regularized order-based feature selection; experimental results are presented in the eighth section.
2 SVM – The Basis for Modifications

We will develop methods of learning that respect structural relations between features by inserting additional penalties into existing well-known criteria. A regularization coefficient incorporated into the model will define the balance between the "classical" approach and the regularization based on a priori information. Let $(\mathbf{x}_j, g_j)$, $j = 1,\dots,N$, be a training set, where $\mathbf{x} = (x_i,\ i = 1,\dots,n) \in \mathbb{R}^n$ is the real-valued feature vector of a recognition object, $g \in \{\pm 1\}$ is the class index, $\mathbf{a} = (a_i,\ i = 1,\dots,n) \in \mathbb{R}^n$ is the directional vector of the optimal separating hyperplane, and $b \in \mathbb{R}$ is its shift, defined as the solution of the well-known criterion [7]:
$$\sum_{i=1}^{n} a_i^2 + C \sum_{j=1}^{N} \delta_j \to \min(a_1,\dots,a_n, b, \delta_1,\dots,\delta_N), \qquad g_j\Big(\sum_{i=1}^{n} a_i x_{ij} + b\Big) \ge 1 - \delta_j,\quad \delta_j \ge 0,\quad j = 1,\dots,N. \quad (1)$$
Here the non-negative parameter $C$ and the auxiliary variables $\delta_j$, $j = 1,\dots,N$, are introduced for the case of linear non-separability of the objects of the two classes. Usually problem (1) is solved in the dual form:
$$\sum_{j=1}^{N} \lambda_j - \frac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{N}\Big(g_j g_k \sum_{i=1}^{n} x_{ij} x_{ik}\Big)\lambda_j \lambda_k \to \max(\lambda_1,\dots,\lambda_N), \qquad \sum_{j=1}^{N}\lambda_j g_j = 0,\quad 0 \le \lambda_j \le C/2,\quad j = 1,\dots,N, \quad (2)$$
as a quadratic programming problem with respect to the non-negative Lagrange multipliers $\lambda_j$, $j = 1,\dots,N$. The relation between the coefficients of the directional vector of the optimal separating hyperplane and the Lagrange multipliers is
$$a_i = \sum_{j:\,\lambda_j > 0} g_j \lambda_j x_{ij}, \quad i = 1,\dots,n. \quad (3)$$
The simplicity of the algorithmic and numerical realization of this method, the explicit reference to the so-called support elements in the training set (only they, in fact, determine the separating hyperplane), and good experimental results have made this formulation of the pattern recognition problem the most popular in recent times. These are the reasons why this criterion serves as the basis for constructing our method of feature selection in the case of feature ordering or interrelation. It is necessary to note that formulation (1) is the simplest, so to speak academic, member of the family of criteria joined under the common title of support vector machines. In this form the solution is based on inner products between the feature vectors of objects, and the decision rule is linear in the initial feature space. There exist formulations with other types of kernels and other kinds of penalties for the non-separable case. The literature also discusses the relationship of the SVM to the method of potential functions [3] and to methods of featureless pattern recognition [5]. For clarity of reasoning we rely on the canonical formulation of the problem (1).
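For concreteness, the following sketch solves the dual problem (2) on a small synthetic data set with SciPy's SLSQP solver and recovers the hyperplane coefficients via (3); the solver choice, the toy data, and the tolerances are our own assumptions, not part of the original method.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# toy two-class data: N objects with n features
X = np.vstack([rng.normal(-1.0, 0.4, size=(10, 2)),
               rng.normal(+1.0, 0.4, size=(10, 2))])
g = np.hstack([-np.ones(10), np.ones(10)])
N, C = len(g), 10.0

# matrix of the quadratic form in (2): g_j g_k <x_j, x_k>
K = (g[:, None] * g[None, :]) * (X @ X.T)

neg_dual = lambda lam: 0.5 * lam @ K @ lam - lam.sum()   # maximizing (2) = minimizing its negative
neg_grad = lambda lam: K @ lam - 1.0
constraints = [{'type': 'eq', 'fun': lambda lam: lam @ g, 'jac': lambda lam: g}]
res = minimize(neg_dual, np.zeros(N), jac=neg_grad, method='SLSQP',
               bounds=[(0.0, C / 2.0)] * N, constraints=constraints)
lam = res.x

# hyperplane coefficients via (3); shift b from unbounded support vectors
a = (lam * g) @ X
sv = (lam > 1e-6) & (lam < C / 2.0 - 1e-6)
if not sv.any():
    sv = lam > 1e-6
b = np.median(g[sv] - X[sv] @ a)
print("training accuracy:", np.mean(np.sign(X @ a + b) == g))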
3 Feature Selection Based on Combining Potential Functions

The method of combining potential functions (or kernels) in featureless pattern recognition and regression estimation was published in [3,4]. It has been shown that this technique can be transferred to the case where objects are represented by their features and can be efficiently applied as a non-iterative informative feature selection. Non-negative weights $r_i \ge 0$, $i = 1,\dots,n$, each corresponding to a component of the directional vector of the sought separating hyperplane, are incorporated into the "classical" Vapnik SVM criterion (1) as proposed in [8]:
$$\sum_{i=1}^{n}\left[\frac{a_i^2 + 1/\mu}{r_i} + \left(\frac{1}{\mu} + 1 + \mu\right)\ln r_i\right] + C\sum_{j=1}^{N}\delta_j \to \min(r_1,\dots,r_n, a_1,\dots,a_n, b, \delta_1,\dots,\delta_N), \quad (4)$$
where $\mu$ is a non-negative selectivity parameter [9]. We propose to minimize criterion (4) by the Gauss–Seidel method, separating the variables into two groups: first $a_i$, $i = 1,\dots,n$, $b$, $\delta_j \ge 0$, $j = 1,\dots,N$, and second $r_i \ge 0$, $i = 1,\dots,n$, and performing step-by-step minimization of the criterion over one group of variables with the other group fixed. While the coefficients $r_i$, $i = 1,\dots,n$, are fixed, the initial problem actually reduces to the classical support vector training problem. The dual form of the criterion with respect to the non-negative Lagrange multipliers $\lambda_j$, $j = 1,\dots,N$, almost coincides with criterion (2) of the SVM:
$$\sum_{j=1}^{N}\lambda_j - \frac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{N}\left(g_j g_k \sum_{i=1}^{n} r_i x_{ij} x_{ik}\right)\lambda_j\lambda_k \to \max(\lambda_1,\dots,\lambda_N). \quad (5)$$
The difference between the last criterion and the "classical" formulation is the presence of the additional coefficients $r_i$, $i = 1,\dots,n$, in the matrix of the quadratic form. The restrictions remain unchanged, while the coefficients of the directional vector of the separating hyperplane are calculated by the rule $a_i = r_i \sum_{j=1}^{N} g_j \lambda_j x_{ij}$, $i = 1,\dots,n$. While the parameters $a_i$, $i = 1,\dots,n$, $b$, $\delta_j \ge 0$, $j = 1,\dots,N$, are fixed, the calculation of the weight coefficients is very simple:
$$r_i = \frac{a_i^2 + 1/\mu}{1/\mu + 1 + \mu}, \quad i = 1,\dots,n. \quad (6)$$
The stopping rule of the iterative learning process can be defined, for example, by the convergence of the sequences $r_i$, $i = 1,\dots,n$: $\frac{1}{n}\sum_{i=1}^{n}\left|r_i^{\,step+1} - r_i^{\,step}\right| < \varepsilon$, $\varepsilon > 0$. It is necessary to stress that by introducing the notion of an "informative feature" we do not have in mind an actual information-theoretic characteristic of the feature, as in the Akaike information criterion or the Shannon entropy criterion. We only suggest that within the whole set of measurable features there exist subsets of features which are adequate to the given data analysis task. As synonyms of "informative feature" one may also use the terms "adequate feature" or "relevant feature".
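A minimal sketch of the weight-update half of this alternation is given below: it applies formula (6) and the stopping rule to directional-vector coefficients that are assumed to come from the preceding dual step; the numeric values of mu, eps and a are placeholders.

import numpy as np

mu = 1.0          # selectivity parameter (assumed value)
eps = 1e-4        # stopping threshold

# directional-vector coefficients from the preceding dual step (toy values here)
a = np.array([0.05, 0.8, 0.75, 0.02, 0.01])

def update_weights(a, mu):
    # formula (6): r_i = (a_i^2 + 1/mu) / (1/mu + 1 + mu)
    return (a ** 2 + 1.0 / mu) / (1.0 / mu + 1.0 + mu)

def converged(r_new, r_old, eps):
    # stopping rule: (1/n) * sum_i |r_i^{step+1} - r_i^{step}| < eps
    return np.mean(np.abs(r_new - r_old)) < eps

r_old = np.ones_like(a)
r_new = update_weights(a, mu)
print(r_new, converged(r_new, r_old, eps))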
4 The Regularization of Signal Recognition: The Principle of Decision Rule Smoothness

In earlier work [1] an approach to decision rule regularization was proposed that takes into account a priori information about the ordering of features. To make the learning process prefer decision rules with smoothly changing coefficients of the directional vector of the separating hyperplane, we propose to add to criterion (1) a quadratic penalty on the differences of neighboring components:
$$\sum_{i=1}^{n} a_i^2 + \alpha\sum_{i=2}^{n}(a_i - a_{i-1})^2 + C\sum_{j=1}^{N}\delta_j \to \min(a_1,\dots,a_n, b, \delta_1,\dots,\delta_N). \quad (7)$$
Here and below the coefficient $\alpha \ge 0$ defines the weight of the penalty on the unsmoothness of the ordered coefficients of the sought optimal separating hyperplane. From the computational point of view both the primal and the dual tasks remain quadratic. The difference from the classical criterion is the additional term $J'(\mathbf{a}) = \sum_{i=2}^{n}(a_i - a_{i-1})^2$. Such a quadratic function can be written as $J'(\mathbf{a}) = \mathbf{a}^T B\,\mathbf{a}$, where the $(n \times n)$ matrix $B$ has the form
$$B = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 & 0\\ -1 & 2 & -1 & \cdots & 0 & 0 & 0\\ 0 & -1 & 2 & \cdots & 0 & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 2 & -1 & 0\\ 0 & 0 & 0 & \cdots & -1 & 2 & -1\\ 0 & 0 & 0 & \cdots & 0 & -1 & 1 \end{pmatrix}.$$
Therefore, it is more convenient to write the objective function of the problem of finding the parameters of the optimal separating hyperplane (7) in vector form:
$$\mathbf{a}^T(I + \alpha B)\mathbf{a} + C\sum_{j=1}^{N}\delta_j \to \min(\mathbf{a}, b, \delta_1,\dots,\delta_N),$$
under the same restrictions. The matrix $B$ is positive semidefinite, so $I + \alpha B$ is positive definite and invertible. The criterion in the dual form also does not change, except that the matrix of the quadratic form is slightly corrected:
$$\sum_{j=1}^{N}\lambda_j - \frac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{N}\left(g_j g_k\, \mathbf{x}_j^T (I + \alpha B)^{-1}\mathbf{x}_k\right)\lambda_j\lambda_k \to \max(\lambda_1,\dots,\lambda_N), \qquad \sum_{j=1}^{N}\lambda_j g_j = 0,\quad 0 \le \lambda_j \le C/2,\quad j = 1,\dots,N.$$
The directional vector of the optimal separating hyperplane is then calculated by the formula $\mathbf{a} = (I + \alpha B)^{-1}\sum_{j:\,\lambda_j > 0}\lambda_j g_j \mathbf{x}_j$.
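The sketch below builds the tridiagonal matrix B, the modified matrix of inner products $x_j^T(I+\alpha B)^{-1}x_k$ that enters the dual, and the smoothed directional vector; the dual variables lam are arbitrary placeholders here and would in practice come from the quadratic program, and the toy data are our own.

import numpy as np

def smoothness_matrix(n):
    # tridiagonal matrix B of Section 4 (path-graph Laplacian)
    B = np.zeros((n, n))
    for i in range(n - 1):
        B[i, i] += 1.0
        B[i + 1, i + 1] += 1.0
        B[i, i + 1] -= 1.0
        B[i + 1, i] -= 1.0
    return B

rng = np.random.default_rng(0)
n, N = 6, 8
X = rng.normal(size=(N, n))                       # rows are training vectors x_j
g = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
alpha = 5.0

B = smoothness_matrix(n)
M = np.linalg.inv(np.eye(n) + alpha * B)

# modified matrix of inner products x_j^T (I + alpha B)^{-1} x_k used in the dual
K_mod = X @ M @ X.T

# given dual variables lambda_j, recover the smoothed directional vector
lam = np.full(N, 0.1)                             # placeholder values
a = M @ ((lam * g) @ X)
print(a)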
5 Selection of Subsets of Ordered Features

The feature selection procedure described in the third section does not take into account the specificity of signal and image analysis tasks. The classical formulation of the pattern recognition problem supposes that the objects of recognition are represented by their features, and the order in which the features were recorded does not matter. Roughly speaking, if somebody reorders the components of the feature vectors, the result of building the decision rule or the result of feature selection will not change. But we draw attention to the following fact: for specific objects like signals and images the peculiarity of their registration, namely the neighborhood of observations (samples/pixels), can be taken into account. Imposing such restrictions is called regularization (sometimes stabilization) of decision rules of recognition. How to take such structural restrictions on the directional vector into account was shown in Section 4. In this section we demonstrate how to join these two techniques: feature selection and the assumption that the set of features contains more and less informative groups. Let us modify criterion (4) by adding an extra penalty on the differences of neighboring weight coefficients $r_i \ge 0$, $i = 1,\dots,n$, under the previous constraints:
$$\sum_{i=1}^{n}\left[\frac{a_i^2 + 1/\mu}{r_i} + \left(\mu + 1 + \frac{1}{\mu}\right)\ln r_i\right] + \alpha\sum_{i=2}^{n} f(r_i, r_{i-1}) + C\sum_{j=1}^{N}\delta_j \to \min(r_1,\dots,r_n, a_1,\dots,a_n, b, \delta_1,\dots,\delta_N). \quad (8)$$
We propose to minimize criterion (8) by the Gauss–Seidel method, separating the variables into two groups: first $a_i$, $i = 1,\dots,n$, $b$, $\delta_j \ge 0$, $j = 1,\dots,N$, and second $r_i \ge 0$, $i = 1,\dots,n$, and performing step-by-step minimization of the criterion over one group of variables with the other group fixed. It is easy to verify that if the coefficients $r_i$, $i = 1,\dots,n$, are fixed, the solution in the dual form coincides with task (5). However, finding the informativeness weights is no longer as simple as (6). For the search of the weight coefficients at each step of the coordinate-wise optimization it is necessary to find the minimum of the following criterion (here, for brevity, we introduce the notation $c_i = a_i^2 + 1/\mu$, $i = 1,\dots,n$, and $d = \mu + 1 + 1/\mu$, and recall that at this substep of the iterative procedure the values $a_i$, $i = 1,\dots,n$, are already found and fixed):
$$\sum_{i=1}^{n}\left[\frac{c_i}{r_i} + d\ln r_i\right] + \alpha\sum_{i=2}^{n} f(r_i, r_{i-1}) \to \min(r_1,\dots,r_n). \quad (9)$$
In the next two sections we consider different penalties on the differences between the weight coefficients associated with neighboring ordered features of a recognition object (for example, a signal), namely penalties in the form of a quadratic function and in the form of the absolute value function. It is necessary to note that in this approach the a priori information about feature ordering imposes restrictions on the weight coefficients of feature informativeness, and not on the components of the directional vector of the separating hyperplane as, for example, in [2].
6 Feature Subset Selection Taking into Account the Quadratic Difference between Neighboring Weight Coefficients

In this section we consider the situation where the penalty function is quadratic: $f(r_i, r_{i-1}) = \dfrac{(r_i - r_{i-1})^2}{r_i r_{i-1}}$, $i = 2,\dots,n$. In this case criterion (9) turns into
$$\sum_{i=1}^{n}\left[\frac{c_i}{r_i} + d\ln r_i\right] + \alpha\sum_{i=2}^{n}\frac{(r_i - r_{i-1})^2}{r_i r_{i-1}} \to \min(r_1,\dots,r_n). \quad (10)$$
The search for the minimum of (10) reduces to the solution of a system of nonlinear equations for the parameters $r_i$, $i = 1,\dots,n$:
$$\begin{cases} -\dfrac{c_1}{r_1} + d + \alpha\left(\dfrac{r_1}{r_2} - \dfrac{r_2}{r_1}\right) = 0,\\[1mm] -\dfrac{c_i}{r_i} + d + \alpha\left(-\dfrac{r_{i-1} + r_{i+1}}{r_i} + \dfrac{r_i}{r_{i-1}} + \dfrac{r_i}{r_{i+1}}\right) = 0,\quad i = 2,\dots,n-1,\\[1mm] -\dfrac{c_n}{r_n} + d + \alpha\left(-\dfrac{r_{n-1}}{r_n} + \dfrac{r_n}{r_{n-1}}\right) = 0. \end{cases} \quad (11)$$
Each equation in this system includes only 2-3 unknown variables. The method of simple iterations can be used to solve the problem.
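One possible numerical realization is sketched below; instead of a hand-derived fixed-point map it performs coordinate-wise minimization of criterion (10), sweeping over the variables until the weights stop changing. The bounds, tolerances and sweep count are arbitrary choices, not values from the paper.

import numpy as np
from scipy.optimize import minimize_scalar

def solve_weights_quadratic_penalty(c, d, alpha, n_sweeps=200, r_min=1e-6, r_max=1e6):
    # coordinate-wise minimization of criterion (10): a numerical stand-in
    # for the "simple iterations" on system (11)
    n = len(c)
    r = np.ones(n)

    def local_obj(x, i):
        val = c[i] / x + d * np.log(x)
        if i > 0:
            val += alpha * (x - r[i - 1]) ** 2 / (x * r[i - 1])
        if i < n - 1:
            val += alpha * (r[i + 1] - x) ** 2 / (r[i + 1] * x)
        return val

    for _ in range(n_sweeps):
        r_prev = r.copy()
        for i in range(n):
            res = minimize_scalar(local_obj, args=(i,), bounds=(r_min, r_max), method='bounded')
            r[i] = res.x
        if np.mean(np.abs(r - r_prev)) < 1e-8:
            break
    return r

mu = 1.0
a = np.array([0.05, 0.8, 0.75, 0.02, 0.01])
c = a ** 2 + 1.0 / mu
d = mu + 1.0 + 1.0 / mu
print(solve_weights_quadratic_penalty(c, d, alpha=2.0))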
7 Feature Subset Selection Taking into Account the Absolute Value of the Difference between Neighboring Weight Coefficients

Numerous experiments have shown that taking the interrelation between features into account in the form of a quadratic penalty "dilutes" the informative subarea in the space of the ordered features. To avoid this disadvantage, we decided to use the absolute value function as the penalty on the difference of the weight coefficients:
$$\sum_{i=1}^{n}\left[\frac{c_i}{r_i} + d\ln r_i\right] + \alpha\sum_{i=2}^{n}\left|\ln r_i - \ln r_{i-1}\right| \to \min(r_1,\dots,r_n). \quad (12)$$
The search algorithm for the optimal values of the coefficients of the directional vector remains the same, but the minimization of the criterion with respect to the weight factors $r_i \ge 0$, $i = 1,\dots,n$, represents a new problem. Let us substitute variables $u_i = \ln r_i$, $i = 1,\dots,n$; then criterion (12) can be rewritten in the following form:
$$\sum_{i=1}^{n}\left[c_i e^{-u_i} + d u_i\right] + \alpha\sum_{i=2}^{n}\left|u_i - u_{i-1}\right| \to \min(u_1,\dots,u_n). \quad (13)$$
The objective function in (13) is pairwise separable, that is, it is represented by a sum of functions of not more than two variables. Let us denote the functions of one variable in criterion (13) by $\psi_i(u_i) = c_i e^{-u_i} + d u_i$, and the functions of two variables by $\gamma_i(u_{i-1}, u_i) = \alpha\,|u_i - u_{i-1}|$. Then the objective function in criterion (13) takes the form
$$J(u_1, u_2,\dots, u_n) = \sum_{i=1}^{n}\psi_i(u_i) + \sum_{i=2}^{n}\gamma_i(u_{i-1}, u_i). \quad (14)$$
For the minimization of objective function (14) we take advantage of a procedure based on the principle of dynamic programming [11]. The procedure in this case is based on a recurrent decomposition of the initial problem of optimizing a function of $n$ variables into a succession of $n$ elementary problems, each of which consists in optimizing a function of only one variable. The elementary functions of one variable $\tilde{J}_i(u_i)$ to be minimized at each step of the minimization of the separable function are called Bellman functions here, as in the classical dynamic programming procedure. The procedure finds the global minimum of a pairwise separable function in two passes, first in the forward direction and then in the backward direction. On the forward pass the Bellman functions are determined in accordance with the forward recurrent relation
$$\tilde{J}_i(u_i) = \psi_i(u_i) + \min_{u_{i-1}}\left[\gamma_i(u_{i-1}, u_i) + \tilde{J}_{i-1}(u_{i-1})\right],\quad i = 2,\dots,n, \qquad \tilde{J}_1(u_1) = \psi_1(u_1). \quad (15)$$
The last Bellman function $\tilde{J}_n(u_n)$ directly shows how the minimum value of the criterion as a whole depends on the value of the variable $u_n$, and therefore its optimal value can be found as $\hat{u}_n = \arg\min_{u_n}\tilde{J}_n(u_n)$. The other elements of the sought solution, $\hat{u}_i$, $i = n-1,\dots,1$, can be found by means of the backward recurrent relation
$$\tilde{u}_{i-1}(u_i) = \arg\min_{u_{i-1}}\left[\gamma_i(u_{i-1}, u_i) + \tilde{J}_{i-1}(u_{i-1})\right], \quad (16)$$
which is the inverted form of the forward recurrent relation (15). The application of this relation on the backward pass is straightforward:
$$\hat{u}_{i-1} = \tilde{u}_{i-1}(\hat{u}_i),\quad i = n, n-1,\dots,2. \quad (17)$$
Thus, regardless of the form of the functions $\psi_i(u_i)$ and $\gamma_i(u_{i-1}, u_i)$ in the pairwise separable objective function, the dynamic programming algorithm finds the point of its global minimum, if such a combination of values of the variables exists within the area of their variation, in a known number of operations proportional to the number of variables. In the case of continuous variables, e.g. $u_i \in \mathbb{R}$, a numerical realization of the dynamic programming procedure is possible only if there exists a finitely parameterized function family $\tilde{J}(u, \mathbf{q})$ concordant with the node functions $\psi_i(u_i)$ and the edge functions $\gamma_i(u_{i-1}, u_i)$ in the sense that the Bellman functions $\tilde{J}_i(u_i)$ belong to this family at each step. In this case, the forward pass of the procedure consists in a recurrent re-evaluation of the parameters $\mathbf{q}_i$ that completely represent the Bellman functions $\tilde{J}_i(u_i) = \tilde{J}(u_i, \mathbf{q}_i)$. In particular, as is shown in [11], if the node and edge functions are quadratic, the Bellman functions will be quadratic too. The parametric representation is also possible in the case of using the absolute value of the difference of adjacent variables instead of quadratic edge functions [12].
It can be proven that in this case, if the node functions $\psi_i(u_i)$ and edge functions $\gamma_i(u_{i-1}, u_i)$ are convex, all the Bellman functions are also convex. As shown in [12], if the function $\gamma_i(u_{i-1}, u_i)$ in the objective function (14) has the form $\gamma_i(u_{i-1}, u_i) = \alpha\,|u_i - u_{i-1}|$, $\alpha > 0$, and the Bellman function $\tilde{J}_{i-1}(u_{i-1})$ is a convex function that is everywhere differentiable in its range of definition, the backward recurrent relation $\tilde{u}_{i-1}(u_i)$ has the following form:
$$\tilde{u}_{i-1}(u_i) = \begin{cases} u_{i-1}^{-\alpha}, & u_i \le u_{i-1}^{-\alpha},\\ u_i, & u_{i-1}^{-\alpha} < u_i < u_{i-1}^{\alpha},\\ u_{i-1}^{\alpha}, & u_i \ge u_{i-1}^{\alpha}, \end{cases} \quad (18)$$
where $u_{i-1}^{-\alpha}$ and $u_{i-1}^{\alpha}$ are obtained as the solutions of the equations $\frac{d}{du_{i-1}}\big[\tilde{J}_{i-1}(u_{i-1}^{-\alpha})\big] = -\alpha$ and $\frac{d}{du_{i-1}}\big[\tilde{J}_{i-1}(u_{i-1}^{\alpha})\big] = \alpha$, respectively. One can easily see that the evaluated value of $u_{i-1}$ is completely defined by the value of the adjacent variable $u_i$ only in the range $u_{i-1}^{-\alpha} < u_i < u_{i-1}^{\alpha}$, and is independent of it on the rest of the value area of the variable $u_i$. It is exactly this fact that gives the procedure the ability to preserve abrupt changes of the parameters and, accordingly, not to "dilute" the informative subarea in the space of the ordered features. When the parameters $c_i$ and $d$ are non-negative, the functions $\psi_i(u_i) = c_i e^{-u_i} + d u_i$ are convex, and there exists a parameterized family of Bellman functions concordant with the node functions $\psi_i(u_i)$ and the edge functions $\gamma_i(u_{i-1}, u_i)$. This fact gives us the possibility to apply the non-iterative procedure described above:

1. $\tilde{J}'_1(u_1) = -c_1\exp(-u_1) + d$, $\quad u_1^{-\alpha} = -\ln[(d+\alpha)/c_1]$, $\quad u_1^{\alpha} = -\ln[(d-\alpha)/c_1]$.

2. For $i = 2,\dots,n$:
$$\tilde{J}'_i(u_i) = -c_i\exp(-u_i) + d + \begin{cases} -\alpha, & u_i \le u_{i-1}^{-\alpha},\\ \tilde{J}'_{i-1}(u_i), & u_{i-1}^{-\alpha} < u_i < u_{i-1}^{\alpha},\\ \alpha, & u_i \ge u_{i-1}^{\alpha}; \end{cases}$$
$u_i^{\alpha}:\ \tilde{J}'_i(u_i^{\alpha}) = \alpha$, $\quad u_i^{-\alpha}:\ \tilde{J}'_i(u_i^{-\alpha}) = -\alpha$.

3. $\hat{u}_n:\ \tilde{J}'_n(\hat{u}_n) = 0$.

4. For $i = n, n-1,\dots,2$:
$$\hat{u}_{i-1} = \begin{cases} u_{i-1}^{-\alpha}, & \hat{u}_i \le u_{i-1}^{-\alpha},\\ \hat{u}_i, & u_{i-1}^{-\alpha} < \hat{u}_i < u_{i-1}^{\alpha},\\ u_{i-1}^{\alpha}, & \hat{u}_i \ge u_{i-1}^{\alpha}. \end{cases}$$
In accordance with step 2 of the above algorithm, the derivatives of the Bellman functions are composed of fragments of functions of the form $q_k\exp(-u) + p_k$, where $k$ is the number of the fragment. Therefore the boundaries of the fragments, as well as the parameters $q_k$ and $p_k$ of each fragment $k$, constitute the parameters of the Bellman function derivatives. The leftmost boundary of the fragments coincides with $u_{i-1}^{-\alpha}$ and the rightmost boundary coincides with $u_{i-1}^{\alpha}$. For the parametric representation of such functions it is necessary to keep in memory the boundary points of all intervals of the derivatives of the Bellman functions.
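For readers who want to experiment without implementing the parametric representation, the sketch below minimizes criterion (13) by dynamic programming on a fixed grid of u-values; this brute-force variant discretizes the variables and is only a stand-in for the exact parametric procedure described above, with the grid range and all numeric values chosen arbitrarily.

import numpy as np

def minimize_pairwise_separable(c, d, alpha, grid):
    # grid-based DP for criterion (13):
    # sum_i [c_i e^{-u_i} + d u_i] + alpha * sum_i |u_i - u_{i-1}| -> min
    n, m = len(c), len(grid)
    psi = c[:, None] * np.exp(-grid)[None, :] + d * grid[None, :]   # node functions
    gamma = alpha * np.abs(grid[:, None] - grid[None, :])           # edge penalties

    J = psi[0].copy()                        # Bellman function on the grid
    back = np.zeros((n, m), dtype=int)       # argmin indices for the backward pass
    for i in range(1, n):
        total = gamma + J[:, None]           # total[k, l] = gamma(u_k, u_l) + J_{i-1}(u_k)
        back[i] = total.argmin(axis=0)
        J = psi[i] + total.min(axis=0)

    idx = np.empty(n, dtype=int)
    idx[-1] = J.argmin()
    for i in range(n - 1, 0, -1):
        idx[i - 1] = back[i, idx[i]]
    return grid[idx]

mu, alpha = 1.0, 2.0
a = np.array([0.05, 0.8, 0.75, 0.02, 0.01])
c = a ** 2 + 1.0 / mu
d = mu + 1.0 + 1.0 / mu
u = minimize_pairwise_separable(c, d, alpha, grid=np.linspace(-5.0, 3.0, 400))
r = np.exp(u)   # back-substitution u_i = ln r_i
print(r)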
8 Experimental Results

For the experimental study of the proposed algorithms, test data were generated as described below. Two classes of recognition objects were distributed near two centers. The center of the first class is a sequence of 100 artificial observations with values equal to zero. The center of the second class differs from the first one on the interval from the 70th to the 80th sample, where the second signal has values of 0.4 instead of 0 (Fig. 1).
Fig. 1. Centers of first and second classes (left) and examples of recognition objects (right)
The objects of the entire assembly were generated from the two centers by adding uniform noise with zero mean in the range (−0.8, +0.8) (Fig. 1). The total number of objects in the entire assembly was 6000 (up to 3000 in each class). One hundred training sets with 20, 40, 60, 80, 100 and 200 objects were picked out randomly; the remaining objects were used as the test set. Examples of the weight coefficient values $r_i$, $i = 1,\dots,100$, for the SVM, "pure" feature selection, and selection of feature subsets with penalties in the form of the quadratic function and the absolute value function are shown in Fig. 2. Experimental results in the form of the average error rate on the test sets for different sizes of the training sets (20-200 objects) are shown in Fig. 3. It is clear that adding the regularization based on the search for an informative subarea in the feature space (criterion (10) – dotted line, criterion (12) – dashed line) leads to better predictive properties of the decision rule than the standard SVM (solid line).
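A sketch of this data generation, under the assumption that "samples 70–80" means the 1-based positions 70 through 80 and that the noise is uniform on (−0.8, 0.8), might look as follows.

import numpy as np

def generate_signals(m_per_class=3000, n=100, seed=0):
    # synthetic data of Section 8: two class centres of length 100, the second
    # centre equal to 0.4 on samples 70-80, uniform noise in (-0.8, 0.8)
    rng = np.random.default_rng(seed)
    centre1 = np.zeros(n)
    centre2 = np.zeros(n)
    centre2[69:80] = 0.4                      # samples 70..80 (1-based)
    X1 = centre1 + rng.uniform(-0.8, 0.8, size=(m_per_class, n))
    X2 = centre2 + rng.uniform(-0.8, 0.8, size=(m_per_class, n))
    X = np.vstack([X1, X2])
    g = np.hstack([-np.ones(m_per_class), np.ones(m_per_class)])
    return X, g

X, g = generate_signals()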
Fig. 2. Examples of weight coefficient values (from the top): SVM (1), SVM with feature selection (4), SVM and feature subset selection taking into account the quadratic difference of neighboring weight coefficients (10), SVM and feature subset selection taking into account the absolute value difference of neighboring weight coefficients (12)
Fig. 3. Experimental results (error rate on test set vs. number of objects in training set)
The choice of optimal value for the depth of regularization α remains an open question. We used the procedure of cross validation to find the best value.
9 Conclusion

The article shows a way to combine relevant feature selection and restrictions on such selection, reasonable for the task being solved, in one criterion. A summarizing review of our earlier publications, which actually led to the proposed idea, has been given. The basic idea of the proposed approach is to formalize the use of the one-dimensional ordering of features, which is typical for signal analysis tasks. The criterion is constructed and a scheme for its numerical optimization is offered. It remains necessary to provide a comprehensive analysis of the behavior of the proposed algorithm for the selection of subsets of relevant features in experiments on both simulated and real data. It also seems reasonable to extend the methodology to the case of two-dimensional ordering, which is especially important for image analysis tasks. The work is supported by grants of the Russian Foundation for Basic Research 08-01-00695, 08-01-99003, 08-07-90700, 09-07-00394.
References 1. Seredin, O.S., Dvoenko, S.D., Krasotkina, O.V., Mottl, V.V.: Machine Learning for Signal Recognition by the Criterion of Decision Rule Smoothness. Pattern Recognition and Image Analysis 11(1), 87–90 (2001) 2. Seredin, O., Mottl, V.: Regularization in Image Recognition: the Principle of Decision Rule Smoothing. In: Proceedings of the Ninth International Conference Pattern Recognition and Information Processing, Minsk, Belarus, vol. II, pp. 151–155 (2007) 3. Mottl, V.V., Seredin, O.S., Krasotkina, O.V., Muchnik, I.B.: Fusing of potential functions in reconstructing dependences from empirical data. Doklady Mathematics 71(2), 315–319 (2005); From Doklady Akademii Nauk 401(5), 607–612 (2005) 4. Mottl, V.V., Seredin, O.S., Krasotkina, O.V., Muchnik, I.B.: Principles of multi-kernel data mining. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 52–61. Springer, Heidelberg (2005) 5. Mottl, V.V., Seredin, O.S., Dvoenko, S.D., Kulikowski, C.A., Muchnik, I.B.: Featureless pattern recognition in an imaginary Hilbert space. In: Proceedings of 16th International Conference Pattern Recognition, ICPR 2002, Quebec City, Canada, August 2002, vol. II, pp. 88–91 (2002) 6. Seredin, O., Mottl, V.: The selection of informative interrelated features in pattern recognition. Tavricheskiy Vestnik Informatiki i Matematiki 2, 180–185 (2008) (in Russian) 7. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998) 8. Mottl, V., Tatarchuk, A., Sulimova, V., Krasotkina, O., Seredin, O.: Combining Pattern Recognition Modalities at the Sensor Level Via Kernel Fusion. In: Proceedings of 7th International Workshop Multiple Classifiers Systems, Prague, Czech Republic, pp. 1–12 (2007) 9. Tatarchuk, A., Mottl, V., Eliseyev, A., Windridge, D.: Selectivity Supervision in Combining Pattern-Recognition Modalities by Feature- and Kernel-Selective Support Vector Machines. In: Proceedings of the 19th International Conference on Pattern Recognition, Tampa, Florida, USA, December 7-11 (2008) 10. Seredin, O., Kopylov, A., Mottl, V., Pryimak, A.: Selection of subsets of interrelated features in pattern recognition problem. In: Proceedings of 9th International Conference on Pattern Recognition and Image Analysis: New Information Technologies, Nizhni Novgorod, vol. 2, pp. 151–154 (2008) 11. Mottl, V., Kopylov, A., Blinov, A., Kostin, A.: Optimization techniques on pixel neighborhood graphs for image processing. In: Graph-Based Representations in Pattern Recognition Computing, Supplement 12, pp. 135–145. Springer, Wien (1998)
12. Kopylov, A.V.: Parametric dynamic programming procedures for edge preserving in signal and image smoothing. In: Proceedings of the 7th International Conference on Pattern Recognition and Image Analysis, St. Petersburg, October 18-23, vol. I, pp. 281–284 (2004)
13. Kopylov, A.V.: Dynamic programming for edge-preserving smoothing in signal and image analysis and pattern recognition with interrelated features. In: Proceedings of 9th International Conference on Pattern Recognition and Image Analysis: New Information Technologies, Nizhni Novgorod, vol. 1, pp. 325–328 (2008)
Combination of Vector Quantization and Visualization

Olga Kurasova 1,2 and Alma Molytė 1

1 Institute of Mathematics and Informatics, Akademijos str., 08663 Vilnius, Lithuania
[email protected], [email protected]
2 Vilnius Pedagogical University, Studentu str., 08106 Vilnius, Lithuania
Abstract. In this paper, we present a comparative analysis of a combination of two vector quantization methods (self-organizing map and neural gas), based on a neural network and multidimensional scaling that is used for visualization of codebook vectors obtained by vector quantization methods. The dependence of computing time on the number of neurons, the ratio between the number of neuron-winners and that of all neurons, quantization and mapping qualities, and preserving of a data structure in the mapping image are investigated.
1 Introduction
Any set of objects may often be characterized by some features x1, x2, . . . , xn. A combination of values of all features characterizes a particular object Xi = (xi1, xi2, . . . , xin) from the whole set X = {X1, X2, . . . , Xm}, where n is the number of features and m is the number of analyzed objects. As the number of features is often more than two, we deal with multidimensional data. So X1, X2, . . . , Xm are n-dimensional vectors. Often they are interpreted as points in the n-dimensional space Rn, where n defines the dimensionality of the space. In fact, we have a matrix of numerical data X = {X1, X2, . . . , Xm} = {xij, i = 1, . . . , m, j = 1, . . . , n}. The rows of this matrix are vectors Xi = (xi1, xi2, . . . , xin), i = 1, . . . , m, where xij is the jth component of the ith vector. The data matrix can be analyzed by various statistical methods. However, especially when the amount of data is huge, the statistical methods are often not sufficient. In order to get more knowledge from the analyzed data, it is necessary to use data mining methods. A lot of data mining methods have been developed for multidimensional data analysis: classification, clustering, visualization, etc. In this paper, we analyze clustering and visualization of multidimensional data. Clustering is useful for reducing the amount of data. Vector quantization is performed in clustering methods. Visual data mining aims at integrating a human in the data analysis process, applying human perceptual capabilities to the analysis of large data sets available in today's computer systems. When combining clustering and visualization it is possible to get more knowledge than by the methods used individually.
2 Vector Quantization and Visualization
Vector quantization is a classical signal-approximation method that usually forms a quantized approximation to the distribution of the input data vectors Xl ∈ Rn, l = 1, . . . , m, using a finite number of so-called codebook vectors Mi ∈ Rn, i = 1, . . . , N. Once the codebook is chosen, the approximation of Xl, l = 1, . . . , m, means finding the codebook vector Mi closest to Xl, usually in the Euclidean metric [1]. Vector quantization is used for data compression, missing data correction, etc. It can be used for data clustering, too. In that case, the codebook vectors are representatives of clusters. Some methods for vector quantization are based on neural networks: the self-organizing map (SOM) [1], the neural gas (NG) [2], and learning vector quantization [1]. Here the neurons correspond to the codebook vectors. The self-organizing map (SOM) is a class of neural networks that are trained in an unsupervised manner using competitive learning [1]. It is a well-known method for vector quantization. Moreover, the SOM is used for mapping a high-dimensional space onto a low-dimensional one. The neural gas is a biologically inspired adaptive algorithm, proposed in [2]. It sorts the input signals according to how far away they are. Learning vector quantization is a supervised method for data classification; it is not analyzed in this paper. It is purposeful to visualize the codebook vectors obtained after quantization in order to get more knowledge from the analyzed data set. A large class of methods has been developed for multidimensional data visualization [3], [4]. The visual presentation of the data allows us to see the data structure, clusters, outliers, and other properties of multidimensional data. In this paper, one of the most popular methods of visualization, i.e., multidimensional scaling [5], is used for visualization of the codebook vectors.

2.1 Self-Organizing Map and Neural Gas
An array of vectors (codebook) M is formed in both the neural gas network and the self-organizing map. Here the codebook vectors are often called neurons. The array M is one-dimensional in the neural gas, M = {M1, M2, . . . , MN}, where Mi = (mi1, mi2, . . . , min), i = 1, . . . , N, and N is the number of codebook vectors. A self-organizing map is a two-dimensional grid. Usually, the neurons are connected to each other via a rectangular or hexagonal topology. The rectangular SOM is a two-dimensional array of neurons M = {Mij, i = 1, . . . , kx, j = 1, . . . , ky}, where $M_{ij} = (m_1^{ij}, m_2^{ij}, \dots, m_n^{ij})$, kx is the number of rows, ky is the number of columns, and the total number of neurons is N = kx × ky. The goal of these quantization methods is to change the values of the codebook vectors (neurons) so that they represent the properties of the analyzed vectors Xl, l = 1, . . . , m. At the end of learning, the codebook vectors become quantized vectors of the vectors Xl. The neural gas algorithm is as follows:
1. Selecting initial values:
– the number of codebook vectors (neurons) N;
– values of the parameters λi, λf, Ei, Ef used in the learning rule;
– the number of training epochs ê (each input vector Xl is passed to the network ê times; the number of training steps is tmax = ê × m);
– initial values of the codebook vectors Mi = (mi1, mi2, . . . , min), i = 1, . . . , N; as usual they are initialized at random, $m_{ik} \in (-0.5\cdot 10^{-5},\ 0.5\cdot 10^{-5})$, k = 1, . . . , n.
2. Network learning:
FOR t = 0 TO tmax
FOR l = 1 TO m // at each learning step, the input vector Xl is passed to the neural network
The Euclidean distances from the input vector Xl to each codebook vector Mi, i = 1, . . . , N, are calculated. The obtained distances $\|M_1 - X_l\|, \|M_2 - X_l\|, \dots, \|M_N - X_l\|$ are sorted in ascending order. A neuron set W1, W2, . . . , WN is obtained, Wk ∈ {M1, M2, . . . , MN}, k = 1, . . . , N, such that $\|W_1 - X_l\| \le \|W_2 - X_l\| \le \dots \le \|W_N - X_l\|$. The distance between Xl and W1 is minimal; the neuron W1 is called a winner. The neuron (vector) Wk, k = 1, . . . , N, is adapted according to the learning rule
$$W_k(t+1) = W_k(t) + E(t)\cdot h_{\lambda}\cdot (X_l - W_k(t)),$$
where $E(t) = E_i\,(E_f/E_i)^{t/t_{max}}$, $h_{\lambda} = e^{-(k-1)/\lambda(t)}$, $\lambda(t) = \lambda_i\,(\lambda_f/\lambda_i)^{t/t_{max}}$.
END // end of passing of the input vectors
END // end of learning
The self-organizing map algorithm is as follows:
1. Selecting initial values:
– the number of rows kx and the number of columns ky (the total number of codebook vectors is N = kx × ky);
– the number of training epochs ê (each input vector Xl is passed to the network ê times; the number of training steps is tmax = ê × m);
– initial values of the codebook vectors $M_{ij} = (m_1^{ij}, m_2^{ij}, \dots, m_n^{ij})$, i = 1, . . . , kx, j = 1, . . . , ky; as usual they are initialized at random, $m_k^{ij} \in (0, 1)$, k = 1, . . . , n.
2. Network learning:
FOR t = 0 TO tmax
FOR l = 1 TO m // at each learning step, the input vector Xl is passed to the neural network
The Euclidean distances from the input vector Xl to each codebook vector Mij are calculated. The vector $\hat M_c$ with the minimal Euclidean distance to Xl is designated as a winner, where $c = \arg\min_{i,j}\|X_l - M_{ij}\|$. The neuron (vector) Mij is adapted
according to the learning rule
$$M_{ij}(t+1) = M_{ij}(t) + h_{cij}(t)\,(X_l - M_{ij}(t)),$$
where $h_{cij}$ is the so-called neighbourhood function, $h_{cij}(t) \to 0$ as $t \to \infty$.
END // end of passing of the input vectors
END // end of learning
When the networks have been trained, the quantization error EQE is calculated by the formula
$$E_{QE} = \frac{1}{m}\sum_{l=1}^{m}\left\|X_l - \hat M_c\right\|, \quad (1)$$
where $\hat M_c$ is the winner for the vector Xl; $\hat M_c = W_1$ in the neural gas method.
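A compact sketch of the neural gas training loop and the quantization error (1) is given below; the schedule parameters, the per-vector advancement of the step counter t, and the toy data are our own assumptions rather than the settings used in the experiments.

import numpy as np

def train_neural_gas(X, N, epochs=40, lam_i=10.0, lam_f=0.01, E_i=0.5, E_f=0.005, seed=0):
    # neural gas training loop of Section 2.1
    rng = np.random.default_rng(seed)
    m, n = X.shape
    M = rng.uniform(-0.5e-5, 0.5e-5, size=(N, n))     # codebook vectors
    t_max = epochs * m
    t = 0
    for _ in range(epochs):
        for l in rng.permutation(m):
            frac = t / t_max
            lam = lam_i * (lam_f / lam_i) ** frac
            E = E_i * (E_f / E_i) ** frac
            dist = np.linalg.norm(M - X[l], axis=1)
            ranks = np.argsort(np.argsort(dist))       # k-1 for the k-th closest neuron
            h = np.exp(-ranks / lam)
            M += (E * h)[:, None] * (X[l] - M)         # learning rule for all W_k
            t += 1
    return M

def quantization_error(X, M):
    # formula (1): mean distance from each X_l to its winner
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(1, 0.1, (50, 3))])
M = train_neural_gas(X, N=10)
print(quantization_error(X, M))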
2.2 Multidimensional Scaling
Multidimensional scaling (MDS) refers to a group of methods that are widely used for dimensionality reduction and visualization (mapping into 2D or 3D space) of multidimensional data [5]. The starting point of MDS is a matrix consisting of pairwise proximities of the data. In general, the proximities need not be distances in the mathematically strict sense. Let us take the vectors Xi = (xi1, xi2, ..., xin), i = 1, . . . , m, whose dimensionality is equal to n, Xi ∈ Rn. The problem is to reduce the dimensionality n to p (p < n). It is needed to find lower-dimensional vectors Yi ∈ Rp corresponding to the vectors Xi, i = 1, . . . , m. If p = 2, we get two-dimensional vectors Y1, Y2, . . . , Ym ∈ R2 (Yi = (yi1, yi2), i = 1, . . . , m), and can present them in a Cartesian coordinate system as a scatter plot. Thus, we visualize the multidimensional data. Denote the distance between the vectors Xi and Xj by d(Xi, Xj), and the distance between the corresponding vectors in a lower-dimensional space (Yi and Yj) by d(Yi, Yj). The goal of multidimensional scaling (MDS) is to find lower-dimensional vectors Yi, i = 1, . . . , m, such that the distances between the vectors in the lower-dimensional space are as close to the original distances (or other proximities) as possible. If the square-error cost is used, the objective function (stress) to be minimized can be written as
$$E_{MDS} = \sum_{i<j} w_{ij}\left(d(X_i, X_j) - d(Y_i, Y_j)\right)^2. \quad (2)$$
There exists a multitude of variants of MDS with different weights wij and different optimization algorithms: the simple gradient descent method, conjugate gradients, the quasi-Newton method, simulated annealing, a combination of the genetic algorithm and a quasi-Newton descent algorithm [6], and two-level minimization [7]. In this paper, we use the SMACOF (Scaling by majorization of a complicated function) algorithm for the minimization of the stress function EMDS (2). It is one of the best optimization algorithms for this type of minimization problem. This method is simple and powerful, because it guarantees a monotone convergence of the stress function [5]. The MDS algorithm for finding two-dimensional vectors is summarized as follows:
1. Set the initial values of the two-dimensional vectors Y at t = 0.
2. Compute the value of the stress function EMDS(Y(t)) for the two-dimensional vectors Y set at t = 0.
3. Increase the iteration counter t by one.
4. Compute the Guttman transform Y(t) by (3).
5. Compute EMDS(Y(t)) by (2).
6. If EMDS(Y(t−1)) − EMDS(Y(t)) < ε or t is equal to the maximum number of iterations, then stop (ε is a small positive constant); else go to Step 3.
Formula (3) is called the Guttman transform:
$$Y(t+1) = m^{-1} B(Y(t))\,Y(t), \quad (3)$$
where $B(Y(t))$ has the elements
$$b_{ij} = \begin{cases} -\dfrac{d(X_i, X_j)}{d(Y_i, Y_j)}, & \text{for } i \ne j \text{ and } d(Y_i, Y_j) \ne 0,\\ 0, & \text{for } i \ne j \text{ and } d(Y_i, Y_j) = 0, \end{cases} \qquad b_{ii} = -\sum_{j=1,\ j\ne i}^{m} b_{ij}.$$
Quantitative Criteria of Mapping. A way of comparing the mapping results is to calculate and estimate the value of the stress function EMDS (2). The error $\hat E_{MDS}$ (Stress-1) (4) is often used instead of EMDS, because the inclusion of the normalizing term $\sum_{i<j}(d(X_i, X_j))^2$ gives a clear interpretation of the mapping quality that does not depend on the scale and the number of distances in the n-dimensional space. The reason for using $\hat E_{MDS}$ rather than the squared error $\hat E_{MDS}^2$ is that $\hat E_{MDS}^2$ is almost always very small in practice, so the $\hat E_{MDS}$ values are easier to discriminate [5]:
$$\hat E_{MDS} = \sqrt{\frac{\sum_{i<j}\left(d(X_i, X_j) - d(Y_i, Y_j)\right)^2}{\sum_{i<j} d(X_i, X_j)^2}}. \quad (4)$$
The problem of an objective comparison of the mapping results arises when the multidimensional data are visualized using various methods that optimize different criteria of the mapping quality. Some criteria are described and used for the comparison in [8]. In this paper, we use the topology preserving measure introduced by König in [9]. This measure was applied in [10]. The μ nearest vectors (neighbours) of each n-dimensional vector Xj, j ∈ (1, . . . , m), are denoted by $N_{ij}^X$, i ∈ (1, . . . , μ); the μ nearest vectors of each two-dimensional vector Yj, j ∈ (1, . . . , m), corresponding to Xj, are denoted by $N_{ij}^Y$, i ∈ (1, . . . , μ). The topology preserving measure for the jth vector and the ith neighbour is calculated as follows:
$$E_{TPM}^{ij} = \begin{cases} 3, & \text{if } N_{ij}^X = N_{ij}^Y,\\ 2, & \text{if } N_{ij}^X = N_{lj}^Y,\ l \in (1,\dots,\mu),\ i \ne l,\\ 1, & \text{if } N_{ij}^X = N_{tj}^Y,\ t \in (\mu+1,\dots,\nu),\ \mu \ne \nu,\\ 0, & \text{otherwise.} \end{cases}$$
The general measure ETPM is calculated as follows:
$$E_{TPM} = \frac{1}{3\mu \times m}\sum_{i=1}^{\mu}\sum_{j=1}^{m} E_{TPM}^{ij}. \quad (5)$$
Typical parameter settings are μ = 4 and ν = 10 [10]. The range of ETPM is between 0 and 1, where ETPM = 0 indicates a poor neighbourhood preservation, and ETPM = 1 indicates a perfect neighbourhood preservation.
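The sketch below implements the SMACOF iteration with the Guttman transform (3) for unit weights and the Stress-1 criterion (4); the initialization, iteration limits, and stopping threshold are arbitrary choices rather than the settings used in the experiments.

import numpy as np
from scipy.spatial.distance import cdist

def stress(delta, Y):
    # raw stress (2) with unit weights w_ij = 1
    D = cdist(Y, Y)
    iu = np.triu_indices_from(delta, k=1)
    return ((delta[iu] - D[iu]) ** 2).sum()

def stress1(X, Y):
    # normalized Stress-1 criterion (4)
    dX, dY = cdist(X, X), cdist(Y, Y)
    iu = np.triu_indices_from(dX, k=1)
    return np.sqrt(((dX[iu] - dY[iu]) ** 2).sum() / (dX[iu] ** 2).sum())

def smacof(X, p=2, iters=300, eps=1e-9, seed=0):
    # SMACOF iteration with the Guttman transform (3)
    m = X.shape[0]
    delta = cdist(X, X)                      # d(X_i, X_j)
    Y = np.random.default_rng(seed).normal(size=(m, p))
    prev = stress(delta, Y)
    for _ in range(iters):
        D = cdist(Y, Y)
        with np.errstate(divide='ignore', invalid='ignore'):
            B = np.where(D > 0, -delta / D, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))  # b_ii = -sum_{j != i} b_ij
        Y = B @ Y / m                        # Guttman transform
        cur = stress(delta, Y)
        if prev - cur < eps:
            break
        prev = cur
    return Y

X = np.random.default_rng(2).normal(size=(30, 5))
Y = smacof(X)
print("Stress-1:", stress1(X, Y))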
2.3 Combination of Neural Gas and Self-Organizing Map with Multidimensional Scaling
After a large number of training steps, the neural gas network and the self-organizing map have been organized and the n-dimensional input vectors X1, X2, . . . , Xm have been mapped: each input vector is related to the nearest neuron. Some neurons may remain unrelated to any vector of the set X = {X1, X2, . . . , Xm}, but there may also occur neurons related to several input vectors. The neurons related to the input vectors are called neurons-winners. The number of neurons-winners r is usually smaller than that of all neurons N (r ≤ N). In the case of the rectangular topology of the SOM, we can draw a table with cells corresponding to the neurons. However, the table does not answer the question how close the vectors of the neighbouring cells are in the n-dimensional space. A natural idea arises to apply a distance-preserving method to an additional mapping of the neurons-winners of the SOM. MDS-type methods may be used for this purpose. In [8], [11], some combinations of the SOM and MDS-type methods are proposed and discussed. Neurons-winners obtained by the neural gas method may be visualized by MDS-type methods, too. A cross-entropy approach to visualize the neurons-winners obtained by the neural gas is proposed in [10], where the method is compared with some other combinations. In this paper, we propose a combination of the neural gas and the self-organizing map with multidimensional scaling based on SMACOF minimization. A scheme of mapping (visualization) of the neurons-winners by MDS is presented in Fig. 1.
Fig. 1. Scheme of visualization of the neuron-winners
Table 1. Data sets

No  Name          m     n   Number of classes  Description
1   chainlink     1000  3   2                  linear not separable
2   hepta         212   3   7                  clearly defined clusters
3   target        770   2   6                  outliers
4   auto MPG      392   7   4                  linear not separable
5   rand clust10  100   10  5                  random clusters
6   rand clust5   100   5   5                  random clusters
Fig. 2. Computing time of NG (left) and SOM (right)
Fig. 3. Number of vectors-winners in NG and SOM
3 Experimental Investigations

3.1 Data Sets

Some data sets are used in the experimental investigations (see Table 1). Each data set has some specific characteristics. The first three data sets are taken from [12], the fourth data set is taken from [13], and the last two data sets are generated by us.

3.2 Dependence of Computing Time on the Number of Neurons
The computing time of the neural gas (NG) and the self-organizing map (SOM) is investigated. Fig. 2 shows the dependence of the computing time on the number N of neurons. The curves are approximated by exponential functions precisely enough (R² ≥ 0.9), which means that the computing time increases exponentially as the number of neurons grows. Moreover, the computations of the NG take much longer than those of the SOM. Owing to the space limit of the paper, the results for only four data sets are presented; analogous results hold for the other data sets.
Fig. 4. Quantization error (QE) of NG (left) and SOM (right)
3.3 Numbers of Neurons-Winners

It is of interest to investigate by which method (NG or SOM) more neurons become winners. The ratios between the number of neurons-winners and that of all neurons for the NG and the SOM are presented in Fig. 3. It is shown that the ratios of the NG are larger than those of the SOM: about 80% of the NG neurons become winners. If the numbers of neurons are large, only about 50% of the SOM neurons become winners.

3.4 Quality of Quantization and Visualization

The quantization error EQE (1) is calculated to estimate the quality of quantization. The quantization error decreases when the number of neurons is increasing (Fig. 4). As we see in Fig. 4, the quantization errors of the NG are smaller than those of the SOM when the numbers of neurons-winners are approximately equal. It means that the neural gas is more suitable for vector quantization. In order to estimate the quality of mapping the neurons-winners, two measures are calculated: $\hat E_{MDS}$ (4) (Stress-1) and the topology preserving measure ETPM (5).
Fig. 5. Stress-1 values of mapping the NG and SOM neurons-winners
Fig. 6. Topology preserving measure (TPM) of mapping the NG and SOM neurons-winners
Fig. 7. Mapping images of hepta data obtained by NG (left) and SOM (right)
The values of $\hat E_{MDS}$ are presented in Fig. 5. When the neurons-winners of the NG and SOM are mapped, the values of $\hat E_{MDS}$ are approximately equal for the chainlink, hepta, and target data, whose dimensionality n is small (n = 2 or n = 3). When the neurons-winners of the SOM are mapped, the values of $\hat E_{MDS}$ are smaller than the values of $\hat E_{MDS}$ obtained when the neurons-winners of the NG are mapped (auto MPG, rand clust10, rand clust5 data sets); the dimensionality n of these data is larger (n = 7, 5 or 10). The values of the topology preserving measure ETPM of mapping the NG and SOM neurons-winners are calculated and the results are presented in Fig. 6. Fig. 6 shows that the values of ETPM of mapping the NG neurons-winners are smaller than those of the SOM neurons-winners. It means that the SOM preserves the neighbourhood better.
Fig. 8. Mapping images of chainlink data obtained by NG (left) and SOM (right)
Fig. 9. Mapping images of target data obtained by NG (left) and SOM (right)
3.5 Mapping Images
When the neurons-winners, which are n-dimensional vectors $\hat M_1, \hat M_2, \dots, \hat M_r$, are mapped by multidimensional scaling, two-dimensional vectors Y1, Y2, . . . , Yr are obtained. Here r is the number of neurons-winners. The two-dimensional vectors may be presented in a scatter plot (Figs. 7-9). The numbers near the points indicate the order numbers of the classes to which the points belong. Figs. 7-9 show how the mapping images change when the number of neurons-winners is growing. The data structure is visible even when the number r of neurons-winners obtained by the NG is rather small. If the number r of neurons-winners obtained by the SOM is larger, the data structure is visible as well.
4 Conclusions
In this paper, some data mining methods are investigated. Two vector quantization methods (neural gas and self-organizing maps) are used for reducing the amount of the analyzed data. The neurons-winners are visualized by multidimensional scaling. The dependence of computing time on the number of neurons, the number of neurons-winners, the quantization and mapping qualities, and the preservation of the data structure in the mapping image are investigated. The experiments show that:
1. The computing time of the NG and SOM increases exponentially when the number of neurons is increasing. Moreover, the computations of the NG take much longer than those of the SOM.
2. The ratios between the number of neurons-winners and that of all neurons are larger for the NG than for the SOM: about 80% of the NG neurons become winners. When the numbers of neurons are large, only about 50% of the SOM neurons become winners.
3. The quantization error decreases when the number of neurons is growing. The quantization errors of the NG are smaller than those of the SOM when the numbers of neurons-winners are approximately equal. It means that the neural gas is more suitable for vector quantization.
4. When the neurons-winners of the NG and SOM are mapped, the values of Stress-1 are approximately equal for the data whose dimensionality n is small. When the neurons-winners of the SOM are mapped, the values of Stress-1 are smaller for the data whose dimensionality n is larger.
5. The values of the topology preserving measure of mapping the NG neurons-winners are smaller than those of mapping the SOM neurons-winners. It means that the SOM preserves the neighbourhood better.
6. The data structure is visible in the mapping image even when the number r of neurons-winners of the NG is rather small. If the number r of neurons-winners of the SOM is larger, the data structure is visible as well.
Acknowledgments. The research is partly supported by the Lithuanian State Science and Studies Foundation project ”Information technology tools of clinical decision support and citizens wellness for e.Health system, Info Health (No. B-07019)”.
References
1. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer Series in Information Science, vol. 30. Springer, Berlin (2001)
2. Martinetz, T.M., Schulten, K.J.: A Neural-Gas Network Learns Topologies. In: Kohonen, T., Mäkisara, K., Simula, O., Kangas, J. (eds.) Artificial Neural Networks, pp. 397–402. North-Holland, Amsterdam (1991)
3. Chen, C., Härdle, W., Unwin, A.: Handbook of Data Visualization. Springer, Berlin (2008)
4. Dzemyda, G., Kurasova, O., Medvedev, V.: Dimension Reduction and Data Visualization Using Neural Networks. In: Maglogiannis, I., Karpouzis, K., Wallace, M., Soldatos, J. (eds.) Emerging Artificial Intelligence Applications in Computer Engineering. Real World AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies. Frontiers in Artificial Intelligence and Applications, vol. 160, pp. 25–49. IOS Press, Amsterdam (2007)
5. Borg, I., Groenen, P.: Modern Multidimensional Scaling. Springer, New York (2005)
6. Mathar, R., Žilinskas, A.: On Global Optimization in Two-Dimensional Scaling. Acta Applicandae Mathematicae 33, 109–118 (1993)
7. Žilinskas, A., Žilinskas, J.: Two Level Minimization in Multidimensional Scaling. Journal of Global Optimization 38(4), 581–596 (2007)
8. Bernatavičienė, J., Dzemyda, G., Kurasova, O., Marcinkevičius, V.: Optimal Decisions in Combining the SOM with Nonlinear Projection Methods. European Journal of Operational Research 173, 729–745 (2006)
9. König, A.: Interactive Visualization and Analysis of Hierarchical Neural Projections for Data Mining. IEEE Transactions on Neural Networks 11(3), 615–624 (2000)
10. Estévez, P.A., Figueroa, C.J., Saito, K.: Cross-Entropy Embedding of High-Dimensional Data Using the Neural Gas Model. Neural Networks 18, 727–737 (2005)
11. Dzemyda, G., Kurasova, O.: Heuristic Approach for Minimizing the Projection Error in the Integrated Mapping. European Journal of Operational Research 171(3), 859–878 (2006)
12. Fundamental Clustering Problem Suite, http://www.uni-marburg.de/fb12/datenbionik/data/
13. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
Discretization of Target Attributes for Subgroup Discovery

Katherine Moreland 1 and Klaus Truemper 2

1 The MITRE Corporation, McLean, VA 22102, U.S.A.
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A.
Abstract. We describe an algorithm called TargetCluster for the discretization of continuous targets in subgroup discovery. The algorithm identifies patterns in the target data and uses them to select the discretization cutpoints. The algorithm has been implemented in a subgroup discovery method. Tests show that the discretization method likely leads to improved insight. Keywords: Subgroup Discovery, Logic, Classification, Feature Selection.
1 Introduction
The task of subgroup discovery is defined as follows. Given a data set that is representative of a particular population, find all statistically interesting subgroups of the data set for a given target attribute of interest. Target attributes may be binary, nominal, or continuous. Many subgroup discovery algorithms can handle only binary target attributes, so continuous target attributes must be discretized prior to the application of these algorithms. In this paper, we present a new algorithm called TargetCluster for that task, as well as a new subgroup discovery method.

There are three main goals of target discretization. First, clusters should be densely populated, since dense clusters are likely to represent similar cases. Second, clusters should be clearly distinct, since two clusters located close together may actually correspond to a similar target group. Finally, isolated points that do not convincingly fall into a cluster should be skipped, since they are unlikely to be part of an interesting target group.

Let us begin by evaluating how well current approaches for unsupervised discretization meet these goals. The approaches can be classified as one of three types: 1) equal-width intervals, 2) equal-frequency intervals, or 3) clustering. The first two types simply divide the attribute range into a series of equal-width or equal-frequency intervals. Consider the target attribute shown in Figure 1: both the equal-width and the equal-frequency method identify clusters with low density that are located very close to neighboring clusters. Existing clustering approaches do not satisfy the third goal of target discretization, since they assign all points to clusters, with the exception of outliers. We conclude that current unsupervised discretization methods are not well suited for target discretization. We propose a new method for target discretization that simultaneously achieves all three goals. The algorithm uses a dynamic programming approach; the next section provides details.

Fig. 1. Example target discretization using equal-width- and equal-frequency-intervals
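To make the two baseline discretizations mentioned above concrete, here is a minimal sketch (our own illustration, not part of TargetCluster); the function names and the use of NumPy are ours.

```python
import numpy as np

def equal_width_cutpoints(values, k):
    # k - 1 cutpoints splitting [min, max] into k intervals of equal width
    lo, hi = float(min(values)), float(max(values))
    return [lo + (hi - lo) * i / k for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    # cutpoints at the i/k quantiles, so each interval holds roughly N/k values
    v = np.sort(np.asarray(values, dtype=float))
    return [float(np.quantile(v, i / k)) for i in range(1, k)]
```

Neither baseline inspects the density structure of the target values, which is exactly the weakness Figure 1 illustrates.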
2 Solution Algorithm for Cluster Problem
Let the values of the target attribute to be discretized be $v_1, v_2, \ldots, v_N$ in ascending order. We separate clusters of these values using intervals containing roughly $\log_2 N$ consecutive values. We say "roughly" since the actual number of values can be somewhat less. We cover details later, when we also motivate the selected size of intervals. Two intervals are disjoint if they do not contain a common value. A collection of intervals is acceptable if the intervals are pairwise disjoint. Suppose we have a collection of acceptable intervals. If we color the values $v_i$ occurring in the intervals red and the remaining values green, then there are green subsequences separated by red subsequences. The values in each green subsequence define a cluster. A collection of intervals is feasible if it is acceptable and if each cluster contains at least two values $v_i$. We want a valuation of feasible collections of intervals. To this end, we define the width $w_k$ of the $k$th cluster, say containing values $v_i, v_{i+1}, \ldots, v_j$, to be

$$w_k = v_j - v_i \qquad (1)$$
The score of a feasible collection $c$ is

$$s_c = \max_k w_k \qquad (2)$$
If the score is small, then all clusters have small width, an attractive situation. Let C be the set of feasible c. An optimal collection c∗ ∈ C has minimum score
among the $c \in C$, and thus is the most attractive one. Therefore, the score of $c^*$ is defined via

$$s_{c^*} = \min_{c \in C} s_c \qquad (3)$$
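The definitions behind Eqs. (1)-(3) can be illustrated with a small sketch. It is our own illustration with 0-based indices; it only evaluates one given collection of separating intervals and does not search for the optimal collection c*.

```python
def clusters_from_intervals(values, intervals):
    """values: the sorted target values v_1..v_N (0-based list here);
    intervals: disjoint (start, end) index pairs of the 'red' separating
    intervals. Returns the 'green' clusters as lists of values."""
    red = set()
    for a, b in intervals:
        red.update(range(a, b + 1))
    clusters, current = [], []
    for i, v in enumerate(values):
        if i in red:
            if current:
                clusters.append(current)
                current = []
        else:
            current.append(v)
    if current:
        clusters.append(current)
    return clusters

def collection_score(values, intervals):
    # Eqs. (1)-(2): score = maximum cluster width; a collection with a
    # cluster of fewer than two values is infeasible and scores +infinity
    clusters = clusters_from_intervals(values, intervals)
    if not clusters or any(len(c) < 2 for c in clusters):
        return float("inf")
    return max(c[-1] - c[0] for c in clusters)
```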
We determine an optimal collection $c^*$ by dynamic programming as follows. For $m \ge 0$ and $1 \le n \le N$, we compute optimal solutions for the restricted case of $m$ intervals and for the sequence of values $v_1, v_2, \ldots, v_n$. We denote the optimal score under these restrictions by $s^m_n$. The stages of the dynamic programming recursion are indexed by $m$, where $m = 0$ means that no interval is used at all. In a slight change of notation, denote the width of a cluster starting at $v_l$ and terminating at $v_r$ by $w_{l,r}$. Since each cluster of a feasible $c$ contains at least two values $v_i$, the scores $s^0_n$ for the stage $m = 0$ are given by

$$s^0_n = \begin{cases} \infty, & \text{if } n = 1 \\ w_{1,n}, & \text{otherwise} \end{cases} \qquad (4)$$

To state the recursive formula for any stage $m \ge 1$, let $a$ be any acceptable interval for the sequence $v_1, v_2, \ldots, v_n$. Referring again to the coloring of the values, this time based on the values in $a$ being red and all other values being green, define the largest $v$-value that is smaller than all red values to have index $a_l$, and define the smallest $v$-value that is larger than all red values to have index $a_r$. Differently stated, the green values $v_{a_l}$ and $v_{a_r}$ bracket the red values of the interval $a$. With these definitions, the recursive equation for stage $m \ge 1$ can be stated as follows.

$$s^m_n = \min_a \max\{s^{m-1}_{a_l}, w_{a_r,n}\} \qquad (5)$$
We still must discuss the choice of the size of the intervals. Suppose the interval is chosen to be very small. Then random changes of values may move a value from one cluster across the interval into another cluster, an unattractive situation. We want to choose the interval large enough so that such movement becomes unlikely. In [2], this situation is investigated for the classification of records into two classes. For the case of $N$ records, the conclusion of the reference is that the interval should have roughly $\log_2 N$ data points. In the present context, that size can lead to odd choices, given that values may occur in multiplicities. Using some experimentation, the following modification leads to reasonable results. First, a tolerance is defined as

$$\epsilon = (v_{a_r} - v_{a_l})/10 \qquad (6)$$

Take any value $q \le N - 1$ such that $p = q - \log_2 N \ge 2$. Declare the values $v_i$ satisfying $v_p + \epsilon \le v_i \le v_q - \epsilon$ to be the points of an interval associated with index $q$. By having $q$ take on all possible values, we obtain all intervals of interest for the optimization problem. There is one more refinement of the method, based on experimentation. We do not want to obtain clusters that are not dense, where the density of a cluster with values $v_i, v_{i+1}, \ldots, v_j$ is defined to be

$$d_{i,j} = w_{i,j}/(j - i + 1) \qquad (7)$$
Specifically, we do not want to have clusters that have less density than the cluster containing the entire sequence $v_1, v_2, \ldots, v_N$. Thus, we declare the width
of any cluster whose density $d_{i,j}$ is less than $d_{1,N}$ to be equal to $\infty$. We omit the simple modification of the above optimization formulas (4) and (5) that accommodates this change.
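A simplified sketch of the dynamic program is given below. It is our own illustration under two simplifying assumptions: every separating interval consists of exactly ceil(log2 N) consecutive values, and the tolerance of Eq. (6) and the density rule of Eq. (7) are omitted. It is meant to show the recursion of Eqs. (4) and (5), not to reproduce TargetCluster exactly.

```python
import math

def width(v, l, r):
    # w_{l,r} = v_r - v_l for the cluster spanning positions l..r (1-based)
    return v[r - 1] - v[l - 1]

def cluster_scores(v, max_intervals):
    N = len(v)
    L = max(1, math.ceil(math.log2(N)))   # interval size, roughly log2 N
    INF = float("inf")
    # s[m][n] = optimal score with m intervals on the prefix v_1..v_n
    s = [[INF] * (N + 1) for _ in range(max_intervals + 1)]
    for n in range(2, N + 1):             # stage m = 0, Eq. (4)
        s[0][n] = width(v, 1, n)
    for m in range(1, max_intervals + 1):
        for n in range(2, N + 1):
            # the red block occupies positions j..j+L-1; its green brackets
            # are a_l = j-1 and a_r = j+L, and both end clusters need >= 2 values
            for j in range(3, n - L):
                a_l, a_r = j - 1, j + L
                s[m][n] = min(s[m][n], max(s[m - 1][a_l], width(v, a_r, n)))  # Eq. (5)
    return s
```

With `s = cluster_scores(v, M)`, the best achievable score on the full sequence using at most M intervals is `min(s[m][N] for m in range(M + 1))`.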
3 Implementation
We created a subgroup discovery algorithm using the EXARP learning algorithm of [7], which in turn utilizes the Lsquare algorithm of [3,4,8]. Each attribute is of one of three possible types: (1) target attribute only, (2) explanation attribute only, or (3) both target and explanation attribute. The goal is to explain the target attributes using the explanation attributes. The explanations correspond to the potentially interesting subgroups for each target. If an attribute is labeled both a target and an explanation attribute, it may be explained since it is a target, and it may be part of explanations for other targets.

Subgroup discovery proceeds as follows. The data set is evenly divided into a training set and a testing set. We set the testing data aside and for the moment use only the training data. Each target is discretized as described in Section 2. For a given target, this yields j cutpoints with j uncertain intervals, for j = 1, 2, . . . For each such j, we use the intervals to partition the data into two sets, A_train and B_train, in two ways. In the first case, for a given interval, records with a target value below the interval make up the A_train set, while records with a target value above the interval comprise the B_train set. If a record's target value falls within the interval, the record is considered uncertain and is not used in the computation of explanations. The second case enumerates all pairs of consecutive intervals, k1 and k2. For each pair, records with a target value less than interval k1 or larger than interval k2 make up the A_train set, while records with a target value larger than interval k1 and less than interval k2 make up the B_train set. Analogous to the first case, records with a target value that falls within interval k1 or k2 are uncertain and not used to compute the explanations. The testing data is partitioned into an A and a B set in the same fashion.

The EXARP method of [7] is called to compute two DNF logic formulas that separate the A_train and B_train sets using only the explanation attributes. One formula evaluates to True for all records of the A_train set, while the other evaluates to True for records of the B_train set. Each clause of the resulting formulas describes a potentially interesting target subgroup. Using the discretized testing data, we calculate the significance of subgroups somewhat differently from current methods [1,5,6]. Suppose a formula evaluates to True on the A records, and let T be a clause of the formula. We define the subgroup S associated with clause T to be

$$S = \{x \in A \mid \text{clause } T \text{ evaluates to True}\} \qquad (8)$$
Let n be the number of records in S. A random process is postulated that constructs records of the A set using the clause T . We skip the details, but mention the following. Let p be the probability that the random process produces at least n records that satisfy the clause. Define q to be the fraction of set B
for which the clause correctly has the value False. Then the clause significance is given by

$$s = (1 - p + q)/2 \qquad (9)$$

To reduce the number of significant explanations in the output to the user, logically equivalent explanations are grouped together in equivalence classes, as follows. Each attribute appearing in a clause is of one of four possible types: (1) the attribute has low values (e.g., x < 2.5), (2) high values (e.g., x > 7.5), (3) low or high values (e.g., x < 2.5 || x > 7.5), or (4) neither low nor high values (e.g., 2.5 < x < 7.5). Two explanations are logically equivalent if they meet the following criteria: (1) the explanations describe subgroups for the same target attribute, (2) the explanations contain the same attributes, and (3) each attribute is of the same type in both explanations. The explanation in the equivalence class with the highest significance is chosen as the representative for the class. The representative explanations constitute the interesting subgroups presented to the user.
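A hedged sketch of these two post-processing steps follows. The dictionary layout used for explanations is our own choice rather than the paper's data structure, and the probability p and fraction q are assumed to be computed elsewhere.

```python
from collections import defaultdict

def clause_significance(p, q):
    # Eq. (9): p = probability that the random process produces at least n
    # records satisfying the clause, q = fraction of the B set on which the
    # clause is correctly False
    return (1.0 - p + q) / 2.0

def representative_explanations(explanations):
    """explanations: dicts with keys 'target', 'attribute_types' (attribute ->
    one of 'low', 'high', 'low_or_high', 'middle'), and 'significance'."""
    classes = defaultdict(list)
    for e in explanations:
        key = (e["target"], tuple(sorted(e["attribute_types"].items())))
        classes[key].append(e)
    # one representative per equivalence class: the most significant explanation
    return [max(group, key=lambda e: e["significance"]) for group in classes.values()]
```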
4 Computational Results
For testing of the TargetCluster algorithm, we use the Abalone, Heart Disease, and Wine sets from the UC Irvine Machine Learning Repository, as well as two data sets supplied by H. Bickel, Technical University Munich, Germany, and S. Kümmel, Charité Berlin, Germany. The data set from Technical University Munich concerns dementia and is called SKT; it contains binary and continuous attributes. The data set from Charité Berlin is called Cervical Cancer and contains only continuous attributes. Thus, we have 5 data sets called Abalone, Cervical Cancer, Heart Disease, SKT, and Wine. Table 1 summarizes the data sets.

Table 1. Summary of Data Sets

Data Set          No. of Rec's  No. of Attr's  Target Only                   Explanation Only                 Both
Abalone           4177          9              rings                         number of measurements, gender   none
Cervical Cancer   109           11             none                          none                             serum levels
Heart Disease     303           14             none                          age                              patient measurements
SKT               183           20             9 test scores, overall score  patient measurements             none
Wine              178           14             none                          none                             color intensity, hue, intensity levels

We compare the number and significance of subgroups obtained via TargetCluster discretization with the results found via a combined equal-width- and equal-frequency-intervals method (EWF). In both cases, the discretization data are input to the subgroup discovery method described in Section 3. The graphs of Figures 2-6 illustrate the number of subgroups identified when EWF and TargetCluster are used and different significance thresholds are applied. Except for the Heart Disease case, TargetCluster leads to more subgroups with high significance than EWF. Table 2 shows the significance value of the highest-ranking subgroup produced via each of the discretization methods. In all cases, TargetCluster discretization leads to the discovery of at least one additional subgroup with significance ≥ 0.98. For instance, on the Cervical Cancer data set, 3 subgroups with significance ≥ 0.98 are identified with TargetCluster that are not identified using EWF.

Fig. 2. Abalone Data Set

Table 2. TargetCluster vs. EWF

                  Significance of Highest  No. of Subgroups Identified with
                  Ranking Subgroup         Significance ≥ 0.98
Data Set          EWF    TargetCluster     EWF   TargetCluster  BOTH  TOTAL
Abalone           0.97   0.98              0     1              0     1
Cervical Cancer   0.95   0.99              0     3              0     3
Heart Disease     0.99   0.98              2     1              0     3
SKT               0.93   0.98              0     1              0     1
Wine              0.99   0.99              2     7              2     11
Average           0.966  0.984             0.8   2.6            0.4   3.8
Fig. 3. Cervical Cancer Data Set
Fig. 4. Heart Disease Data Set
On average, TargetCluster results in subgroups with higher significance than EWF. For the SKT data set, EWF does not result in any subgroup with significance higher than 0.93. However, the TargetCluster method leads to a subgroup with significance of 0.98. On average, TargetCluster results in 2.6 additional subgroups per data set not produced when EWF is used.
Fig. 5. SKT Data Set
Fig. 6. Wine Data Set
Based on these findings, the TargetCluster method is likely a useful additional discretization tool for subgroup discovery.
5 Summary
This paper describes the TargetCluster algorithm for discretization of continuous targets in subgroup discovery. The algorithm is based on a clustering approach on the real line. TargetCluster has been combined with a subgroup discovery method. In tests, the use of TargetCluster resulted in a number of additional interesting subgroups when compared with standard unsupervised discretization methods.
References
1. Atzmueller, M., Puppe, F., Buscher, H.-P.: Exploiting background knowledge for knowledge-intensive subgroup discovery. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005) (2005)
2. Bartnikowski, S., Granberry, M., Mugan, J., Truemper, K.: Transformation of rational and set data to logic data. In: Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, Heidelberg (2006)
3. Felici, G., Sun, F., Truemper, K.: Learning logic formulas and related error distributions. In: Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, Heidelberg (2006)
4. Felici, G., Truemper, K.: A MINSAT approach for learning in logic domain. INFORMS Journal of Computing 14, 20–36 (2002)
5. Klösgen, W.: Subgroup discovery. In: Handbook of Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (2002)
6. Lavrač, N., Flach, P., Kavsek, B., Todorovski, L.: Adapting classification rule induction to subgroup discovery. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002) (2002)
7. Riehl, K., Truemper, K.: Construction of deterministic, consistent, and stable explanations from numerical data and prior domain knowledge. Working paper (2008)
8. Truemper, K.: Design of Logic-based Intelligent Systems. Wiley, Chichester (2004)
Preserving Privacy in Time Series Data Classification by Discretization

Ye Zhu¹, Yongjian Fu¹, and Huirong Fu²

¹ Cleveland State University, Cleveland, OH 44115, USA
[email protected], [email protected]
² Oakland University, Rochester, MI 48309, USA
[email protected]
Abstract. In this paper, we propose discretization-based schemes to preserve privacy in time series data mining. Traditional research on preserving privacy in data mining focuses on time-invariant privacy issues. With the emergence of time series data mining, traditional snapshot-based privacy issues need to be extended to become multi-dimensional with the addition of the time dimension. We define three threat models based on the trust relationship between the data miner and the data providers, and we propose a different scheme for each of the three threat models. The proposed schemes are extensively evaluated against publicly available time series data sets [1]. Our experiments show that the proposed schemes can preserve privacy at the cost of a reduction in mining accuracy, and that for most data sets they achieve low privacy leakage with only a slight reduction in classification accuracy. We also study the effect of the parameters of the proposed schemes.
1 Introduction

Privacy has been a serious issue with the growing popularity of data mining. To ease privacy concerns, data mining researchers have proposed privacy-preserving data mining [2], in which data is mined in a way such that its privacy is protected. Many approaches have been proposed for privacy-preserving data mining. To protect privacy, some modify the data by adding noise, generalization, transformation, and so on. Others split the data among multiple parties, who cooperate to find global patterns without sharing data. Mining of time series data has been an interesting topic due to the abundance of time series data. Though many privacy-preserving data mining approaches have been successful, they are either ineffective or inapplicable for time series data such as stock prices, sales, and weather. Privacy is an important issue in time series data mining for which few solutions have been proposed yet. Due to the temporal dimension of time series data, its privacy goes beyond the traditional privacy of data. In time series data mining, we need to preserve privacy not only in the time domain, i.e., the data values, but also in the frequency domain, e.g., periodicity. For example, a store wants to protect not only its sales data, but also the periodicity of its sales, which could reveal sales promotions. Moreover, many existing privacy-preserving techniques cannot be directly applied to time series data. For example, if random noise added to a time series is independent of the time series, the noise can easily be filtered out to reveal the time series.

In this paper, we propose to use discretization for preserving privacy in time series data mining. In our approach, a time series is discretized such that its domain is segmented into a set of ranges. A representative value is selected for each range. Each individual value is then mapped into a range and replaced by the corresponding representative value. This is similar to generalization, where a value is replaced by its range. We choose to use representatives instead of ranges so that the modified data is still a time series. Extensive experiments have been conducted using real data sets.

The rest of the paper is organized as follows. In Section 2, we review related work in privacy preservation and time series data mining. Section 3 identifies privacy issues in time series data mining. We describe the system model and the threat model in Section 4. In Section 5, we present discretization-based schemes to preserve privacy in time series data mining. We evaluate the proposed schemes in Section 6 in terms of privacy leakage and mining accuracy. In Section 7, we discuss related issues. We conclude the paper in Section 8.

* This work was partly supported by the National Science Foundation under Grant No. CNS0716527. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
2 Related Work

In this section, we first briefly summarize research in privacy-preserving data mining. We then discuss related work on privacy preservation in time series data mining.

Many approaches have been proposed for privacy-preserving data mining. These approaches can be broadly classified into two categories. The first category can be labeled as data modification approaches, in which data is modified to protect privacy and mining is performed on the modified data. Modification techniques in this category include adding random noise [2,3], transformation [4], aggregation [5], generalization [6,7,4,8], randomization [9,10,11], and suppression [8,4]. The second category of approaches does not modify data. Instead, data is split among multiple parties, who use secure multi-party computation to find global patterns without sharing local data, therefore protecting privacy. Most of these approaches use encryption as well. We call the second category data distribution approaches. Secure multi-party computation was introduced as a technique for privacy-preserving decision tree construction in [12]. It has been used for association rule mining [13,14], clustering with k-means [15,16], and learning Bayesian network structures [17].

Some recent papers specifically address privacy preservation in time series data mining. A privacy-preserving algorithm for mining frequent patterns in time series data has been proposed by da Silva and Klusch [18]. They assume that time series data is distributed among multiple parties. Each time series is first segmented, then discretized into a sequence of symbols. A frequent pattern is a sequence which occurs frequently in the time series. The algorithm uses encryption and secure multi-party computation to find global frequent patterns without breaching the privacy of individual parties.
Time series data privacy has been studied by Papadimitriou et al. [19]. They argue that time series data has unique characteristics in terms of privacy. In order to preserve privacy, they propose two perturbation methods, based on Fourier and wavelet transformations respectively. It is shown that white-noise perturbation does not preserve privacy, while the proposed methods are effective. In [20], we also point out that the privacy of time series data is more than just the identity of the data. We argue that existing privacy-preserving techniques such as aggregation and white noise are ineffective for preserving privacy in time series data, and we present a privacy attack that can separate individual time series from aggregate data or data with white noise. Though our paper addresses privacy issues in time series data mining, our approach is different from the previous papers. We propose to use discretization to preserve privacy instead of the secure multi-party computation proposed in [18]. Besides, [19] and [20] do not consider the effects of privacy preservation on data mining results; we investigate the relationship between privacy and classification accuracy.
3 Privacy Issues of Time Series

In this section, we identify privacy issues for time series data in addition to traditional privacy issues in data mining. Time series data from a data source can be regarded as a time-domain signal. All the characteristics of a time-domain signal can potentially be regarded as private information by the data provider. Below, we list common characteristics in time series data that a data provider may need to keep confidential; a small sketch of how such characteristics can be extracted follows the list.
– Amplitude: Amplitude indicates the strength of a signal. This is the same as in traditional privacy research.
– Average: The average signal strength over time. For example, for a series of sales data, the average amplitude indicates the average sales.
– Peak and trough: Peak and trough indicate extreme situations. This information is usually considered confidential, as it may disclose extreme changes in underlying causes, such as difficulties in money flow.
– Trend: By observing trends in time series data, an adversary may predict future changes of the time series. Thus trend information should be protected from competitors as well.
– Periodicity: Periodic changes in time series data indicate the existence of periodically changing factors. For the sales data of a store, the factor can be periodic changes in marketing strategies, such as promotions, which are usually regarded as confidential information for stores. Unlike the previous characteristics, which are in the time domain, periodicity is in the frequency domain.
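As an illustration of what these characteristics look like computationally, the sketch below (our own; it is not part of the proposed schemes) extracts them from a numeric series, using a linear fit for the trend and the discrete Fourier transform for the dominant periodicity.

```python
import numpy as np

def series_characteristics(x, fs=1.0):
    """fs is the sampling rate; the dominant frequency is read off the FFT."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    trend_slope = np.polyfit(t, x, 1)[0]                 # linear trend
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    dominant = freqs[1:][np.argmax(spectrum[1:])] if len(x) > 2 else 0.0
    return {
        "average": float(x.mean()),
        "peak": float(x.max()),
        "trough": float(x.min()),
        "trend_slope": float(trend_slope),
        "dominant_frequency": float(dominant),
    }
```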
4 System Model and Threat Model

In this section, we describe the system model and the threat model.
Fig. 1. Privacy-Preserving Time Series Data Classification Model
4.1 System Model

Based on the discretization approach, the privacy-preserving data classification system is as shown in Figure 1. In this model, data providers first discretize their own time series data to protect privacy and then transmit the discretized time series data to the data miner. The data miner constructs classification models and builds the classifier. The data miner can use the classifier to analyze available time series data, and can also use it to analyze data provided by a classification customer. The classification customer can be one of the data providers, since classification based on models built from time series data provided by more data providers is generally more accurate or more interesting to data providers. The classification customer can also be an outsider, i.e., not one of the data providers, who wants to analyze data against models built from data contributed by other data providers. The data miner may provide the classification service to such "outside" customers for profit.

4.2 Threat Model

In this paper, we assume the participants of the classification system are semi-honest. In other words, the participants follow the protocols strictly, but they can record the communication. In particular, the data miner can record the data supplied by data providers and analyze these data for other purposes, such as searching for sensitive information.
Based on the trust relationship between data providers and the data miner, the threat models can be divided into the following three categories:
– Low-Trust Threat Model: In this model, data providers are willing to share discretized time series data with the data miner. Data providers are not interested in data analysis through models built by the data miner. Data providers in this model mainly care about their data privacy: they are willing to share discretized data if their privacy can be preserved.
– High-Trust Threat Model: In this model, data providers are willing to share discretized time series data and are interested in utilizing models built by the data miner. Data providers in this model care about both their data privacy and the accuracy of the models partially built from the shared data.
– Full-Trust Threat Model: In this model, data providers have full trust in the data miner. Data providers are willing to share original time series data with the data miner without discretization. The data miner may sell the models built from the data supplied by data providers, so that an "outsider" can obtain the models and classify data by herself. The data miner cares about the accuracy of the models and has the responsibility to protect the data providers' privacy. To protect the data providers' privacy, the data miner discretizes the original time series data collected from data providers and builds the models from the discretized data. The system model corresponding to this threat model is therefore slightly different from the one shown in Figure 1: the discretization is done at the data miner's side.
All three models are valid in real-world applications. For example, the low-trust threat model can be used when data providers are competitors of each other. The high-trust model can be used when data providers want to do data mining based on global data for accuracy. The full-trust model can be used when data providers have full trust in the data miner. Such a full-trust relationship can be found between company branches and the company headquarters: company branches provide data for mining at the headquarters.
5 Preserving Privacy by Discretization

In this section, we first formally generalize the discretization method from previous definitions and then introduce our schemes.

5.1 Generalization of Discretization

Discretization is a process that converts continuous values to discrete values. More formally, the discretization process can be described as follows. First, the possible value space is partitioned into $N$ consecutive intervals $[-\infty, p_1), [p_1, p_2), [p_2, p_3), \cdots, [p_{N-1}, +\infty]$, where $p_i > p_j$ if $i > j$. The $i$th data sample in the original time series $s(t)$ is discretized as

$$d(i) = v_m, \quad \text{if } s(i) \in [p_{m-1}, p_m], \qquad (1)$$

where $d(i)$ denotes the $i$th sample in the discretized time series $d(t)$, $m$ is the index of the interval, and $v_m \in [p_{m-1}, p_m]$ is the discretized value. In the language of data compression, the index $m$ is also called the discretization level, and the vector $[v_1, v_2, \cdots, v_N]$ is called the codebook. Obviously, the partition parameters $p_1, \cdots, p_{N-1}$ and the discrete values $v_1, \cdots, v_N$ determine the discretization performance. In previous studies, these parameters are mostly chosen to minimize the discretization error, usually defined as the mean square error between the original time series and the discretized time series, i.e.,

$$\delta = \sum_{i=1}^{l} (s_i - d_i)^2. \qquad (2)$$
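As a concrete reading of Eqs. (1) and (2), the following sketch (our own; the names only mirror the notation above) maps a series onto a given codebook and computes the resulting error.

```python
import numpy as np

def discretize(s, cutpoints, codebook):
    """cutpoints = [p_1, ..., p_{N-1}] in ascending order,
    codebook = [v_1, ..., v_N]; returns the levels and the series d(t)."""
    s = np.asarray(s, dtype=float)
    levels = np.searchsorted(cutpoints, s, side="right")   # 0-based interval index
    return levels, np.asarray(codebook, dtype=float)[levels]

def discretization_error(s, d):
    # Eq. (2): summed squared difference between original and discretized series
    s, d = np.asarray(s, dtype=float), np.asarray(d, dtype=float)
    return float(np.sum((s - d) ** 2))
```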
In our research, the discretization parameters are instead chosen to preserve privacy and to minimize the classification error.

5.2 Privacy-Preserving Schemes Based on Discretization

We introduce privacy-preserving schemes for the three threat models below.
Algorithm 1. Scheme for Low-Trust Threat Model

Data Miner Side:
1. Broadcast a request message containing the classification accuracy limit (δlimit) to data providers;
2. Wait for messages containing the maximum variance from all the data providers;
3. Select the maximum variance reported by data providers and calculate N, the number of discretization levels;
4. Broadcast the number of discretization levels (N) to data providers;
5. Wait for data messages from data providers;
6. Upon receiving a data message, forward the message to the model construction component as shown in Figure 1.

Data Provider Side:
input: privlimit, the maximum acceptable privacy leakage of the data provider
1. Upon receiving a request message from the data miner, send back the maximum variance of the original time series;
2. Wait for the message containing the number of discretization levels (N) from the data miner;
3. Calculate the minimum possible privacy leakage (privmin) based on δlimit and the original time series;
4. if privmin > privlimit
5.   Goto 1;
6. else
7.   Discretize the original time series to achieve the minimum possible privacy leakage (privmin);
8.   Send a data message containing the codebook and discretization levels of each time series to the data miner;
9. end if
Scheme for Low-Trust Threat Model. Based on the system model shown in Figure 1, the communication protocol between the data miner and the data providers is shown in Algorithm 1. Before explaining the details of the scheme, we first introduce the metrics and parameters used in the scheme.

The first metric is privacy leakage. It is defined as the absolute value of the correlation between an original time series and its corresponding discretized time series, i.e.,

$$priv_{leak}(s(t), d(t)) = |corr(s(t), d(t))| \qquad (3)$$

where $corr()$ denotes the correlation function. The privacy leakage is within the range [0, 1]; a larger value of $priv_{leak}$ means less effectiveness in privacy preservation. As introduced in Section 3, all the characteristics of a time series can potentially be regarded as private information, so our privacy leakage metric is based on correlation, a general metric used to measure similarity.

The important parameter for data providers is $priv_{limit}$, the maximum acceptable privacy leakage. The larger $priv_{limit}$ is, the more private information is allowed to be disclosed, so a data provider who cares a lot about data privacy can choose a small $priv_{limit}$. The important parameter for the data miner is the discretization error limit $\delta_{limit}$. It is used by the data miner to ensure the accuracy of models built on discretized time series, i.e., the classification accuracy. Without $\delta_{limit}$, data providers may choose discretization parameters only to minimize privacy leakage, so that the classifier built on the discretized time series may have a large classification error.

Before discretizing time series data and sending the discretized data to the data miner, data providers need to exchange parameters with the data miner to ensure both classification accuracy and privacy protection. The exchange begins with the broadcast of the discretization error limit $\delta_{limit}$ from the data miner to the data providers. Data providers respond to the request message with the maximum variance of their own original time series. The data miner selects the overall maximum among the variances reported by the data providers and calculates the number of discretization levels ($N$). Then the data miner broadcasts $N$ to the data providers. After receiving $N$, each data provider calculates the minimum possible privacy leakage given the constraints of $N$ and $\delta_{limit}$; the calculation is introduced in the rest of this section. If the minimum possible privacy leakage is less than $priv_{limit}$, the data provider discretizes her own time series data to minimize privacy leakage. Otherwise, the data provider stops and waits for the next request message from the data miner. A data provider shares the discretized time series with the data miner by transmitting the codebook $[v_1, v_2, \cdots, v_{N+1}]$ and the discretization levels $[m_1, m_2, \cdots, m_l]$ of the samples in the original time series. The discretized time series can be recovered at the data miner's side from the codebook and the discretization levels.

Calculation of the Discretization Level N: The data miner needs to calculate the discretization level based on the maximum variance $\sigma^2$ and the discretization error limit $\delta_{limit}$. Before introducing the details of the calculation, we introduce the following theorem.
Theorem 1. The minimum number of discretization levels required to achieve the discretization error limit $\delta_{limit}$ is $\sqrt{2\sigma^2/\delta_{limit}}$.

Proof. From the rate distortion theorem for a Gaussian source [21], the rate distortion function is $R = \frac{1}{2}\log\frac{\sigma^2}{\delta_{limit}}$. So the number of discretization levels required is at least $\sqrt{2\sigma^2/\delta_{limit}}$ for all possible distributions.

The minimum number of discretization levels in Theorem 1 cannot be used directly in our scheme, since it can only be achieved by very long time series according to the rate distortion theorem. In our scheme, the number of discretization levels is chosen heuristically: $N = r\,\frac{\sigma^2}{\delta_{limit}}$. The scaling factor $r$ is larger than 1. We analyze the scaling factor in Section 6.

Calculation of the Minimum Possible Privacy Leakage: Finding the minimum possible privacy leakage can be formulated as an optimization problem. The cost function is

$$\min_{\{[p_1, p_2, \cdots, p_{N-1}],\,[v_1, v_2, \cdots, v_N]\}} \sum_{i=1}^{k} priv_{leak}(s_i(t), d_i(t)) \qquad (4)$$

where $k$ is the number of time series to be sent to the data miner. The optimization is over all possible discretization parameters. The constraint of the optimization problem is

$$\frac{1}{k}\sum_{i=1}^{k} (s_i(t) - d_i(t))^2 \le \delta_{limit}. \qquad (5)$$
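The provider-side computation of the low-trust scheme can be sketched as follows. This is our own illustration: it replaces the simulated annealing used later in the paper with a plain random search, takes interval mid-points as the codebook, and reads Eq. (5) as a mean squared error averaged over time and over series; none of these choices are prescribed by the paper.

```python
import numpy as np

def num_levels(max_variance, delta_limit, r=20):
    # heuristic from the text: N = r * sigma^2 / delta_limit, with r > 1
    return max(2, int(np.ceil(r * max_variance / delta_limit)))

def privacy_leakage(s, d):
    # Eq. (3): |corr(s(t), d(t))|; a constant series carries no correlation signal
    if np.std(s) == 0 or np.std(d) == 0:
        return 0.0
    return abs(float(np.corrcoef(s, d)[0, 1]))

def minimize_leakage(series, N, delta_limit, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    lo = min(float(np.min(s)) for s in series)
    hi = max(float(np.max(s)) for s in series)
    best = None
    for _ in range(iters):
        cuts = np.sort(rng.uniform(lo, hi, N - 1))
        edges = np.concatenate(([lo], cuts, [hi]))
        codebook = (edges[:-1] + edges[1:]) / 2.0            # interval mid-points
        ds = [codebook[np.searchsorted(cuts, np.asarray(s), side="right")] for s in series]
        mse = np.mean([np.mean((np.asarray(s) - d) ** 2) for s, d in zip(series, ds)])
        if mse > delta_limit:                                # constraint, Eq. (5)
            continue
        leak = sum(privacy_leakage(np.asarray(s), d) for s, d in zip(series, ds))
        if best is None or leak < best[0]:                   # cost, Eq. (4)
            best = (leak, cuts, codebook)
    return best
```

`minimize_leakage` returns None when no sampled parameter set satisfies the error constraint, mirroring the case in which the provider declines to share data.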
Algorithm 2. Scheme for High-Trust Threat Model

Data Miner Side:
1. Broadcast a request message containing the number of discretization levels (N) and the local classification accuracy threshold (Tacc);
2. Wait for data messages from data providers;
3. Upon receiving a data message, forward the message to the model construction component as shown in Figure 1.

Data Provider Side:
1. Upon receiving a request message from the data miner, optimize cost function (4) under the constraint of the local classification accuracy threshold (Tacc);
2. if the optimization problem is not solvable
3.   Goto 1;
4. else
5.   Discretize the original time series with the optimal discretization parameters found in Step 1;
6.   Send a data message containing the codebook and discretization levels of each time series to the data miner.
Scheme for High-Trust Threat Model. The communication protocol of the scheme for high-trust threat model is shown in Algorithm 2. In this scheme, a local classification accuracy threshold (Tacc) is broadcast from the data miner to data providers. Data providers optimize discretization of time series in terms of privacy leakage. The
constraint to the optimization problem is the local classification accuracy threshold. In other words, data providers are required to guarantee that mining on the discretized local time series can achieve classification accuracy Tacc. Formally, the optimization problem can be defined as follows: (a) the cost function is as shown in (4); (b) the constraint is Pacc ≥ Tacc, where Pacc denotes the classification accuracy on the discretized local time series.

Scheme for Full-Trust Threat Model. The communication protocol for this scheme is simple: first the data miner broadcasts a request message to the data providers. The data providers respond to the request message with their original time series, since they fully trust the data miner. After receiving the original time series from the data providers, the data miner discretizes all time series. The discretization parameters are chosen to optimize the cost function (4). The constraint on local classification accuracy in the scheme for the high-trust model changes to a constraint on global classification accuracy, i.e., the classification accuracy on all discretized time series. The data miner then builds the classifier from the discretized time series.

Comparing the scheme for the high-trust threat model and this scheme, the major differences are: (a) in the scheme for the high-trust threat model, discretization is done by the data providers, while in this scheme it is done by the data miner; (b) in the scheme for the high-trust threat model, optimization is run on the time series owned by each data provider, while in this scheme optimization is run on all the time series.
6 Experiment Results

In this section, we investigate the performance of the proposed schemes. We chose 15 time series data sets¹ from the UCR Time Series Classification/Clustering data collection [1]. Each data set has a training set and a test set. These data sets are quite different in terms of size, length of time series, number of classes, and one-nearest-neighbor (1-nn) Euclidean distance. Unless stated otherwise, our experiments are conducted using all 15 data sets and the results are averaged across all data sets. In our experiments, we simulate three data providers². The time series in each class are randomly and evenly assigned to each data provider. Each experiment result is averaged over 10 different random combinations. For fair comparison, the number of discretization levels is set to 16 for all the experiment results. In our experiments, the optimization problems in the schemes are solved by the simulated annealing method [22].

6.1 Overall Performance

Table 1 shows the performance of all three schemes. In the table, NA means not applicable, i.e., the optimization problem has no solution.
¹ Five data sets are not included in our experiments because they do not contain enough time series of one or more classes.
² We assume three data providers so that all selected data sets have at least one time series in each class for each data provider. We obtain similar results in experiments on larger data sets with more data providers.
Table 1. Scheme Performance for Low-Trust, High-Trust, and Full-Trust Models
(columns give classification accuracy / privacy leakage for each scheme)

Dataset             Original   Low-Trust       High-Trust      Full-Trust
                    Accuracy   Acc.    Leak.   Acc.    Leak.   Acc.    Leak.
Beef                0.53       0.36    0.32    0.47    0.55    0.43    0.12
Adiac               0.61       NA      NA      0.31    0.96    NA      NA
CBF                 0.85       0.37    0.20    NA      NA      0.69    0.39
Coffee              0.75       0.63    0.94    0.56    0.11    0.61    0.06
ECG200              0.88       0.52    0.11    NA      NA      0.71    0.17
FaceAll             0.71       0.42    0.54    0.38    0.56    0.57    0.39
Fish                0.78       NA      NA      NA      NA      0.63    0.48
Gun Point           0.91       0.56    0.31    0.76    0.97    0.74    0.33
Lighting2           0.75       0.45    0.05    0.45    0.05    0.61    0.19
Lighting7           0.58       0.23    0.21    0.16    0.33    0.47    0.46
OliverOil           0.87       0.41    0.15    0.70    0.96    0.70    0.25
OSULeaf             0.52       NA      NA      NA      NA      0.41    0.52
SwedishLeaf         0.79       NA      NA      NA      NA      0.63    0.63
Synthetic Control   0.88       0.55    0.44    0.81    0.94    0.71    0.47
Trace               0.76       0.25    0.04    0.66    0.93    0.61    0.11
Average             0.75       0.41    0.22    0.58    0.73    0.61    0.34
For the scheme proposed for the low-trust model, optimization solutions may not be found when the discretization error limit δlimit and the number of discretization levels N are too small, so that it is impossible to discretize the time series within the given error limit. For the schemes proposed for the high-trust and full-trust models, optimization solutions may not be found when the classification accuracy threshold Tacc is too high, so that it is impossible to achieve the required classification accuracy with a certain number of discretization levels. Data sets with NA results for any of these three models are excluded from the average calculation in Table 1. The experiment results in the table are obtained with the following parameters⁴:
– Scheme for Low-Trust Model: The parameter δlimit is calculated with the scaling factor r set to 20, as described in the previous section. The parameter privlimit is set to one, meaning optimization is over all possible discretization parameters. We set privlimit to one because we would like to investigate the tradeoff between privacy leakage and classification accuracy; in practice, data providers can choose their own privlimit to determine whether to share data with the data miner.
– Scheme for High-Trust Model: Since different data sets have different original classification accuracies, we set the local classification accuracy threshold Tacc to 80 percent of the original classification accuracy.
– Scheme for Full-Trust Model: As in the high-trust model, the classification accuracy threshold is set to 80 percent of the original classification accuracy.
⁴ Further experiments on these parameters are described in the rest of this section.
From Table 1, we can make the following observations:
– The proposed schemes can preserve privacy at the cost of a reduction in classification accuracy. For several data sets, the privacy leakage can reach less than 0.1. Performance in terms of privacy leakage and classification accuracy can vary a lot between data sets. For example, for the Beef data set, at the cost of a 6 percent reduction in classification accuracy, it is possible to achieve 0.55 privacy leakage using the scheme for the high-trust model. For the Gun Point data set, 0.97 privacy leakage comes at the cost of a 25 percent reduction in classification accuracy. The difference is mainly due to the different amount of redundancy in different data sets: if redundancy is high in a data set, it is possible to have less reduction in the distance between different clusters after discretization, i.e., less reduction in classification accuracy.
– The scheme for the low-trust model can protect privacy when the error limit is large. On average, privacy leakage is 0.22 for the scheme proposed for the low-trust model, but its average classification accuracy is 0.41, about 34 percent less than the original classification accuracy.
– For the scheme proposed for the high-trust model, privacy leakage can reach 0.73 at the cost of a 17 percent reduction in classification accuracy.
– In general, the scheme for the full-trust model performs better than the scheme for the high-trust model in terms of both privacy leakage and classification accuracy: the scheme for the full-trust model can achieve 0.34 privacy leakage at the cost of a 14 percent reduction in classification accuracy. The performance difference between these two schemes arises because the scheme for the full-trust model optimizes discretization over all time series, while the scheme for the high-trust model only optimizes discretization over local time series.
– The scheme for the high-trust model cannot guarantee that the classification accuracy is 80 percent of the original classification accuracy. For example, it obtains a classification accuracy of 0.31 for the Adiac data set, almost half of the original classification accuracy. This is because the local optimization performed by the data providers in this scheme can only guarantee the classification accuracy on the discretized local time series. The scheme for the full-trust model can guarantee the classification accuracy to be at least 80 percent of the original classification accuracy because of its global optimization.

6.2 Effect of Scaling Factor r

In this set of experiments, we study the effect of the scaling factor r for the scheme proposed for the low-trust model. We use the Coffee data set from the UCR time series database [1]; similar results are observed on the other data sets. We fix the number of discretization levels to 16, so a change in the scaling factor r reflects a change of the parameter δlimit. Experiment results are shown in Figure 2. From Figure 2(a), we observe that privacy leakage decreases when the scaling factor r increases. A larger scaling factor means a larger error limit δlimit, which in turn means a larger feasible range for the min operator in the cost function (4). So the minimal information leakage decreases when the scaling factor r increases. Figure 2(b) shows that the classification accuracy fluctuates between 0.63 and 0.78. The fluctuation is because optimization in the scheme for
Fig. 2. Effect of Scaling Factor (r): (a) privacy leakage; (b) classification accuracy
Fig. 3. Effect of Classification Accuracy Threshold Tacc: (a) privacy leakage; (b) classification accuracy. Both panels compare the scheme for the high-trust model with the scheme for the full-trust model.
Preserving Privacy in Time Series Data Classification by Discretization
65
From Figure 3, we can make following observations: – In Figure 3(a), the privacy leakage for both schemes increases as the classification accuracy threshold increases. It is because a larger classification accuracy threshold means tighter constraints on the optimization in both schemes for high-trust and full-trust models. So less privacy leakage can be achieved when the classification accuracy threshold is small. – We can also observe that for coffee data set, the privacy leakage approaches zero when the classification accuracy is 40 percent less than original classification accuracy. The cost of almost perfect privacy preservation is about 20 percent reduction in classification accuracy. – In terms of privacy leakage, the scheme for full-trust model outperforms the scheme for high-trust model. Simply because optimization in scheme for full-trust model is over time series contributed by all data providers and optimization in scheme for high-trust model is over local time series. – The scheme for high-trust model can guarantee overall classification accuracy is above classification accuracy threshold. For the scheme proposed for full-trust model, the overall classification accuracy can be lower than local classification accuracy threshold. For example, when local classification accuracy threshold is 0.65, the schemes for high-trust model and full-trust model have overall classification accuracy of 0.56 and 0.67 respectively. It is caused by the difference between the global optimization in scheme for full-trust model and the local optimization in scheme for high-trust model.
7 Discussion We focus on privacy leakage and classification accuracy in our investigation of discretization-based schemes to preserve privacy in time series data mining. Another obvious benefit of discretization-based schemes is data compression. We use 16 discretization levels in our experiments to discretize time series data. In other words, four bits are enough to represent each value in times series. Although the proposed schemes need extra bits to send codebooks, the schemes can save a large amount of bits in data storage and data transmission in comparison with mining with original time series data, especially for long time series. The scheme for low-trust model has the parameter privlimit , the maximal acceptable privacy leakage for data providers. This parameter can also be used in the scheme for high-trust model so that data providers can determine whether to share discretized time series based on this parameter.
8 Conclusion In this paper, we propose discretization-based schemes to preserve privacy in time series data mining. Based on trust relationship between the data miner and data providers, we defined three threat models. We propose three different schemes for these three threat models. The proposed schemes are extensively evaluated against public-available time
66
Y. Zhu, Y. Fu, and H. Fu
series data sets [1]. Our experiments show that proposed schemes can preserve privacy with cost of reduction in mining accuracy. For most data sets, proposed schemes can achieve low privacy leakage with slight reduction in classification accuracy. We also studied effect of parameters of proposed schemes in this paper.
Acknowledgment We thank Professor Keogh for providing us the data sets used in our experiments.
References 1. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.A.: The ucr time series classification/clustering homepage (2006), http://www.cs.ucr.edu/˜eamonn/time_series_data/ 2. Agrawal, R., Srikant, R.: Privacy-preserving data mining. In: SIGMOD Conference, pp. 439– 450 (2000) 3. Evfimievski, A.V., Srikant, R., Agrawal, R., Gehrke, J.: Privacy preserving mining of association rules. In: SIGKDD, pp. 217–228 (2002) 4. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: KDD 2002: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 279–288. ACM, New York (2002) 5. Zhang, N., Zhao, W.: Privacy-preserving data mining systems. Computer 40(4), 52–58 (2007) 6. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: ICDE 2005: Proceedings of the 21st International Conference on Data Engineering, Washington, DC, USA, pp. 217–228. IEEE Computer Society, Los Alamitos (2005) 7. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Workload-aware anonymization. In: KDD 2006: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 277–286. ACM, New York (2006) 8. Fung, B.C.M., Wang, K.: Anonymizing classification data for privacy preservation. IEEE Trans. on Knowl. and Data Eng. 19(5), 711–725 (2007); Fellow-Philip S. Yu 9. Du, W., Zhan, Z.: Using randomized response techniques for privacy-preserving data mining. In: SIGKDD, pp. 505–510 (2003) 10. Huang, Z., Du, W., Chen, B.: Deriving private information from randomized data. In: SIGMOD Conference, pp. 37–48 (2005) 11. Zhu, Y., Liu, L.: Optimal randomization for privacy preserving data mining. In: SIGKDD, pp. 761–766 (2004) 12. Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Bellare, M. (ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 36–54. Springer, Heidelberg (2000) 13. Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: SIGKDD, pp. 639–644 (2002) 14. Kantarcioglu, M., Clifton, C.: Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans. Knowl. Data Eng. 16(9) (2004) 15. Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: SIGKDD, pp. 206–215 (2003) 16. Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: SIGKDD, pp. 593–599 (2005)
Preserving Privacy in Time Series Data Classification by Discretization
67
17. Wright, R.N., Yang, Z.: Privacy-preserving bayesian network structure computation on distributed heterogeneous data. In: SIGKDD, pp. 713–718 (2004) 18. da Silva, J.C., Klusch, M.: Privacy-preserving discovery of frequent patterns in time series. In: Industrial Conference on Data Mining, pp. 318–328 (2007) 19. Papadimitriou, S., Li, F., Kollios, G., Yu, P.S.: Time series compressibility and privacy. In: VLDB 2007: Proceedings of the 33rd international conference on Very large data bases, pp. 459–470. VLDB Endowment (2007) 20. Zhu, Y., Fu, Y., Fu, H.: On privacy in time series data mining. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS, vol. 5012, pp. 479–493. Springer, Heidelberg (2008) 21. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley-Interscience, New York (1991) 22. Ingber, L.: Adaptive simulated annealing (asa). Technical report, Pasadena, CA (1993)
Using Resampling Techniques for Better Quality Discretization Taimur Qureshi and Djamel A. Zighed University of Lyon 2 Laboratory ERIC - 5, Avenue Pierre Mendes 69676 Bron Cedex - France {taimur.qureshi,abdelkader.zighed}@eric.univ-lyon2.fr
Abstract. Many supervised induction algorithms require discrete data, however real data often comes in both discrete and continuous formats. Quality discretization of continuous attributes is an important problem that has effects on accuracy, complexity, variance and understandability of the induction model. Usually, discretization and other types of statistical processes are applied to subsets of the population as the entire population is practically inaccessible. For this reason we argue that the discretization performed on a sample of the population is only an estimate of the entire population. Most of the existing discretization methods, partition the attribute range into two or several intervals using a single or a set of cut points. In this paper, we introduce two variants of a resampling technique (such as bootstrap) to generate a set of candidate discretization points and thus, improving the discretization quality by providing a better estimation towards the entire population. Thus, the goal of this paper is to observe whether this type of resampling can lead to better quality discretization points, which opens up a new paradigm to construction of soft decision trees. Keywords: Bootstrap, discretization, resampling.
1
Introduction
Discretization is a general purpose preprocessing method that can be used for data exploration or data preparation in data mining. While they are critical in the case of decision tree methods, discretization methods can also be used for bayesian classifiers as in [14] and others. Many real-world classification algorithms are hard to solve unless the continuous attributes are discretized and the choice of discretization technique has important consequences on the induction model used. In the top-down induction of decision trees family, a simple unsupervised discretization procedure divides the range of a continuous variable into equalwidth intervals or equal-frequency intervals. While, supervised methods use information quality or statistical quality based measures to determine the interval P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 68–81, 2009. c Springer-Verlag Berlin Heidelberg 2009
boundary points. Yang et al. [12] and Liu et al. [9] classified discretization methods from different viewpoints: supervised vs. unsupervised, static vs. dynamic, global vs. local, top-down vs. bottom-up, parameterized vs. unparameterized, disjoint vs. non-disjoint, univariate vs. multivariate, primary vs. composite and split vs. merge. Among these, top-down methods such as FUSBIN and MDLPC [5,6] start with one interval and split intervals in the process of discretization, while bottom-up methods like FUSINTER [5] and Chi-Merge [4] start from a complete split of all the continuous values of the attribute and merge intervals in the process of discretization. Apart from these local discretization methods, global discretization methods [19], such as binning, produce a mesh over the entire n-dimensional continuous instance space, where each feature is partitioned into regions independently of the other attributes. In this article, we focus on local top-down and bottom-up strategies for providing comparisons in terms of quality and prediction rates [1]. Our goal is to find a way to produce better discretization points. Previously, various studies have been done to estimate the discretization points from learning samples taken from the population. Because entire populations are inaccessible, we usually try to estimate statistical processes such as discretization from samples rather than from the population. Notably, in [1], a set of learning samples is used to approximate the best discretization points of the whole population. They argue that the learning sample is just an approximation of the whole population, so the optimal solution built on a single sample set is not necessarily the global one. Taking this point into consideration, in this paper we try to provide a better estimate of the entire population. Our interpretation of the above problem leads us to use a resampling approach [2] to determine better distributions of the discretization points. In [18], the authors use various approaches including the bootstrap to decrease discretization variance as preprocessing for decision trees. In this paper, we focus on obtaining discretization points where each candidate point has a higher probability of being the "better" discretization point with respect to the whole population. By doing so, we attempt to improve the prediction accuracy of the discretization and to better estimate the discretization points of the entire population, thus treating the discretization problem in a statistical setting with new results. We use the ordinary bootstrap [11] as the resampling method in our approach, which tries to improve on the above mentioned problem. We argue that the recent increase in processing power of computers allows us to use extensive resampling analysis in order to find better estimates of the larger population. In this paper we focus on supervised discretization; however, unsupervised discretization can also be handled in the same way. In our approach, we estimate the discretization point distribution over an attribute Xi by repeatedly resampling and performing discretization on each bootstrap sample using any of the discretization approaches, thus creating a histogram density function of the obtained candidate discretization points. Then, we apply a moving average filter to this function with a window size ws and try to obtain distinct regions (peaks) in the distribution most likely to be the exact discretization points towards the
entire population. From this distribution we extract discretization points using two variant techniques explained later. The number of intervals of the discretization is determined by using a most-frequent-interval method. To conclude, we compare our approach with different top-down and bottom-up strategies (without resampling) in terms of discretization quality (defined later) and we see how close we get to an optimal solution (discretization points). This optimal solution is built by applying a dynamic programming based technique, known as Fisher's algorithm, on a finite and known data set. In section 2, we lay out the framework for discretization and define the various terminologies. We also discuss the various data sets used in our experiments. In section 3, we briefly describe the various existing top-down and bottom-up discretization strategies used in our paper. Then, in section 4, we give a detailed illustration of our proposed scheme, and in section 5, we show our results by applying the explained techniques to 10 benchmark data sets. We also compare different discretization strategies to ours, and at the end we conclude with observations, deductions and proposals for future work.
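As a concrete illustration of the two simple unsupervised procedures mentioned above (equal-width and equal-frequency intervals), here is a minimal sketch. It assumes NumPy is available; the function names are ours and not taken from the paper.

```python
import numpy as np

def equal_width_cuts(x, k):
    """Return k-1 interior cut points splitting the range of x into k equal-width intervals."""
    lo, hi = np.min(x), np.max(x)
    return np.linspace(lo, hi, k + 1)[1:-1]

def equal_frequency_cuts(x, k):
    """Return k-1 interior cut points so each interval holds roughly the same number of samples."""
    quantiles = np.linspace(0, 1, k + 1)[1:-1]
    return np.quantile(x, quantiles)

# Example: discretize one continuous attribute into 4 intervals.
x = np.random.randn(500)
print(equal_width_cuts(x, 4))
print(equal_frequency_cuts(x, 4))
```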
2 Terminologies and Definitions
2.1 Framework and Formulation
Let X(.) be an attribute taking values on the real set ℝ. For each example ω of a learning set Ω, X(ω) is the value taken by the attribute X(.) at ω. The attribute C(.) is called the endogenous variable or class and is usually symbolic; if an example belongs to a class c, we have C(ω) = c. We also suppose that C(ω) is known for all ω of the learning sample set Ω. Thus, we try to build a model, denoted by Φ, such that ideally we have C(.) = Φ(X1(.), ..., Xp(.)). The discretization of X(.) consists in splitting the domain Dx of the continuous attribute X(.) into k intervals Ij, j = 1, ..., k, with k ≥ 1. We denote Ij = [dj−1, dj], with the dj called the discretization points, which are determined by taking into account the particular attribute C(.). The purpose of the method is to build a model which can predict the class C(.) relative to X(.).
2.2 Resampling
We use the ordinary bootstrap method [11] for resampling, in which the learning and test sets overlap. A prediction rule is built on each bootstrap sample and tested on the original sample; averaging the misclassification rates across all bootstrap replications gives the ordinary bootstrap estimate.
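A minimal sketch of this ordinary bootstrap estimate follows. The classifier factory `make_classifier` with a `fit`/`predict` interface is a generic placeholder of ours, not the paper's implementation.

```python
import numpy as np

def ordinary_bootstrap_error(X, y, make_classifier, n_boot=100, seed=0):
    """Average misclassification rate of rules built on bootstrap samples, tested on the original sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # sample with replacement
        clf = make_classifier().fit(X[idx], y[idx])   # rule built on the bootstrap sample
        errors.append(np.mean(clf.predict(X) != y))   # tested on the original sample
    return float(np.mean(errors))
```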
2.3 Smoothing
We use a moving average filter as the smoothing technique, which replaces each value by the unweighted mean of the previous n data points.
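The smoothing step can be sketched as an unweighted moving average over a window of size ws; this minimal NumPy version is our own, and the border handling ("same" padding) is a choice the paper does not specify.

```python
import numpy as np

def moving_average(freqs, ws):
    """Unweighted moving average of a frequency histogram with window size ws."""
    kernel = np.ones(ws) / ws
    return np.convolve(freqs, kernel, mode="same")  # 'same' keeps the original length
```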
2.4 Quality of Discretization
We use three factors to analyze the quality of the obtained discretization.

Prediction Accuracy. The goal of the discretization is to make the class C(.) predictable by an attribute X(.). To measure this prediction rate we define a notion of prediction accuracy of the achieved discretization as follows. The discretization of the attribute Xj from a sample Ωs provides k intervals denoted Iij, i = 1, ..., k. For each ω taken from the test sample Ωt we denote by Iij the interval to which it belongs after discretization of the sample Ωs. The point ω will be labeled Ĉ(ω) = c* if the majority of the points in Ωt that are in Iij have the class c*. This corresponds to a Bayesian decision rule with symmetrical costs and prior probabilities of the classes estimated by the proportion of individuals belonging to each class in Ωt. We measure the quality of the discretization by the rate of good predictions:

τj = card{ω ∈ Ωt : Ĉ(ω) = C(ω)} / card{Ωt}

We denote by τj the good prediction rate resulting from the discretization of Xj obtained by applying a method on the sample Ωs.

Complexity. In measuring the complexity of a discretization we take into account the number of intervals Inumber obtained in the resulting discretization. A large number of intervals increases the complexity of the induction algorithms that use the discretized input, and is also likely to increase the discretization bias and variance. Higher discretization variance affects the quality of discretization even if the prediction accuracy is high. This property is explained by Yang et al [12].

Robustness. We introduce a concept of robustness as previously used in [14]. This is equal to the prediction accuracy on the training sample divided by the prediction accuracy on the whole population (which is known in our experiments). It measures the degree to which the population is accurately estimated from a small training sample.
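A minimal sketch of the prediction-accuracy measure τ described above (the function and variable names are ours): each test point is labelled with the majority class of the interval it falls into, and τ is the fraction of correctly labelled test points.

```python
import numpy as np

def prediction_accuracy(cuts, x_test, y_test):
    """Majority-class prediction rate of a discretization given by interior cut points."""
    bins_test = np.digitize(x_test, cuts)
    hits = 0
    for b in np.unique(bins_test):
        labels = y_test[bins_test == b]
        counts = np.bincount(labels)   # assumes integer class labels 0..C-1
        hits += counts.max()           # points matching the interval's majority class
    return hits / len(y_test)
```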
3 Evaluated Discretization Techniques
We have evaluated 7 different discretization methods. Of these, three use a top-down technique (MDLPC, Fusbin, Balanced-Gain), three use a bottom-up technique (Fusinter, ChiMerge, MODL), and one is based on an optimal algorithm (Fisher). A brief description of these methods is given below, and two of them are described in detail. The MDLPC method is a greedy top-down split method whose evaluation criterion is based on the Minimum Description Length Principle [16]. At each
step of the algorithm, MDLPC evaluates two hypotheses (to cut or not to cut the interval) and chooses the hypothesis whose total encoding cost (model plus exceptions) is the lowest. The Balanced-Gain method exploits a criterion similar to the Gain-Ratio criterion [10]: it divides the entropy-based Information Gain criterion by the log of the arity of the partition in order to penalize excessive multi-splits. Fusbin is a top-down method, whereas the Fusinter method is a greedy bottom-up method. Both exploit an uncertainty measure sensitive to the sample size; their criterion employs a quadratic entropy term to evaluate the information in the intervals and is regularized by a second term in inverse proportion to the interval frequencies. The Chi-Merge [4] method is a greedy bottom-up merge method that locally exploits the chi-square criterion to decide whether two adjacent intervals are similar enough to be merged. The MODL [14] algorithm is based on a Bayesian approach; it defines a criterion which is minimal for the Bayes-optimal discretization. Fisher's algorithm is a dynamic programming algorithm that tries to find the optimal partition given a criterion, introduced by W. Fisher [21]. The criterion used for Fisher's algorithm in this paper is based on Fusinter's uncertainty principle, which is presented in detail by Zighed et al [1].
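To make the top-down supervised family concrete, the following sketch evaluates the class-entropy gain of a single candidate cut point, the core quantity that greedy split methods such as MDLPC assess before applying their own stopping criterion. This is only the entropy-gain core, not the full MDLPC cost; the function names are ours.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a class-label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def split_gain(x, y, cut):
    """Information gain of splitting attribute x at the given cut point."""
    left, right = y[x <= cut], y[x > cut]
    n = len(y)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - cond

def best_cut(x, y):
    """Best boundary cut point between consecutive sorted values, by information gain."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    return max(candidates, key=lambda c: split_gain(x, y, c))
```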
4 Proposed Scheme Using Resampling
4.1 Assumptions
1. In each interval, the distribution of the class values is defined by the frequencies of the class values in this interval.
2. The attributes are conditionally independent of each other given the class.
3. Thus, we discretize one attribute at a time with the class (in two dimensions).
4.2 Our Approach
Earlier, we argued that the learning sample is just an approximation of the whole population, so the optimal discretization solution built on a single sample set is not necessarily the global one. Thus, our approach is based on finding a better discretization estimate with respect to the entire population in terms of discretization quality (as discussed above), using a sample selected randomly from that population and then resampling it. We use an ordinary bootstrap based resampling technique to achieve this objective. The two solution variants are explained below.

Resampling-based Smoothed Discretization Approach (RSD). This technique is carried out in three phases. In the first phase we estimate a discretization point distribution over an attribute Xi by resampling n times and performing discretization on each bootstrap sample Ωbs using the entropy-based MDLPC method, thus creating a histogram density function of the resulting candidate points as shown in fig 1a. Note that we can use any discretization method in place of MDLPC, but in our experiments we happen to use
Fig. 1. a) Discretization point frequency distribution of variable 1 of the waveform data where the x-axis represents the attribute’s continuous values and the y-axis represents their frequencies. b). The distribution of fig 1a after smoothing with a ws = 7 and the resulting peaks.
this technique. During the discretization performed on each bootstrap sample, we record the number of intervals for each sample and build a histogram density function of the number of intervals obtained over the n bootstraps. From this interval distribution we then select the most probable number of intervals, denoted Ibest. This is illustrated in figure 2. In the second phase we smooth the discretization point distribution function by applying a moving average filter with a window size ws. The resulting smoothed curve is shown in fig 1b. We can see distinct regions (or peaks) where the probability of the candidate points being the exact discretization points is higher. As shown in fig 1b, the straight line that runs parallel to the x-axis is called the threshold parameter T, which is set as the median of the obtained frequency values. Taking this threshold T into account, we define peaks Preg as the regions which lie above the threshold line. The reason for defining such a threshold is to retain only the regions of higher probability (the most frequent ones). The window size ws is determined as follows: we start by setting ws = 3 and we
Fig. 2. Number of intervals (interval frequencies) distributed in an example data
calculate the number of peaks Preg obtained. We continue to increment ws until the number of peaks Preg obtained approaches the most probable interval number Ibest defined above. In the third phase, we extract candidate discretization points from this smoothed function. For this phase we have two variations, sketched in the code example below: (1) applying a voting procedure to the peaks (fig 1b), or (2) averaging the peaks.
1. From each peak Preg we select, or vote for, the most probable (most frequent) point, as shown in fig 1b.
2. For each peak Preg we take the average of all the points that lie in the peak region and select the point on the x-axis that corresponds to that average.
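A minimal sketch of this third phase (the peak segmentation by the median threshold T follows the description above; details such as tie handling and the peak-closing sentinel are our own choices): the "vote" variant corresponds to RSD1 and the "average" variant to RSD2.

```python
import numpy as np

def extract_points(values, smoothed, variant="vote"):
    """Extract one candidate discretization point per peak region of the smoothed histogram.

    values   -- candidate cut-point positions (x-axis of the histogram)
    smoothed -- their smoothed frequencies (same length)
    """
    T = np.median(smoothed)                     # threshold parameter T
    above = smoothed > T
    points, start = [], None
    for i, flag in enumerate(np.append(above, False)):   # sentinel closes a trailing peak
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            region = slice(start, i)
            if variant == "vote":               # RSD1: most frequent point in the peak
                points.append(values[region][np.argmax(smoothed[region])])
            else:                               # RSD2: average of the points in the peak
                points.append(float(np.mean(values[region])))
            start = None
    return points
```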
5 Results - Analysis and Comparisons
In our experiments we apply our approaches RSD1 and RSD2 using the MDLPC discretization obtained from n = 100 bootstrap samples and try to build a near-optimal solution. Then, we compare this solution to the original MDLPC and to the other six discretization methods defined above. The MODL, MDLPC and BalancedGain methods have an automatic stopping rule and do not require any parameter setting. For the Fusinter criterion, we use the regularization parameters recommended in [17]. For the ChiMerge method, the significance level is set to 0.95 for the chi-square test threshold. For Fisher's optimal algorithm we choose the criterion used in FusInter [1]. In order to find the added value of our resampling-based discretization techniques and compare them with the above mentioned top-down and bottom-up strategies, we measure the quality of discretization and compare the strategies on 10 different datasets taken from the U.C. Irvine repository [15], shown in figure 3. These 10 data sets have 85 continuous attributes, denoted (X1(ω), ..., X85(ω)), and a label C(ω). Note that the waveform dataset contains significant noise compared to the others. We used a 10-fold cross validation to measure
Fig. 3. Data sets and their summary
Fig. 4. Comparison of accuracy between the methods
the geometric mean of the number of intervals μInumber, the accuracy rate, and the robustness from the discretization of all 85 variables using all 7 methods discussed. Then we calculated RSD1 and RSD2 as described above. The resulting measures and their results are explained in the following subsections along with figures 4 to 8.

Prediction Accuracy. Figure 4 shows the cumulative accuracy rates of all the discretization methods plotted against the 85 continuous attributes. It shows
Fig. 5. Bi-criteria evaluation of the methods for the accuracy and number of intervals, using datasets geometric means
the curve of RSD1 and RSD2 lying above all the curves of the other methods, including MDLPC, from which our methods are originally built (by resampling and obtaining selected discretization points) as explained earlier. As we have plotted the curves using all the attributes, the graphical representation might not be very clear; we therefore summarize this comparison in the table of fig 7. We note, however, that for attributes 5 to 26 there is a lot of variation between the accuracy curves of the methods. These attributes belong to the waveform dataset and contain significant noise, causing most of the discretization methods to deteriorate. Here, RSD1 and RSD2 perform significantly better than the others, whereas for the other attributes the difference is at times small.

Bi-criteria Evaluations. In multi-criteria analysis, a solution dominates (or is non-inferior to) another one if it is better for all criteria. A solution that cannot be dominated is Pareto optimal: any improvement of one of the criteria causes a deterioration of another criterion. The Pareto surface is the set of all the Pareto optimal solutions. In order to analyze both the accuracy and robustness results, we report the dataset geometric means on a two-criteria plane in Figure 5a, with the accuracy on the x-coordinate and the robustness on the y-coordinate. Similarly, we report the accuracy and the number of intervals in Figure 5b. Each point in these figures represents the summary of all the experiments. The multi-criteria figures are thus reliable and informative: they allow us to clearly differentiate the behavior of almost all the methods. Accuracy is certainly the most important parameter to distinguish a discretization method, so it is paired with the other criteria in both the analyses of fig 5 and 6.
Fig. 6. Bi-criteria evaluation of the methods for the accuracy and robustness, using datasets geometric means
Fig. 7. Comparison of the Robustness between all the methods
Complexity - Number of Intervals. In this section we compare the number of intervals generated by each method. Here, we point out that the numbers of intervals produced by RSD1 and RSD2 are the same, so we do not show them separately in fig 5. As illustrated in fig 5, RSD outperforms the other methods in terms of accuracy and complexity; the number of intervals is significantly better than for MDLPC. Chimerge and Balancedgain are outperformed by Fusbin, Fusinter, MDLPC and MODL. Chimerge shows a relatively bad result.
Fig. 8. Comparison of the critical area between all the methods
Robustness. The robustness is an interesting criterion that allows us to estimate whether the performance on the training sample is a good prediction of the performance on the whole data set (population). The higher the robustness, the more accurate a ranking of attributes based on their sample accuracy will be. This can be critical in the case of classifiers such as decision trees that incorporate an attribute selection method to build the next node of the tree. We calculated a higher mean robustness for RSD in comparison to the other methods, with up to 6 to 7 percent better performance in comparison with Balanced-gain and Chi-merge, and 2 percent better performance in the case of MDLPC, MODL and FUSINTER, as shown in fig 7. This property reflects the quality of our method in better estimating the discretization points of the entire data set from a smaller data sample. Fig 6 shows that RSD1 and RSD2 clearly outperform all the other methods in accuracy and robustness. MDLPC and MODL are very close to each other, with Fusinter not far behind. Balanced-gain has the worst overall robustness, but Chimerge has the worst accuracy.

Comparison Summary. Instead of comparing the mean accuracy of all the methods, we measure the critical area t* of the difference of each method. The methods are compared two by two according to the following statistical procedure. Let u and v be two methods to compare. We form the difference Γuv between the rates of well ordered elements of the methods u and v. This difference is a random variable which is roughly normal with parameters (μ, σ). We conclude that u is better than v if μ is significantly superior to 0. We have n = 85 * 10 = 850 observations. The estimated mean value μuv and mean standard deviation σuv are:

μuv = (1/850) Σ_{j,s} γjs^{uv},  where γjs^{uv} = γjs^{u} − γjs^{v}

σuv = sqrt( (1/850) Σ_{j,s} (γjs^{uv} − μuv)² ),  and t* is:  t* = μuv / (σuv / √n) > t1−α
with t1−α the critical value at the rate α of a Student's law with (n−1) degrees of freedom. Since n is large, we have for α = 0.05, t1−α/2 = 1.96. The computed t* results are reported in figure 8. Positive values of t* indicate that the method in the row is better than the method in the column. Aside from the Chi-Merge method, whose results are relatively bad, all the other methods show relatively small differences. However, RSD1 and RSD2 report significantly better results and draw much closer to Fisher's optimal results.

Time Complexity. In terms of time complexity, among these methods MDLPC seemed to be the best, with a much lower time complexity than Fisher's optimal algorithm, which is the most computationally intensive. The time complexity of RSD1 depends mainly on the number of bootstrap samples generated and discretized. The more bootstrap samples, the better the discretization quality. This is a trade-off between time complexity and quality, but with vast improvements in computing speeds, we argue that quality can be the more valuable commodity.

Discretization Variance. By building a discretization point distribution by bootstrap and using techniques such as smoothing and averaging, we try to lower the discretization variance that is evident in the distribution of the discretization points. This significantly affects classifiers such as decision trees. In addition, this work has contributed to a fuzzy or soft discretization [20] technique which tries to address this problem of high discretization variance by smoothing the edges of the discretization boundaries.
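The pairwise comparison statistic t* described earlier in this section can be sketched as follows. This is a minimal version in which the per-attribute, per-fold accuracy arrays are assumed to be available as NumPy arrays; the function name is ours.

```python
import numpy as np

def critical_t(acc_u, acc_v):
    """Paired statistic t* for comparing two discretization methods.

    acc_u, acc_v -- per-attribute, per-fold accuracy rates of methods u and v
                    (same shape, e.g. 85 attributes x 10 folds, n = 850 observations).
    """
    diff = np.asarray(acc_u, dtype=float) - np.asarray(acc_v, dtype=float)
    n = diff.size
    mu = diff.mean()
    sigma = diff.std()
    return mu / (sigma / np.sqrt(n))   # compare against t_{1-alpha} (1.96 for alpha = 0.05)
```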
6 Conclusion
The learning sample is an approximation of the whole population, so the optimal discretization built on a single sample set is not necessarily the globally optimal one. Our resampling-based approaches tend to give a better discretization estimate in terms of achieving better discretization quality. Applying our schemes significantly improves the discretization quality and thus brings us nearer to a globally optimal solution. Except for Chi-Merge and Balanced-gain, the other methods show small variations in terms of accuracy. MDLPC performs the best in terms of number of intervals and time complexity. RSD1 and RSD2 are computation intensive but seem to provide much better discretization quality and also perform well for noisy data. As future work, we shall apply this discretization approach in the context of decision trees, to see whether it improves the global performance or not. We are also aiming to apply this approach to obtain potential discretization points in the context of fuzzy or soft discretization in decision trees. At the same time, carrying out this approach requires answering some other questions, such as the optimal number of bootstrap samples, the optimal window size and the time complexity.
References 1. Zighed, D.A., Rabas´eda, S., Rakotomalala, R.: Discretization Methods in Supervised Learning. Encyclopedia of Computer Science and Technology 40, 35–45 (1998) 2. Wehenkel, L.: An Information Quality Based Decision Tree Pruning Method. In: Valverde, L., Bouchon-Meunier, B., Yager, R.R. (eds.) IPMU 1992. LNCS, vol. 682. Springer, Heidelberg (1993) 3. Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning 8, 87–102 (1992) 4. Kerber, R.: Discretization of Numeric Attributes. In: Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 123–128. MIT Press, Cambridge (1992) 5. Zighed, D.A., Rakotomalala, R., Rabas´eda, S.: Discretization Method for Continuous Attributes in Induction Graphs. In: Proceeding of the 13th European Meetings on Cybernetics and System Research, pp. 997–1002 (1996) 6. Fayyad, U.M., Irani, K.: Multi-interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027. Morgan Kaufmann, San Mateo (1993) 7. Zighed, D.A., Rickotomalala, R.: A Method for Non Arborescent Induction Graphs. Technical Report, Laboratory ERIC, University of Lyon 2 (1996) 8. Ching, J.Y., Wong, A.K.C., Chan, K.C.C.: Class-dependent discretization for inductive learning from continuous and mixed mode data. IEEE Trans. on Pattern Analysis and Machine Intelligence 17(7), 641–651 (1995) 9. Liu, H., Hussain, F., Tan, C.L., Dash, M.: Discretization: An enabling technique. Data Mining and Knowledge Discovery 6(4), 393–423 (2002) 10. Quinlan, J.R.: Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research 4, 77–90 (1996) 11. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman and Hall, Boca Raton (1998) 12. Yang, Y., Webb, G.I.: Discretization for naive-bayes learning: managing discretization bias and variance. Technical Report 2003/131, School of Computer Science and Software Engineering, Monash University (2003) 13. Hsu, C.N., Huang, H.J., Wong, T.T.: Why discretization works for naive Bayesian classifiers. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 309–406 (2000) 14. MODL: A Bayes optimal discretization method for continuous attributes. Journal of Machine Learning, 131–165 (2006) 15. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998), http://www.ics.uci.edu/~ mlearn/MLRepository.html 16. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978) 17. Zighed, D.A., Rabaseda, S., Rakotomalala, R.: Fusinter: a method for discretization of continuous attributes for supervised learning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(33), 307–326 (1998) 18. Geurts, P., Wehenkel, L.: Investigation and reduction of discretization variance in decision tree induction. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS, vol. 1810, pp. 162–170. Springer, Heidelberg (2000)
19. Chmielewski, M.R., Grzymala Busse, J.W.: Global discretization of continuous attributes as preprocessing for machine learning. In: Third International Workshop on Rough Sets and Soft Computing, pp. 294–301 (1994) 20. Peng, Y., Flach, P.: Soft Discretization to Enhance the Continuous Decision Tree Induction. In: Giraud-Carrier, C., Lavrac, N., Moyle, S. (eds.) Integrating Aspects of Data Mining, Decision Support and Meta-Learning, September 2001. ECML/PKDD 2001 workshop notes, pp. 109–118 (2001) 21. Fischer, W.D.: On grouping for maximum of homogeneity. Journal of the American Statistical Association 53, 789–798 (1958)
A Large Margin Classifier with Additional Features Xinwang Liu1, Jianping Yin1, En Zhu1, Guomin Zhang1, Yubin Zhan1, and Miaomiao Li2 1
School of Computer Science, National University of Defense Technology, Changsha, 410073, Hunan, China
[email protected] 2 College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650216, Yunnan, China
[email protected]
Abstract. We consider the problem of learning classifiers from samples which have additional features, some of which are absent due to noise or corruption of measurement. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then apply a standard classification algorithm over the completed data. In this paper, an algorithm which aims to maximize the margin of each sample in its own relevant subspace is proposed. We show how incomplete data can be classified directly, without completing any missing features, in a large-margin learning framework. Moreover, according to the theory of optimal kernel functions, we propose an optimal kernel function, which is a convex combination of a set of linear kernel functions, to measure the similarity between the additional features of two samples. Based on the geometric interpretation of the margin, we formulate an objective function to maximize the margin of each sample in its own relevant subspace. In this formulation, we make use of the structural parameters trained from existing features and optimize only the structural parameters trained from additional features. A two-step iterative procedure for solving the objective function is proposed. By avoiding the pre-processing phase in which the data are completed, our algorithm can offer considerable computational savings. We demonstrate our results on a large number of standard benchmarks from UCI; the results show that our algorithm can achieve better or comparable classification accuracy compared to existing algorithms. Keywords: Large Margin Framework, Incremental Missing Features Learning, Support Vector Machine, Kernel Method.
1 Introduction

The task of learning classifiers from samples which have additional features, some of which are absent due to measurement noise or corruption, arises widely in real-world applications. For example, in fingerprint recognition, new samples may contain many additional features which can characterize fingerprints from other aspects. Another example is the intrusion detection field, where many additional
features are introduced into new samples as the means of attack diversify. However, the set of additional features may vary among samples in real-world tasks, i.e., some of the additional features are missing. Common methods for classification with absent features assume that the features exist but their values are unknown. The approach usually taken is a two-step procedure known as imputation: first the values of the missing features are filled in during a preprocessing phase, and then a standard classification procedure is applied to the completed data [1-2]. Imputation makes sense when the features are known to exist while their values are absent due to noise, especially when the values of the features are missing at random (when the missing pattern is conditionally independent of the unobserved features given the observations), or missing completely at random (when it is independent of both observed and unobserved measurements). In common imputation practice, absent continuous features are often filled with zero, with the average over all of the samples, or by using the k nearest neighbors (kNN) of each sample to find a fill-in value for its missing features. Another imputation method builds probabilistic generative models of the features using maximum likelihood or algorithms such as expectation maximization (EM) to find the most probable completion [3]. Such model-based methods allow the designer to introduce prior knowledge about the distribution of features, and are extremely useful when such knowledge can be explicitly modeled [4]. These methods have been shown to work very well for missing-at-random (MAR) data settings, because they assume that the missing features are generated by the same model that generates the observed features. However, model-based approaches can be computationally expensive and time-consuming, and require significant prior knowledge about the domain. What is more, they may produce meaningless completions for non-existing features, which is likely to decrease classification performance. Motivated by reference [5], we propose a large-margin learning algorithm to handle data which have additional features, some of which are missing. In this algorithm, incomplete data can be classified directly, without completing any missing features, using a max-margin learning framework. Moreover, according to the theory of optimal kernel functions, we propose an optimal kernel function, which is a convex combination of a set of linear kernel functions, to measure the similarity between the additional features of two samples. We formulate an objective function, based on the geometric interpretation of the margin, which aims to maximize the margin of each sample in its own relevant subspace. In this formulation, we make use of the structural parameters trained from existing features and only optimize the structural parameters trained from additional features. A two-step iterative procedure is proposed for solving the objective function: the first step computes the Lagrange multipliers corresponding to the constraints, and the second step optimizes the weight corresponding to each feature. The two steps iterate until convergence to a given criterion. This paper is organized as follows. In the next section we present a max-margin formulation for incremental missing features.
Section 3 will firstly give a large-margin learning algorithm with incremental absent features, and then the algorithm with an optimal kernel version is proposed. In section 4 we present the experiments comparing our approach to existing ones. We conclude with a discussion in Section 5.
2 A Max-Margin Formulation for Incremental Missing Features

Given a training set of samples with input data and class labels, we can train these samples with a Support Vector Machine and obtain the structural parameters of the classifier together with the Lagrange multiplier corresponding to each constraint. As the tasks of detection, recognition and decision making evolve, new samples may contain additional features, and some of these additional features may be absent due to measurement noise or corruption. The traditional way to deal with this classification task is to discard the original samples and retrain a classifier on the new samples. Apparently, this treatment is time-consuming and it discards the information learned from the original samples. Motivated by reference [5], we propose an incremental large-margin learning algorithm to handle data which have additional features, some of which are missing. We formulate an objective function, based on the geometric interpretation of the margin, which aims to maximize the margin of each sample in its own relevant subspace. In this formulation, we make use of the structural parameters trained from existing features and optimize only the structural parameters trained from additional features. We formulate the proposed problem as follows.

Consider a training set of samples in which each sample consists of its existing features and a subset of additional features drawn from a full set of size m; each sample can therefore be viewed as embedded in its own relevant subspace. Each sample also has a binary class label. Importantly, since samples share features, the classifier we learn should have parameters that are consistent across samples, even if those samples do not lie in the same subspace. We address the problem of finding an optimal classifier in the max-margin framework. In the classical SVM method of reference [10], we learn the structural parameter w of a linear classifier by maximizing the margin, defined as 1/||w||. Motivated by the idea of incremental learning, we consider learning the structural parameters from existing and additional features separately: one part of the parameter vector is trained from the already existing features and the other from the additional, possibly absent, features, and the full structural parameter vector is their concatenation. Consider now learning such a classifier in the presence of samples which have additional features, some of which are absent. In this case, we can make use of the structural parameters trained from the already existing features and train only the parameters of the additional features in the max-margin framework.
To address this problem we treat the margin of each sample in its own relevant subspace. We define the sample margin for each sample as

(1)

where the weight vector appearing in (1) is obtained by taking the entries of the structural parameters that are relevant for the sample, namely those for which the sample has valid additional features, and its norm is taken over those entries only. We now consider our new geometric margin to be the minimum over all sample margins, and get a new optimization problem for the additional missing-features case:

(2)

For this optimization problem, since different margin terms are normalized by different norms in Equation (2), the denominator can no longer be taken out of the minimization. In addition, each of the terms is non-convex, which makes it difficult to solve in a direct way. Section 3 will discuss how the geometric margin of Equation (2) can be optimized.
3 A Large Margin Classifier with Additional Features

This section first gives a large margin classifier with additional features in the original space; we then propose a kernel version of this algorithm in which the samples are first mapped into a high-dimensional feature space and a kernel is adopted to substitute the inner product in this feature space. Finally, we propose a two-step iterative procedure for each of the two cases.

3.1 A Large Margin Classifier with Additional Features in the Original Space

Motivated by reference [5], we propose a two-step iterative algorithm in the original space for solving Equation (2). Firstly, we introduce a scaling coefficient and rewrite Equation (2) as
(3)
Secondly, the following invariance is used: for every solution, there exists a rescaled solution that achieves the same objective value and whose margin equals 1. This allows us to rewrite Equation (3) as a constrained optimization problem as follows.

(4)

Equation (4) is equivalent to the problem of minimizing the norm term under the same constraints. After introducing slack variables, Equation (4) therefore translates into
(5)
Equation (5) is not a quadratic problem as a whole. However, once the scaling coefficient is given, Equation (5) becomes a standard quadratic optimization problem with constraints, and the new scaling coefficient is then calculated using the resulting solution. To solve the quadratic program, we derive the dual problem for a given scaling coefficient.
(6)
where the inner product is taken only over features that are valid for both samples. Table 1 provides pseudo code for the two-step optimization algorithm.

3.2 A Large Margin Classifier with Additional Features in the Feature Space

In order to handle data which cannot be linearly classified, the data in the original space are mapped to a so-called feature space by an implicit nonlinear mapping function
Table 1. Pseudo code for the two-step iterative optimization algorithm in original space
Iterative Optimization Algorithm in Original Space
1. Initialization: initialize the scaling coefficient and the maximum number of iterations.
2. Iteration:
   2.1 Solve Equation (6) for the current scaling coefficient and get the optimal Lagrange multipliers.
   2.2 Use the resulting multipliers to update the scaling coefficient.
   2.3 If the update has converged or the maximum number of iterations is reached, stop; else go to 2.1.
and a hyperplane is constructed in this high-dimensional feature space. The inner product between two vectors in the feature space is computed by a kernel function. According to the theory of optimal kernel functions [6], we propose an optimal kernel function, which is a convex combination of a set of linear kernel functions, to substitute the inner product in Equation (6). The optimal kernel function used in this paper is given in Equation (7):

(7)

where the weights denote the contribution of each individual feature, the sum runs over all attributes, and the basic kernel function is applied to each feature separately. Substituting Equation (7) for the inner product in Equation (6), Equation (6) therefore translates into

(8)

For given feature weights, Equation (8) is a quadratic problem. The feature weights can be optimized by Equation (9):

(9)

Substituting into Equation (9), we get a quadratic objective (10) to optimize the feature weights:

(10)

where the kernel term in (10) is the inner product between the corresponding features of the two samples.
For a given set of Lagrange multipliers, Equation (10) is a quadratic problem in the feature weights. We improve the iterative optimization algorithm of Table 1 accordingly, as shown in Table 2. In our experiments, the basic kernel function is set to a linear function, which means that the optimal kernel function is a convex combination of these basic linear functions. When all the feature weights are equal, the optimal kernel function degenerates to a linear function. This demonstrates that the iterative optimization algorithm in the original space is a special case of the iterative optimization algorithm in the feature space.

Table 2. Pseudo code for the two-step iterative optimization algorithm in feature space
Iterative Optimization Algorithm in Feature Space
1. Initialization: initialize the feature weights and the maximum number of iterations.
2. Iteration:
   2.1 Solve for the Lagrange multipliers:
       2.1.1 Solve Equation (8) for the current feature weights to get the multipliers.
       2.1.2 Use the multipliers to update the parameters.
   2.2 Solve Equation (10) for the current multipliers to get the optimal feature weights.
   2.3 If converged or the maximum number of iterations is reached, stop; else go to 2.1.
4 Experiment Results and Analyses

In this section, we compare the performance of our algorithm with that of zero-imputation (the values of absent features are filled with zero), mean-imputation (the values of absent features are filled with the average over all of the samples) and the max-margin algorithm on 14 benchmark data sets from the UCI Repository. Detailed information on these data sets is summarized in Table 3.
Table 3. The information of data sets used in experiments

Data set     #samples  #features  #existing features  #additional features  #training samples  #test samples
bupa         345       6          3                   3                     230                115
glass        146       9          5                   4                     98                 48
ionospere    351       34         24                  10                    234                117
iris         100       4          2                   2                     66                 34
monks        124       6          3                   3                     83                 41
musk         475       166        100                 66                    317                158
pima         768       8          4                   4                     512                256
promoter     106       57         45                  12                    71                 35
tic-tac-toc  958       9          5                   4                     639                319
voting       435       16         9                   7                     290                145
wdbc         569       30         18                  12                    379                190
wine         130       13         7                   6                     87                 43
spambase     4601      57         32                  25                    3068               1533
waveform     3345      40         20                  20                    2330               1115
In Table 3, for the Glass and Waveform data sets, we only take two-class samples. In the experiments, the optimal kernel, which is a convex combination of a series of linear kernel functions, is adopted. For each data set, a random matrix is generated in each experiment, and feature values are set to missing according to this random matrix and a given missing threshold. To overcome the influence of the random matrix on classification prediction, we obtain the statistical results by repeating each experiment 30 times. The performance of the algorithms is measured by means of prediction accuracy. The reported results are obtained by 3-fold cross validation. In the zero-filling algorithm and the mean-filling algorithm, the absent features are completed with zero and with the mean of the observed features, respectively; then a standard SVM classification algorithm is applied to the completed data. We denote the iterative optimization algorithm in the original space and the iterative optimization algorithm in the feature space as Algorithm1 and Algorithm2, respectively. The following parts demonstrate the experimental results of Algorithm1 and Algorithm2, respectively.

4.1 The Experiment Results of Algorithm1 and Other Algorithms

This subsection first illustrates the classification accuracy of Algorithm1, the zero-filling algorithm and the mean-filling algorithm, and then plots the classification accuracy curves of the three algorithms as the missing ratio varies.
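Before turning to the results, a minimal sketch of the data preparation described above: a random matrix marks entries of the additional-feature block as missing according to the threshold, and the two imputation baselines fill them with zero or with the column means of the observed values. The masking direction and the NaN encoding are our own assumptions.

```python
import numpy as np

def make_missing(X_add, theta, seed=0):
    """Mark entries of the additional-feature block as missing (NaN) using a random matrix and threshold theta."""
    rng = np.random.default_rng(seed)
    R = rng.random(X_add.shape)
    X_missing = X_add.astype(float).copy()
    X_missing[R < theta] = np.nan
    return X_missing

def zero_fill(X):
    """Zero-imputation baseline."""
    return np.nan_to_num(X, nan=0.0)

def mean_fill(X):
    """Mean-imputation baseline using the column means of the observed values."""
    col_means = np.nanmean(X, axis=0)
    filled = X.copy()
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(col_means, idx[1])
    return filled
```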
Table 4. Prediction accuracy of three algorithms

Data set     Algorithm1         zero-filling        mean-filling
bupa         57.5731±1.2633     56.8713±0.7272      60.5702±0.9386
glass        66.1458±3.7539     61.0069±6.6696      59.8611±6.3791
ionospere    82.1510±5.8258     81.2536±2.5666      80.9687±2.3525
iris         99.0625±1.2170     97.9167±2.3152      98.8542±1.6205
monks        54.7500±5.8559     54.1667±5.8926      55.3750±5.2510
musk         58.5970±5.0868     54.1772±1.0077      54.7046±1.0988
pima         74.0850±3.3262     75.4314±1.8304      75.1373±2.7241
promoter     62.9902±13.8341    60.9804±12.7562     60.4412±14.6031
tic-tac-toc  53.5168±8.4123     41.8553±8.1801      41.9916±7.8059
voting       95.5057±1.9797     95.8276±1.3666      95.5977±1.3504
wdbc         91.4638±2.1704     91.1640±2.8884      92.4515±3.2891
wine         91.5476±2.1318     90.7937±4.5905      89.6032±6.3846
spambase     84.4412±1.1707     85.0283±1.7815      85.1348±1.7421
waveform     94.1420±0.7427     94.0996±0.5444      94.1077±0.5481
The prediction accuracy of our algorithm, the zero-filling algorithm and the mean-filling algorithm is summarized in Table 4 for a given missing threshold. Each result is denoted as mean accuracy ± standard error. The best method for each data set is marked in Table 4. From Table 4, we see that in most cases our algorithm can outperform the other two algorithms in prediction accuracy, and in the other cases our algorithm achieves comparable prediction accuracy. In order to investigate the influence of the missing ratio on Algorithm1 and the other algorithms, we plot curves of the prediction accuracy as the missing ratio varies, for different numbers of additional features, in Figures 1, 2, 3 and 4, respectively.
Fig. 1. The accuracy with different missing ratio(#(additional feature)=5)
Fig. 2. The accuracy with different missing ratio(#(additional feature)=10)
Considering the fact that the missing ratio is no more than 0.6 in real-world applications, the missing ratio varied from 0 to 0.6 with an increment of 0.1. From Figures 1, 2, 3 and 4, one can see that our algorithm shows higher prediction accuracy than the other two algorithms as the missing ratio increases. We also see that our algorithm performs better as the number of additional features increases, which indicates that the performance of our algorithm is more robust than that of the other two algorithms.

4.2 The Experiment Results of Algorithm2 and Other Algorithms

This subsection illustrates the classification accuracy of Algorithm2, the zero-filling algorithm, the mean-filling algorithm and the max-margin feature algorithm, and then plots the prediction accuracy curves of the four algorithms as the missing ratio varies.
Fig. 3. The accuracy with different missing ratio(#(additional feature)=15)
Fig. 4. The accuracy with different missing ratio (#(additional feature)=20)

Table 5. Prediction accuracy of four algorithms

Data set     Algorithm2         zero-filling        mean-filling        Max-margin
bupa         75.4721±1.0532     56.8713±0.7272      60.5702±0.9386      68.3512±1.2356
glass        67.3458±3.2546     61.0069±6.6696      59.8611±6.3791      52.7519±3.3486
ionospere    91.3547±5.6743     81.2536±2.5666      80.9687±2.3525      81.9488±2.5578
iris         99.3675±1.1350     97.9167±2.3152      98.8542±1.6205      99.3547±1.7365
monks        56.2500±4.7379     54.1667±5.8926      55.3750±5.2510      54.8768±5.4976
musk         69.7987±4.3812     54.1772±1.0077      54.7046±1.0988      56.8467±5.4936
pima         73.2854±2.4274     75.4314±1.8304      75.1373±2.7241      77.3697±3.8723
promoter     62.9902±13.8341    60.9804±12.7562     60.4412±14.6031     59.9471±12.7227
tic-tac-toc  57.4188±7.7183     41.8553±8.1801      41.9916±7.8059      60.7687±7.4071
voting       97.5997±1.6587     95.8276±1.3666      95.5977±1.3504      96.7976±1.4734
wdbc         91.4638±3.0908     91.1640±2.8884      92.4515±3.2891      92.7507±3.1211
wine         94.7487±2.2742     90.7937±4.5905      89.6032±6.3846      89.9602±5.8826
spambase     89.5671±1.7231     85.0283±1.7815      85.1348±1.7421      87.8339±1.3649
waveform     94.1420±0.8456     94.0996±0.5444      94.1077±0.5481      93.3903±0.5857
The prediction accuracy of our algorithm, the zero-filling algorithm, the mean-filling algorithm and the max-margin absent feature algorithm is summarized in Table 5 for a given missing threshold. Each result is denoted as mean accuracy ± standard error. In order to investigate the influence of the missing ratio on the four algorithms, we plot curves of the prediction accuracy as the missing ratio varies from zero to 0.6 with an increment of 0.1 in Figures 5 and 6, on Waveform and Spambase with additional features, respectively.
Fig. 5. The accuracy with different missing ratio on Waveform (#(additional feature)=20)
Fig. 6. The accuracy with different missing ratio on Spambase(#(additional feature)=25)
From Figures 5 and 6, we can see that as the missing ratio increases, the prediction accuracy of all four algorithms decreases. Moreover, Algorithm2 shows higher prediction accuracy than the other three algorithms as the missing ratio increases, which indicates that the performance of our algorithm is more robust than that of the other three algorithms. Algorithm1 and Algorithm2 can reduce training time by making use of the structural parameters trained from existing features and only training the additional features in a large margin framework. Moreover, by avoiding completing the data in the pre-processing phase, our algorithm can offer considerable computational savings. However, Algorithm2 costs much more computational resources than the other three algorithms in optimizing the feature weights, especially when samples contain a large number of features.
Another fact that should be emphasized is that we have not analyzed the convergence of Algorithm1 and Algorithm2 in theory. However, in all experiments, both Algorithm1 and Algorithm2 were able to achieve an optimal solution.
5 Conclusions and Future Work

We propose an algorithm to learn classifiers from samples which have additional features, some of which are absent due to measurement noise or corruption, using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, which aims to maximize the margin of each sample in its own relevant subspace. In this formulation, we make use of the structural parameters trained from existing features and only optimize the structural parameters trained from additional features. A two-step iterative procedure for solving the objective function is proposed. We conduct experiments to evaluate the performance of the proposed algorithm, the zero-filling algorithm and the mean-filling algorithm on a large number of UCI data sets. In general, our algorithm can achieve better prediction accuracy and reduce the training time. Our future work is to select a suitable kernel function for absent features and to extend Algorithm2 to handle the case in which some of the existing features are also absent. Algorithm2 costs much more computational resources in optimizing the feature weights than the other three algorithms, so a more efficient algorithm is also future work. Moreover, the theoretical convergence analysis of Algorithm1 and Algorithm2 and the application of our algorithms to real-world problems are further future work.
Acknowledgement This work is supported by the National Natural Science Foundation of China (No.60603015), the Foundation for the Author of National Excellent Doctoral Dissertation, the Scientific Research Fund of Hunan Provincial Education (the Foundation for the Author of Hunan Provincial Excellent Doctoral Dissertation ). We thank the reviewers for their constructive and insightful comments on the first submitted version of this paper.
References [1] Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (1987) [2] Roth, P.: Missing data: A conceptual review for applied psychologists. Personnel Psychology 47(3), 537–560 (1994) [3] Ghahramani, Z., Jordan, M.I.: Supervised learning from incomplete data via an EM approach. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6, pp. 120–127. Morgan Kaufmann Publishers, Inc., San Francisco (1994)
[4] Kapoor, A.: Learning Discriminative Models with Incomplete Data. PhD thesis, MIT Media Lab, Feburary (2006) [5] Chechik, G., Heitz, G., Elidan, G., Abbeel, P., Koller, D.: Max-margin Classification of Data with Absent Features. Journal of Machine Learning Research 9, 1–21 (2008) [6] Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the Kernel with Hyperkernels. Journal of Machine Learning Research 6, 1043–1071 (2005) [7] Crammer, K., Keshet, J., Singer, Y.: Kernel design using boosting. In: Advances in Neural Information Processing Systems, vol. 15, pp. 537–544 (2002) [8] Ong, C.S., Smola, A.J.: Machine learning using hyperkernels. In: Proceedings of the International Conference on Machine Learning, pp. 568–575 (2003) [9] Liu, X., Zhang, G.: An Incremental Feature Learning Algorithm Based on Least Square Support Vector Machine. In: Preparata, F.P., Wu, X., Yin, J. (eds.) FAW 2008. LNCS, vol. 5059, pp. 330–338. Springer, Heidelberg (2008) [10] Kazushi, I., Takemasa, Y.: Incremental support vector machines and their geometrical analyses. Neuro-computing, 2528–2533 (2007) [11] Wang, L., Yang, C., Feng, J.: On learning with dissimilarity functions. In: Proceedings of the 24th international conference on machine learning, pp. 991–998 (2007) [12] Dick, U., Haider, P., Scheffer, T.: Learning from Incomplete Data with Infinite Imputations. In: Proceedings of the 25th international conference on machine learning, pp. 232– 239 (2008) [13] Williams, D., Carin, L.: Analytical kernel ma-trix completion with incomplete multi-view data. In: Proceedings of the ICML Workshop on Learning With Multiple Views (2005) [14] Williams, D., Liao, X., Xue, Y., Carin, L.: Incomplete-data classification using logistic regression. In: Proceedings of the 22nd International Conference on Machine learning (2005) [15] Chen, Z., Li, J.: Least Square Support Vector Machine. In: Proceedings of the 2006 International Conference on Computational Intelligence and Security, pp. 176–179 (2006) [16] Asuncion, A., Newman, D.J.: UCI machine learning repository [OL] (2009), http://archive.ics.uci.edu/ml/
Sequential EM for Unsupervised Adaptive Gaussian Mixture Model Based Classifier Bashar Awwad Shiekh Hasan and John Q. Gan School of Computer Science and Electronic Engineering University of Essex Wivenhoe Park, CO4 3SQ, UK {bawwad,jqgan}@essex.ac.uk
Abstract. In this paper we present a sequential expectation maximization algorithm to adapt in an unsupervised manner a Gaussian mixture model for a classification problem. The goal is to adapt the Gaussian mixture model to cope with the non-stationarity in the data to classify and hence preserve the classification accuracy. Experimental results on synthetic data show that this method is able to learn the time-varying statistical features in data by adapting a Gaussian mixture model online. In order to control the adaptation method and to ensure the stability of the adapted model, we introduce an index to detect when the adaptation would fail.
1 Introduction
Gaussian mixture model (GMM) is a successful and simple clustering method that is widely used in many application domains. In the GMM configuration the data are assumed to be generated from a finite number of Gaussian distributions. The data are then modeled by a probability density function

p(x) = Σ_{k=1}^{K} πk N(x|μk, Σk) .      (1)

where K is the number of Gaussian components, π1, . . . , πK are the mixing coefficients, and N(x|μk, Σk) is a Gaussian distribution with mean μk and covariance Σk. The πk should satisfy the following conditions:

0 ≤ πk ≤ 1      (2)

and

Σ_{k=1}^{K} πk = 1 .      (3)
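A minimal sketch of evaluating the mixture density of Equation (1) for a single point, assuming SciPy is available (the function name is ours).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a single point x."""
    return float(sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
                     for pi, mu, cov in zip(pis, mus, covs)))
```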
In order to better model the data, a training method is required to estimate the model parameters. One well-known and widely used method is Expectation-Maximization (EM). EM works by alternating between two steps: the E step that
uses the values of the current model parameters to calculate the responsibilities of the Gaussian components, and the M step, in which the responsibilities are used to re-estimate the model parameters [1]. After a number of iterations the method converges to a model that maximizes the log likelihood of the data points given the model. EM is a batch method that provides a very good estimate of the clusters when the number of clusters is carefully chosen.

In [2] the infinite GMM was introduced to sidestep the difficult problem of deciding the "right" number of mixture components. The inference in the model is done using an efficient parameter-free Markov chain that relies entirely on Gibbs sampling. This results in adding new components to the mixture as new data arrive, which can work well for adapting to unforeseen data, but it does not work within a classification configuration. In [3] and [4], Bayesian Information Criterion based incremental methods were developed to build a GMM with the best number of components. Those methods focus only on clustering problems and they still require all the data offline. Some customized adaptive methods for GMM can be found in the literature, for example in [5], where the adaptation scheme is based on a constant learning rate that is used to update the GMM while the number of components changes. This method works only when the data represent several clusters of one class.

In this paper we introduce a sequential version of EM to train a GMM in the case where the statistical features of the data are changing over time. The GMM built here is meant to be used for classification, which introduces more constraints on how to adapt the components in a way that maintains or improves the classification accuracy. We also introduce a method to detect misrepresentation of the classes.
2 Method

2.1 Gaussian Mixture Model and EM
Here we briefly introduce a formulation of Gaussian mixtures in terms of discrete latent variables, as it was introduced and discussed in depth in [1]. Let us introduce a K-dimensional binary random variable z having a 1-of-K representation in which a particular element z_k is equal to 1 and all other elements are equal to 0. We will define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z). The marginal distribution over z is specified in terms of the mixing coefficients π_k, with p(z_k = 1) = π_k, so that p(z) = Π_{k=1}^{K} π_k^{z_k}. The conditional distribution of x given a particular value of z is p(x | z_k = 1) = N(x | μ_k, Σ_k), and p(x|z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}. The joint distribution is then given by p(z)p(x|z), and

p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k) .   (4)
The introduction of z helps in calculating the responsibility of each component in the mixture, γ(z_k) = p(z_k = 1 | x). The EM algorithm is shown in Method 1, where γ(z_{nk}) is the responsibility associated with data point x_n and γ(z_k) is the responsibility associated with all the data points x.

Method 1. The standard EM for GMM

E-STEP: Evaluate responsibilities using current parameter values

γ(z_k) = p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | μ_j, Σ_j) .   (5)

M-STEP: Re-estimate parameters using current responsibilities

μ_k^{new} = (1 / N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n .   (6)

Σ_k^{new} = (1 / N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n − μ_k)(x_n − μ_k)^T .   (7)

π_k^{new} = N_k / N ,   (8)

where N_k = Σ_{n=1}^{N} γ(z_{nk}).
The initial values for the model parameters can be calculated using a simple clustering method; in this study the k-means algorithm is used.

2.2 Classification with Unsupervised Training
In the literature Gaussian mixture models have been used for classification in a supervised way [6][7]. A GMM is built for each class and then a Bayesian classifier is used to classify new data (when the classes have the same prior the likelihood is enough for classification). This approach is not suitable for unsupervised adaptation as the labels are always required for training/adaptation. In order to overcome this problem we took a different approach. One GMM was built for all the training/offline data and then the available labels were used to calculate p(c = class_i | z_k), the probability of the class being class_i when the data point is generated from component z_k. To calculate the probability of the data point x belonging to class class_i, p(c = class_i | x) is calculated as follows:

p(c = class_i | x) = Σ_{k=1}^{K} p(c = class_i | z_k, x) p(z_k | x) ,   (9)
where p(z_k | x) is the responsibility of component z_k for generating the point x. Assuming p(c = class_i | z_k, x) is independent of x, then

p(c = class_i | x) = Σ_{k=1}^{K} p(c = class_i | z_k) p(z_k | x) .   (10)
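A minimal sketch of this scheme is given below, using scikit-learn's GaussianMixture as the offline GMM; the function and variable names are illustrative and not part of the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_unsupervised_classifier(X_train, y_train, n_components=6, n_classes=2):
    """Fit one GMM on all offline data; use the labels only to estimate
    p(c = class_i | z_k) for every component k, as needed by Eq. (10)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(X_train)
    resp = gmm.predict_proba(X_train)            # p(z_k | x_n), shape (N, K)
    class_prob = np.zeros((n_components, n_classes))
    for c in range(n_classes):
        class_prob[:, c] = resp[y_train == c].sum(axis=0)
    class_prob /= class_prob.sum(axis=1, keepdims=True)   # p(class | component)
    return gmm, class_prob

def classify(gmm, class_prob, X):
    """p(c = class_i | x) = sum_k p(class_i | z_k) p(z_k | x)  (Eq. 10)."""
    resp = gmm.predict_proba(X)                  # p(z_k | x)
    posterior = resp @ class_prob                # shape (M, n_classes)
    return posterior.argmax(axis=1)
```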
2.3 Sequential EM (SEM)
EM as an optimization method uses a batch of the available data to tune the model parameters and hence better model the data. In the case where we have an incoming stream of non-stationary data, the model built on the training/offline data will not be able to cope with these changes, and this will cause a drop in the classification accuracy because of the misrepresentation of the new data by the old model. To tackle this problem a sequential/online version of the EM method for GMM is introduced here. The main idea behind this method is to rewrite each parameter/hyper-parameter as a convex combination of old and new data. This is valid here as all the parameters/hyper-parameters used by EM are sufficient statistics of the distribution. Method 2 outlines the sequential EM method, where t is the current time point.

Method 2. The sequential EM for GMM

E-STEP: Evaluate responsibility using parameters at t − 1 and x at t

γ(z_k^t) = π_k^{t−1} N(x^t | μ_k^{t−1}, Σ_k^{t−1}) / Σ_{j=1}^{K} π_j^{t−1} N(x^t | μ_j^{t−1}, Σ_j^{t−1}) .   (11)

M-STEP: Adapt the model parameters

μ_k^t = (1 / N_k^t) (N_k^{t−1} μ_k^{t−1} + γ(z_k^t) x^t) .   (12)

Σ_k^t = (1 / N_k^t) (N_k^{t−1} Σ_k^{t−1} + γ(z_k^t) (x^t − μ_k^t)(x^t − μ_k^t)^T) .   (13)

π_k^t = N_k^t / t ,   (14)

where N_k^t = N_k^{t−1} + γ(z_k^t).
In the E-step we calculate the responsibility γ(z_k^t) associated with a new data point x^t at time t based on the model parameters estimated at time t − 1, using the new data point x^t only. In the M-step, the mean μ_k^t is estimated based on the previous estimate μ_k^{t−1} and the new data point x^t; similarly, Σ_k^t and N_k^t are calculated based on the previously estimated parameters Σ_k^{t−1}, N_k^{t−1}, and x^t.
γ(z_k^t) controls the adaptation process, so the cluster that is closer to the new sample adapts more than one further away. The sequential method is initialized by a model generated by the standard method trained on the offline training data. The presented SEM is a parametric adaptive method, which does not take into consideration the problem of the optimal number of clusters that represent the data. The assumption is that the model trained over the offline data has already been chosen to optimally represent the data, using cross-validation methods for example.
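The following compact sketch implements one SEM step (Eqs. 11-14), assuming the model is stored as plain NumPy arrays; the names are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    d = x - mu
    k = len(x)
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
        np.sqrt(((2 * np.pi) ** k) * np.linalg.det(Sigma))

def sem_update(x_t, pis, mus, Sigmas, Ns, t):
    """One sequential-EM step for a single new point x_t at time t.
    pis, mus, Sigmas, Ns hold the parameters estimated at time t-1."""
    K = len(pis)
    # E-step: responsibilities of the new point under the t-1 model (Eq. 11)
    dens = np.array([pis[k] * gaussian_pdf(x_t, mus[k], Sigmas[k]) for k in range(K)])
    gammas = dens / dens.sum()
    # M-step: convex combination of the old sufficient statistics and the new point
    for k in range(K):
        Ns[k] = Ns[k] + gammas[k]                                          # N_k^t
        mus[k] = ((Ns[k] - gammas[k]) * mus[k] + gammas[k] * x_t) / Ns[k]  # Eq. (12)
        d = (x_t - mus[k]).reshape(-1, 1)
        Sigmas[k] = ((Ns[k] - gammas[k]) * Sigmas[k]
                     + gammas[k] * d @ d.T) / Ns[k]                        # Eq. (13)
        pis[k] = Ns[k] / t                                                 # Eq. (14)
    return pis, mus, Sigmas, Ns
```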
2.4 Adaptation Failure Detection
As the data shift over time, the Gaussian components adapted with SEM will try to keep track of the changes, but they are still bounded by the old data. So when a dramatic change in the new data occurs, the adaptation method will most likely fail to represent the new data. This will affect the classification accuracy and most likely the system will end up with fewer dominant components. The other components still exist but they do not play any actual role in representing/classifying the new data. Figure 1 shows the components' responsibilities (summed over all data in the adaptation window); the straight lines are components that represent class 1 and the boxed lines are components that represent class 2. It is clear that one component becomes dominant over time and hence badly affects the classification accuracy.

Another cause of adaptation failure is due to changes in the probability of a class given a Gaussian component, p(c = class_i | z_k). The model might still be adapted well to the new data, but because of the overlap between the data of the two classes the Gaussian components might shift to represent one class better than the other(s).

In order to enhance the system performance, a failure detection method is necessary to know when to stop adaptation or re-train the system. We have used a simple and efficient method for failure detection based on the responsibility of the components for generating the new data, p(z_k | X), X = {x_1, ..., x_N}, and the probability of the class given the component, p(c = class_i | z_k). Let cc_i = {k : p(c = class_i | z_k) = max_j p(c = class_j | z_k), j = 1, 2}, which contains the indexes of the components that represent class_i. We then define

cl_i = Σ_{n=1}^{N} α_{in} ,   (15)

where

α_{in} = p(z_k | x_n) if p(z_k | x_n) = max_j p(z_j | x_n), j = 1, ..., K, and k ∈ cc_i; 0 otherwise.   (16)

cl_i gives an index of how probable class_i is among the data X. The failure detection index is then defined as
FDI = cl_1 / cl_2 .   (17)
When FDI > ul = 2.0 or FDI < ll = 0.5, one of the classes has dominant component(s) and the adaptive model has failed. For a more strict/loose constraint, the upper limit ul and the lower limit ll can be changed accordingly. FDI deals with the first cause of adaptation failure. Detecting the second problem is much harder as the labels are not available online. Instead of trying to detect the shifts of the components between the classes, we try to re-calculate the probability p(c = class_i | z_k) after several adaptation batches. This is not a very efficient method, though, and it might actually affect the classification results, especially if the classes overlap too much.
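A short sketch of the FDI computation for a two-class problem follows; the argument names and array layout are assumptions of this sketch.

```python
import numpy as np

def failure_detection_index(resp, class_prob, ul=2.0, ll=0.5):
    """Compute FDI = cl_1 / cl_2 (Eqs. 15-17) over an adaptation window.
    resp[n, k]      : p(z_k | x_n) for the window data
    class_prob[k, c]: p(c = class_c | z_k) estimated on the offline data"""
    winners = resp.argmax(axis=1)            # component with maximal responsibility per point
    comp_class = class_prob.argmax(axis=1)   # class represented by each component (the sets cc_i)
    cl = np.zeros(class_prob.shape[1])
    for n, k in enumerate(winners):
        # alpha_in is non-zero only for the winning component, and it counts
        # towards the class that this component represents (Eq. 16)
        cl[comp_class[k]] += resp[n, k]
    fdi = cl[0] / cl[1]                      # Eq. (17)
    failed = (fdi > ul) or (fdi < ll)
    return fdi, failed
```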
Fig. 1. Responsibilities of the adaptive GMM components over a window of data (responsibility vs. time)
2.5 The Algorithm
The method starts by modeling the offline data using a predefined number of Gaussian distributions. The adaptation is then done on fixed-size windows of the online data; after each adaptation window the FDI is calculated to check the adapted system. If the adaptation has failed, re-training is used to build a model for the new data. Algorithm 1 shows an outline of the adaptive algorithm, where classifyComponents is a method that calculates p(c = class_i | z_k). It should be pointed out that step 12 in the algorithm is not mandatory; one might want to stop the adaptation instead (see the discussion for more about this).
Algorithm 1. Adaptive GMM
1:  model = EMGMM(offlineData)
2:  classProb = classifyComponents(model, offlineData, labels)
3:  previousModel = model
4:  newModel = model
5:  while there is new data do
6:    classify(newDataPoint, previousModel, classProb)
7:    newModel = SEMGMM(newModel, newDataPoint)
8:    if size(newData) = adaptationWindow then
9:      FDI = calculateFDI(newModel, newData)
10:     if FDI > ul || FDI < ll then
11:       newModel = EMGMM(newData)
12:       classProb = classifyComponents(model, newData, newLabels)
13:     end if
14:     previousModel = newModel
15:   end if
16: end while

3 Experimental Results

3.1 Introduction
In order to objectively test the proposed method, synthesized data were used instead of real-life data. The data were generated in a way that satisfies a number of criteria:
– the data present a two-class problem
– the number of data points is balanced between the two classes
– the data points are generated from a Gaussian mixture model with an equal number of components for each class
– the data are linearly separable (between 80%-90% in our case)
– the data are non-stationary over time
– the non-stationary change in the data follows a pre-defined path
These criteria are necessary to make sure that the generated data represent the problem we are trying to solve here, and they make it possible to evaluate the method objectively.

3.2 Synthetic Data Generation
Here we detail the data generation method that complies with the previously stated criteria. The data were generated from a Gaussian mixture model with 6 components that represent 2 classes (3 components per class). The components' means were selected randomly on either side of a linear hyperplane, and the covariances were selected randomly as well. 2000 data points were sampled from the original model and used as the offline data to train a Gaussian mixture model with 6
components. The original model for generating the data was then shifted on a curve from θ = 0 to θ = π/2 in 10 consecutive steps, as shown in Fig. 2. In each of these steps the covariances were scaled randomly and then another 2000 data points were sampled from the shifted model; this ensures control over the non-stationarity in the data. As the probabilities of both classes are assumed to be the same, the simulated data are balanced, and the ongoing streams use small windows of samples alternating between the two classes. The assumption of balanced online data does not affect the generality of the method, as this can be satisfied in most application areas by using a proper adaptation window that covers data from both classes, using some prior knowledge of the domain.
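A minimal sketch of this generation protocol is given below for two-dimensional data as in Fig. 2; the rotation about the origin and the covariance scaling range are assumptions of the sketch, not values taken from the paper.

```python
import numpy as np

def sample_shifted_stream(mus0, Sigmas0, labels, n_steps=10, n_per_step=2000, seed=None):
    """Generate the non-stationary stream: the component means are rotated from
    theta = 0 to theta = pi/2 in n_steps steps and the covariances are rescaled
    randomly before each block of 2000 samples is drawn."""
    rng = np.random.default_rng(seed)
    K = len(mus0)
    blocks = []
    for step in range(1, n_steps + 1):
        theta = (np.pi / 2) * step / n_steps
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        X, y = [], []
        for _ in range(n_per_step):
            k = rng.integers(K)                         # balanced classes on average
            mu = R @ mus0[k]                            # mean shifted along the curve
            Sigma = Sigmas0[k] * rng.uniform(0.5, 1.5)  # random covariance scaling
            X.append(rng.multivariate_normal(mu, Sigma))
            y.append(labels[k])
        blocks.append((np.array(X), np.array(y)))
    return blocks
```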
Fig. 2. Path of the change in the data means (means of the original data model shifted with time)
3.3 Results
To test the adaptation method, we first provide the results without the use of FDI, so steps 9-13 are removed from the algorithm. The adaptation window is taken as 2000. The performance of the adapted model is tested on the data from the following window. Here we present the results from 10 data sets. Table 1 shows the results using the model built on the offline data. Table 2 demonstrates the results using the adaptive method. Figure 3 shows the change of the average accuracy over all the data sets. In the same figure we added the p-values calculated using the Wilcoxon signed-rank test, as suggested in [8]. The signed-rank test results show a significant enhancement of accuracy when the adaptive method is applied. Although the focus in this work is on unsupervised adaptation, we also present the results achieved using re-training after failure detection with FDI, to show its usability, as in Table 3.
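As an illustration of the statistical comparison, the Wilcoxon signed-rank test over the 10 data sets at one time point can be run as follows (assuming SciPy is available); the two vectors are the T2 columns of Tables 1 and 2.

```python
import numpy as np
from scipy.stats import wilcoxon

# Accuracies of the static and the adaptive model on the 10 data sets at T2
static_acc   = np.array([0.7610, 0.8415, 0.8985, 0.9745, 0.879,
                         0.908, 0.657, 0.549, 0.832, 0.785])
adaptive_acc = np.array([0.956, 0.8535, 0.9015, 0.985, 0.921,
                         0.915, 0.6265, 0.8385, 0.849, 0.8005])

stat, p_value = wilcoxon(static_acc, adaptive_acc)
print(f"Wilcoxon signed-rank p-value: {p_value:.6f}")
```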
Table 1. Results using the static model

Dataset T1     T2     T3     T4     T5     T6     T7     T8     T9     T10
DS1     0.9370 0.7610 0.6840 0.6980 0.6575 0.7325 0.6760 0.6770 0.6740 0.7195
DS2     0.9040 0.8415 0.8470 0.7705 0.8105 0.8375 0.8260 0.7335 0.5630 0.4670
DS3     0.9095 0.8985 0.9060 0.8920 0.8665 0.7890 0.6485 0.5865 0.5400 0.4980
DS4     0.9790 0.9745 0.9805 0.9945 0.9510 0.9625 0.9145 0.8485 0.7435 0.6565
DS5     0.912  0.879  0.8685 0.882  0.8845 0.696  0.559  0.5405 0.521  0.5125
DS6     0.925  0.908  0.8835 0.7815 0.733  0.6305 0.5955 0.5385 0.506  0.5
DS7     0.75   0.657  0.6095 0.6055 0.6565 0.647  0.6775 0.71   0.604  0.472
DS8     0.6935 0.549  0.599  0.5925 0.562  0.706  0.718  0.565  0.544  0.511
DS9     0.901  0.832  0.736  0.594  0.552  0.4435 0.4265 0.3825 0.479  0.7445
DS10    0.82   0.785  0.738  0.743  0.654  0.604  0.5825 0.5725 0.5605 0.499
Table 2. Results using the adaptive algorithm without re-training

Dataset T1     T2     T3     T4     T5     T6     T7     T8     T9     T10
DS1     0.937  0.956  0.9475 0.9615 0.9595 0.963  0.9785 0.9655 0.9645 0.961
DS2     0.904  0.8535 0.8855 0.81   0.8285 0.8095 0.8235 0.829  0.831  0.8445
DS3     0.9095 0.9015 0.9285 0.9165 0.9235 0.9095 0.8415 0.869  0.811  0.763
DS4     0.979  0.985  0.973  0.991  1      0.9915 0.992  0.982  0.9875 0.962
DS5     0.9115 0.921  0.921  0.9285 0.926  0.887  0.8915 0.8715 0.8315 0.815
DS6     0.925  0.915  0.918  0.9065 0.8685 0.748  0.706  0.6155 0.5235 0.5135
DS7     0.75   0.6265 0.5    0.498  0.499  0.494  0.501  0.536  0.49   0.464
DS8     0.6925 0.8385 0.895  0.869  0.8765 0.8635 0.814  0.852  0.77   0.682
DS9     0.901  0.849  0.874  0.8535 0.842  0.8395 0.8    0.816  0.877  0.839
DS10    0.82   0.8005 0.767  0.8065 0.7195 0.6635 0.627  0.58   0.4955 0.483

4 Discussion and Conclusion
In this paper we have presented a sequential EM method to adapt a Gaussian mixture model in a classification configuration. In addition, we have defined an adaptation failure detection index. This method is suitable for cases where the data statistics shift over time; EMG and EEG data are examples of such a case. The data are expected to change slowly over time; sudden changes can be much harder to capture with our method. In [9], Neal presented a justification for an online variant of EM. He showed that an online EM based on a sequential E-step can converge faster than the standard EM. Sato [10] showed that the online EM algorithm can be considered as a stochastic approximation method to find the maximum likelihood estimator. Although we did not present a discount factor similar to the one presented by Sato, the online EM method for GMM presented here follows the general scheme of online EM in the literature. Here we did not deal with the problem of the optimal number of Gaussian components that represent each class. Some work on incremental addition/removal
Fig. 3. Average accuracies over time using the static model (continuous line) and the adaptive model (discrete line), with the Wilcoxon signed-rank p-values annotated at each time point (accuracy vs. time)

Table 3. Results using the adaptive algorithm with re-training

Dataset T1     T2     T3     T4     T5     T6     T7     T8     T9     T10
DS1     0.937  0.956  0.9475 0.9615 0.9595 0.963  0.9785 0.9655 0.9645 0.961
DS2     0.904  0.8535 0.8855 0.81   0.8285 0.8095 0.8235 0.829  0.831  0.8445
DS3     0.9095 0.9015 0.9285 0.9165 0.9235 0.9095 0.8415 0.869  0.811  0.997
DS4     0.979  0.985  0.973  0.991  1      0.9915 0.992  0.982  0.9875 0.962
DS5     0.9115 0.921  0.921  0.9285 0.926  0.887  0.8915 0.8715 0.8315 0.815
DS6     0.925  0.915  0.918  0.9065 0.8685 0.748  0.925  0.8395 0.928  0.803
DS7     0.75   0.8355 0.6835 0.69   0.7285 0.8475 0.83   0.7405 0.9855 0.838
DS8     0.6925 0.8385 0.895  0.869  0.8765 0.8635 0.814  0.852  0.77   0.682
DS9     0.901  0.849  0.874  0.8535 0.842  0.8395 0.8    0.816  0.877  0.839
DS10    0.82   0.8005 0.767  0.8605 0.8275 0.8135 0.788  0.675  0.8805 0.755
of Gaussian components was presented in [3][4], but in a classification configuration this is a very difficult problem. It is hard to know online, in an unsupervised way, the probabilities p(c = class_i | z_n), where z_n is the newly added component, so we assumed the number of components is static and only the model parameters are adapted. In the algorithm presented, Algorithm 1, the p(c = class_i | z_k) are considered static and calculated only when building the original model. These probability distributions can be updated with time between sessions based on the current model classification. The re-training step mentioned in the algorithm can only be used if labels are available (or partially available) online. In a totally unsupervised adaptation scheme,
FDI, which is a fast and reliable measure of the adaptation, indicates when the adaptation fails, and then one might stop adaptation and use the last known stable model. The size of the adaptation window might have a considerable effect on the performance of the adaptation method. A small window might not change the model enough, while a longer window means a larger drop in the ongoing classification until the newly adapted model is used. The selection of the window size is determined mostly by the chosen application. An important feature of such a window is that it provides a balanced number of examples from the two classes. This is important to protect the adaptation method from adapting to one class over the other. In [11], we have applied the proposed method in the field of Brain-Computer Interfaces (BCI). The experimental results showed the usefulness of this approach in building adaptive BCIs.
Acknowledgment The authors would like to thank Prof. Stephen Roberts for his useful input. This work is part of the project “Adaptive Asynchronous Brain Actuated Control” funded by UK EPSRC. Bashar’s study is funded by Aga Khan Foundation.
References 1. Bishop, C.: Pattern recognition and machine learning. Springer, Heidelberg (2006) 2. Rasmussen, C.E.: The infinite Gaussian mixture model. In: Advances in Neural Information Processing Systems, vol. 12, pp. 554–560 (2000) 3. Cheng, S., Wang, H., Fu, H.: A model-selection-based self-splitting Gaussian mixture learning with application to speaker identification. EURASIP Journal on Applied Signal Processing 17, 2626–2639 (2004) 4. Fraley, C., Raftery, A., Wehrensy, R.: Incremental model-based clustering for large datasets with small clusters. Tech. Rep. 439 (2003) 5. Shimada, A., Arita, D., Taniguchi, R.: Dynamic control of adaptive mixture-ofGaussians background model. In: AVSS 2006. Proceedings of the IEEE International Conference on Video and Signal Based Surveillance, vol. 5 (2006) 6. Marques, J., Moreno, P.J.: A study of musical instrument classification using Gaussian mixture models and support vector machines. Tech. Rep. CRL 99/4 (1999) 7. Millan, J.R.: On the need for on-line learning in brain-computer interfaces. In: Proc. IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2877–2882 (2004) 8. Desmar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006) 9. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 355–368 (1998) 10. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Comp. 12(2), 407–432 (2000) 11. Awwad Shiekh Hasan, B., Gan, J.Q.: Unsupervised adaptive GMM for BCI. In: International IEEE EMBS Conf. on Neural Engineering, Antalya, Turkey (2009)
Optimal Double-Kernel Combination for Classification Feng Wang and Hongbin Zhang College of Computer Science, Beijing University of Technology, Beijing 100124, China
[email protected],
[email protected]
Abstract. Traditional kernelised classification methods could not perform well sometimes because of the using of a single and fixed kernel, especially on some complicated data sets. In this paper, a novel optimal double-kernel combination (ODKC) method is proposed for complicated classification tasks. Firstly, data sets are mapped by two basic kernels into different feature spaces respectively, and then three kinds of optimal composite kernels are constructed by integrating information of the two feature spaces. Comparative experiments demonstrate the effectiveness of our methods. Keywords: Classification, kernel learning, pattern recognition.
1 Introduction

The success of SVM [1][2] has extended the use of the kernel method to other linear classification algorithms, such as Kernel Fisher Discriminant (KFD) [3] and Kernel Minimum Squared Error (KMSE) [4], called traditional kernelised classification methods in this paper. The key to delivering high performance is that, through the feature mapping, the geometrical structure of the mapped data is able to adapt well to the corresponding linear algorithms, or at least possesses a better linear separability than that in the input space. Unfortunately, traditional kernelised classification methods sometimes do not perform well, especially on complicated data sets. The reason is that, by using only a single and fixed kernel, traditional methods are not able to adapt well to complicated data sets. Furthermore, the performance of traditional methods can be even worse if an inappropriate kernel or inappropriate kernel parameters are used [5][6]. In order to overcome this disadvantage of the traditional kernelised method, a kind of method that constructs an objective kernel function using label information for specific tasks has received attention recently, such as [7][8][9][10]. Compared with traditional kernelised classification methods, the methods above make progress on the adaptability of kernels to various target data sets. But since they restrict themselves to shallow architectures [11], like traditional kernel methods, they are still limited in their performance on complicated data sets. Among flexible learning algorithms, [11] establishes a distinction between shallow architectures and deep architectures. Shallow architectures are best exemplified by kernel machines and neural networks with only one hidden layer, and deep architectures are perhaps best exemplified by multi-layer neural networks with several hidden layers. [11] presents mathematical and empirical evidence suggesting that shallow architectures are very inefficient for representing complex, highly-varying functions in terms of the required
number of computational components and training data, while deep architectures are proposed as a way to escape from this limitation. Although a number of learning algorithms for deep architectures are available, training such architectures is still largely perceived as a big challenge. As a trade-off between traditional kernel methods and deep-architecture algorithms, a novel optimal double-kernel combination method (ODKC) is proposed in this paper. [12] points out that most kernel combination methods take the risk of losing information in the original kernel matrices, and develops a method that does not lose that information; but the size of its kernel matrix is greatly enlarged, which causes high computational complexity. ODKC has low complexity, because its kernel matrices have the same size as the original kernel matrices and are easier to construct. Also, unlike a single-kernel machine, ODKC is composed of two different basic kernels, which can take advantage of various kernels to better adapt to complicated tasks. Meanwhile, ODKC is non-iterative, since the optimization problem reduces to a generalized eigenvalue problem. The paper is organized as follows. Firstly, the basic idea is proposed in Section 2. Section 3 presents our method in detail. Following that, our algorithm is described in Section 4. In Section 5, experimental results on five datasets are shown. Finally, we give the concluding remarks in Section 6.
2 Basic Idea

Given a training data set {x_1, x_2, ..., x_N} of points x_i, with corresponding labels y_i ∈ {1, 2, ..., m}, where m is the number of different classes, our basic idea is illustrated in Figure 1.
Fig. 1. Feature mapping Φ(·) = φ1 ∘ φ0, with φ1 = Q^T φ2(S), embeds the data of two classes '•' and 'o' from the input space (a) into a feature space (c)
The aim of this work is to seek the optimal objective kernel mapping Φ, called the ODKC kernel mapping, which is composed of two basic kernel mappings, φ0 and φ1. φ0 is chosen as a feature mapping corresponding to a valid basic kernel such as a polynomial kernel, Gaussian kernel or ANOVA kernel [13]. Data mapped by φ0, illustrated in Figure 1(b), may still not be linearly separable, especially for some
complicated data sets. So it is necessary to do another feature mapping in order to reshape geometrical structure of the data in favor of linear classification algorithms. We · as , where is a linear operator, and is defined as choose operator three forms in Section 3.3, just in order to form a unified framework, in which we can discuss the construction of the three kinds of composite kernels easily. How to define and how to optimize the linear operator according to a criterion function are two main focuses in this paper. The criterion function is introduced in Section 2.1. 2.1 The Criterion Function For the classification task, we aim to seek the optimal objective kernel mapping Φ according to a criterion function increasing intra-class compactness and inter-class separability. We, firstly, give the definition of the membership matrix before the criterion function is introduced. is called membership matrix, if the element is:
Definition 2.1.1: matrix
0, 1,
,
。
where f(x) is called membership function, f(xj) = i if and only if i is the label of xj Under appropriate constraints, a reasonable criterion is to minimize the following objective function min
,
,Φ
+
Φ
+
(1)
Φ
,
(2)
Where 1 |
|
Φ
is the centre of data of the k-th class in the feature space corresponding to feature mapping Φ · , |f -1(k)| is the number of data in the k-th class, max
Φ
is the square of the maximum distance between data in the i-th class and its centre in the feature space corresponding to feature mappingΦ · , and|| || is Euclidean norm. incurs a heavy The objective function with our choice of weights penalty if the points Φ are close to the centers of other classes. How to choose weights is to be discussed in detail in Section 3.4. It is obvious that the hypersphere , , Φ that solves the minimization problem is the hypersphere in the feature space containing all the data in the i-th class with the smallest radius .
110
F. Wang and H. Zhang
· and
3 Computations of
Let denote ODKC kernel matrix corresponding to ODKC kernel mapping Φ Based on Eq(2), we will discuss the computations Φ · and in detail below. 3.1 Solve the Optimization Problem of Eq. (2) After the criterion function is proposed, we are going to seek the optimal solution of Φ in the minimization problem of Eq. (2). Firstly, let’s predigest and transform Eq. (2). Let 1 |
|
,the second term of the right-hand side of Eq. (2) can be
and matrix written
(3) +
trace Q
Q
,
where, ,
,…, ,
, ,…,
, ,
+
is a diagonal matrix whose entries are row sums of matrix F, and is a diagonal matrix whose entries are column sums of matrix F. Substitute Eq. (3) into the objective function Eq. (2) and use the constraint 1 we get the minimization problem reduced to
,
min ,
Subject to
+ 1.
Optimal Double-Kernel Combination for Classification
111
Following that, we can solve the constrained and simplified optimization problem by 1, … , for each defining a Lagrangian involving one Lagrange multiplier constraint. ,
,
+
1 .
1, … ,
We then solve by setting partial derivatives of L with respect to 1, … , equal to zero ∂ ∂
2
0
1
(4) and
(5)
0
(6)
1, … , are just the eigenvalues of Eq. (5) shows that Lagrange multipliers 1, … , are the eigenvectors of matrix A. From Eq. (5) and matrix A and Eq. (6), we can see the optimal solution of Lagrangian function is given by the smallest s eigenvalues and corresponding eigenvectors of A. In fact, the symmetric matrix A is indefinite, the eigenvalues of A may be positive, negative or zero. To minimize function L, we let s equal to the number of the negative eigenvalues, and select these negative eigenvalues and corresponding eigenvectors as the solution. As a consequence ,
,
,
0
Hence, the optimal objective kernel mapping we seek is ·
Φ ·
,
· ,
,
where , , are the eigenvectors corresponding to the negative eignvalues of matrix A. · is presented implicitly in the definition of However, because feature mapping A, it is impossible to construct matrix A and compute its eigenvectors 1, … , . In fact, to compute Φ · and , we do not need to construct A and compute its eigenvectors . 3.2 Computations of
· and
∈
Define vector vi RN by the k-th element of vi as and define matrix
,
,…,
,
0,
, we have
and +
Substitute Eq. (7) into the Eq. (5), we have
|
|
.
(7)
112
F. Wang and H. Zhang
+
·
.
(8)
So, 1
+ ,
.
(9)
where 1
+
.
Substitute Eq. (9) into the Eq. (8) and both sides of Eq. (8) are left multiplied , we have with +
where
,
(10)
is a Gram matrix corresponding to feature
. mapping Therefore, given a training set
,
,…,
Φ x ,Φ x ,…,Φ x
in input space, we get
,
,…,
α ,…,α
(11)
are s generalized ,… in the feature space after optimal embeddingΦ · , where eigenvectors of Eq. (10). So, the task of looking for qj has become the generalized eigenvalues problem of Eq. (10). As for a test point , we have Φ , ,
,…,
,
,
.
(12)
And Φ
,Φ ,
,…,Φ ,…,
Φ ,
,…,
,Φ
,…,Φ .
(13)
Optimal Double-Kernel Combination for Classification
113
As can be seen from equation (13), we are not able to compute before we know what matrix is. The next Section will discuss the construction of and computation of . and Computation of
3.3 The Construction of
is proposed just to form a unified framework in which three ODKC kernels can be discussed easily. In this section, we will see that ODKC kernels are easily constructed, corresponding to three kinds of definitions of . Let S be a finite set of N training data , , , , …, , kernels and correspond respectively to the feature mappings and , and and to these points, the corresponding kernel matrices obtained by restricting x . the kernel function corresponding to feature mapping is and chosen as a feature mapping corresponding to a valid and basic kernel such as Polynomial Kernel, Gauss Kernel or ANOVA kernel, different from , then: is defined by:
Definition 3.3.1: The non-linear mapping of ,
:
,
where [v1,v2]=(v11,v12,…v1m,v21,v22,…,v2n)T stands for the concatenation of vector v1 and vector v2. , is as follows: In this case, the computation of ,
, , ,
,
, +
+
, ,
,
So, the kernel matrix + . Figure 2 shows the mapping Φ · in the form of neural network where is defined by Definition 3.3.1. Data is mapped into two different feature spaces in layer-2 by and respectively, and then the objective feature space in layer-3 is created by feature selection from the two feature spaces. is defined by:
Definition 3.3.2: The non-linear mapping of
,
:
for i=1,…,N1, j=1,…,N2, where N1 is the dimension of x, and N2 is that of , is: , the computation of For this choice of ,
.
, ∑ ∑
∑ 1
1
0
,
0
,
∑
2
1
3
3
114
F. Wang and H. Zhang
∗
So, the kernel matrix
.
Figure 3 shows the mapping Φ · in the form of neural network where is defined by Definition 3.3.2. Data is mapped into two different feature spaces in layer-2 by and respectively, then another higher feature space in lay-3 is created by taking into account the correlations between attributions in the two feature spaces in layer-2. Finally, the objective feature space in layer-4 is created by feature selection from the feature space in layer-3. Definition 3.3.3: Let defined as: : So computation of
be chosen as a dot kernel , the non-linear mapping of . , is: ,
is
, , , ,
So, the kernel matrix
.
Figure 4 shows the mapping Φ · in the form of neural network where is defined by Definition 3.3.3. Data mapped by and then by form the feature space in layer-3. The objective feature space in layer-4 is created by feature selection from the feature space in layer-3. For the three options, because and are chosen as valid kernels, it is easy to prove that , is a valid kernel function. ·
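For concreteness, a minimal sketch of the three composite Gram matrices suggested by Definitions 3.3.1-3.3.3 is given below: kernel sum (concatenated feature maps), element-wise product (products of feature pairs), and composition of a dot-product kernel on top of the first feature space. The kernel names, the Gaussian parameterization and the composition form are assumptions of this sketch, not the paper's exact notation; it also covers only the basic composite matrices, before the optimal projection Q is learned.

```python
import numpy as np

def gaussian_gram(X, sigma2=33.0):
    # Gaussian basic kernel; the width parameterization is an assumption
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma2)

def polynomial_gram(X, R=4.0, d=2):
    # polynomial basic kernel with parameters (R, d)
    return (X @ X.T + R) ** d

def composite_kernels(X):
    Kg = gaussian_gram(X)
    Kp = polynomial_gram(X)
    K_sum  = Kg + Kp              # concatenated feature maps   (Def. 3.3.1)
    K_prod = Kg * Kp              # products of feature pairs   (Def. 3.3.2)
    K_comp = (Kg + 4.0) ** 2      # polynomial dot-kernel applied on top of
                                  # the Gaussian feature space  (Def. 3.3.3)
    return K_sum, K_prod, K_comp
```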
Fig. 2. Feature mapping Φ(·) embeds the data from the input space into a feature space in the case of Definition 3.3.1

Fig. 3. Feature mapping Φ(·) embeds the data from the input space into a feature space in the case of Definition 3.3.2

Fig. 4. Feature mapping Φ(·) embeds the data from the input space into a feature space in the case of Definition 3.3.3
3.4 The Choice of Weighting Coefficients

Weights in the criterion function incur heavy penalties when the points Φ(x) are close to the centres of other classes. The weighting coefficients are chosen as follows:
, ∑ where
is the squared Euclidean distance of the point computed in [13] as follow 2
,
|
,
|
from the i-th class centre +
1 |
is the expected squared distance of a point centre, computed in [13] as follow 1 |
,
|
+
,
|
, ;
in the i-th class from its 1 |
|
,
;
and the function H(x) is the Heaviside function, which returns 1 if its argument is greater than 0 and zero otherwise.
4 The Algorithm of ODKC

The algorithm for constructing the ODKC kernel matrices can be summed up as follows:

Step 1 [Constructing matrices] Using the labels of the training data and the definitions and equations discussed above, construct the matrices below:
,
,
,
,
Step 2 [Computing generalized eigenvalues and generalized eigenvectors of Eq. (10)] Compute generalized eigenvalues and generalized eigenvectors of Eq. (10). Let s be the number of the negative eigenvalues and , … be the generalized eigenvectors of Eq. (10), ordered according to s negative eigenvalues: 0 ,Φ ,…,Φ andΦ Step 3 [Optimal feature mapping] Compute Φ Eq. (11) and Eq. (12). ODKC kernel matrix is computed by Eq (13).
using
Step 4 [classification using ODKC kernel matrix] Substitute ODKC kernel matrix for traditional kernel matrix in SVM, Fisher, MSE or KNN and perform these algorithms.
5 Experiments In this Section, we conduct two parts of experiments showing higher performance of ODKC kernels compared to basic kernel. In the first part of experiments, we compared the
Alignment Measure of ODKC kernels with that of the Gaussian basic kernel. In the second part, we report the classification accuracies of SVM, KFD, KMSE, and k-nearest-neighbour (KNN) when using ODKC kernels and when using the Gaussian kernel. In the two parts of the experiments, only the Gaussian kernel and the polynomial kernel are used to construct ODKC kernels. In the tables, the symbol P stands for the polynomial kernel with parameters (R, d), and g for the Gaussian kernel with its width parameter. Three ODKC kernels are compared, constructed by the first, second and third definitions of φ2 in Section 3.3; the one built by the second definition is written ODKC∗, and K_33 stands for the Gaussian basic kernel. The parameters of ODKC (R, d, and the Gaussian width) are fixed to 4, 2, and 33, and SVM with a basic kernel gets its parameters by cross-validation. Other necessary parameters are provided in parentheses in the tables. Five data sets are used in this paper, adopted from the UCI benchmark repository [14]. Basic information about these data sets is shown in Table 1.

5.1 Experiments on Alignment Measure

The "alignment" measure was introduced by Cristianini et al. [7] for measuring the adaptability of a kernel to the target data. It is defined as the normalized Frobenius inner product between the kernel matrix K and the target label matrix yy^T:

A(K, yy^T) = ⟨K, yy^T⟩_F / √(⟨K, K⟩_F ⟨yy^T, yy^T⟩_F) ,

where y denotes the label vector of the training data. The range of the alignment is [0, 1]; the larger its value, the closer the kernel is to the target. Respectively, 80%, 50%, and 20% of the data were chosen from each class randomly to compute the values of the alignment measure. Each experiment was repeated 15 times and the mean value is recorded in Tables 2, 3, 4 and 5. As can be seen from these tables, ODKC kernel matrices have larger alignment values than traditional ones, showing that ODKC kernel matrices have much more adaptability to the target data.

5.2 Experiments on Classification Accuracies

In this part, we run the classification algorithms SVM, FDA, MSE, and KNN on the Ionosphere, Wisconsin Breast cancer, Pima Indians diabetes, and Sonar data sets, using ODKC kernels compared with the Gaussian kernel. Firstly, the two-class problem is considered. Respectively, 80%, 50%, and 20% of the data were chosen randomly from each class for training and the rest for testing. Each experiment was repeated 15 times and the mean value is recorded in Tables 6, 7, 8 and 9. For the multi-class problem, Table 10 gives classification accuracies for the KNN algorithm (k = 1 and k = 3), using the Gaussian kernel and the
Table 1. Information about the UCI data sets used in this paper

Data set    Number of Instances  Number of Attributes  Number of Classes
Ionosphere  351                  34                    2
Sonar       208                  60                    2
Breast      699                  9                     2
Pima        768                  8                     2
Iris        150                  4                     3
Table 2. Alignment values in the case of Ionosphere dataset 80%training data ODKC ODKC ODKC
50%training data
0.233 0.180 0.310 0.089
∗
K 33
0.221 0.231 0.316 0.097
20% training data 0.233 0.061 0.331 0.070
Table 3. Alignment values in the case of Sonar dataset 80%training data ODKC ODKC ODKC
∗
K 33
0.124 0.053 0.230 0.035
50%training data 0.165 0.061 0.231 0.044
20%training data 0.267 0.054 0.227 0.049
Table 4. Alignment values in the case of Wisconsin Breast cancer dataset ODKC ODKC ODKC ∗ K 33
80% training data 0.192 0.191 0.155 0.100
50% training data 0.198 0.1478 0.153 0.100
20% training data 0.294 0.299 0.302 0.043
Table 5. Alignment values in the case of Wisconsin Pima Indians dataset ODKC ODKC ODKC ∗ K 33
80% training data 0.095 0.094 0.055 0.040
50% training data 0.092 0.078 0.053 0.051
20%training data 0.020 0.099 0.082 0.097
ODKC kernel on the Iris data set. Respectively, 80% and 60% of the data were chosen from each class randomly for training and the rest for testing. Each experiment was repeated 15 times and the mean value is recorded in Table 10. In these tables, the symbols '1' and '3' stand for the 1-nearest-neighbour and 3-nearest-neighbour algorithms, respectively. SVM + ODKC stands for the linear SVM algorithm with an ODKC kernel, FDA + K stands for the FDA algorithm with the Gaussian kernel, and so on. The results of the experiments show the effectiveness of our method.
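As a concrete illustration of how the alignment values in Tables 2-5 can be obtained, the following short sketch computes the kernel-target alignment for a given Gram matrix, assuming a two-class problem with labels encoded as +1/-1; the Gram matrix itself comes from whichever kernel is being evaluated.

```python
import numpy as np

def alignment(K, y):
    """Kernel-target alignment A(K, yy^T): the normalized Frobenius inner
    product between a Gram matrix K and the label matrix yy^T."""
    Y = np.outer(y, y)
    return (K * Y).sum() / np.sqrt((K * K).sum() * (Y * Y).sum())
```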
Table 6. Comparison of classification accuracies (%) for training set and test set for various kernels in the case of the Ionosphere data 80% training data Training Set Test Set SVM + ODKC SVM + ODKC ∗ SVM + ODKC SVM + K FDA + ODKC FDA + ODKC ∗ FDA + ODKC FDA + K 33 MSE + ODKC MSE + ODKC ∗ MSE + ODKC MSE + K 33
95.80 97.43 96.22 94.81 98.30 99.50 99.20 99.29 98.92 97.50 97.20 97.14
95.60 94.89 94.72 93.52 94.95 94.54 94.36 92.95 94.36 93.41 93.21 92.95
50% training data Training Set Test Set 94.32 96.22 95.67 92.60 99.20 99.50 98.20 99.50 96.57 97.24 97.15 96.57
92.68 89.95 89.95 90.10 89.77 90.10 92.61 88.07 92.34 92.40 90.32 91.47
20% training data Training Set Test Set 98.20 98.54 97.52 99.12 98.56 99.12 99.20 99.20 98.02 97.77 98.67 98.02
83.35 86.21 85.70 70.81 85.65 79.81 84.94 83.98 77.82 79.02 80.56 80.21
Table 7. Comparison of classification accuracies (%) for training set and test set for various kernels in the case of the Sonar data 80%training data TrainingSet Test Set SVM + ODKC
100.00
SVM + ODKC
∗
SVM + ODKC
100.00
SVM + K
99.00
FDA + ODKC FDA + ODKC
100.00 ∗
FDA + ODKC
100.00 99.85
FDA + K 33
100.00
MSE + ODKC MSE + ODKC
100.00
100.00 ∗
MSE + ODKC MSE + K 33
100.00 100.00 100.00
96.6 5 96.7 7 96.7 0 95.5 0 95.8 7 96.0 0 95.0 5 92.9 0 97.0 0 97.3 0 97.0 0 90.0 1
50%training data TrainingSet Test Set
20% training data TrainingSet Test Set
99.00
89.28
100.00
82.28
99.00
91.17
100.00
81.77
99.00
91.11
100.00
80.96
99.00
85.65
100.00
80.87
100.00
93.03
100.00
83.00
100.00
94.00
100.00
84.65
98.56
94.73
99.12
84.33
100.00
88.07
100.00
83.98
100.00
97.00
100.00
87.88
100.00
96.20
100.00
89.10
100.00
96.00
100.00
89.07
100.00
92.32
100.00
88.56
5.3 Analysis of Experimental Results Compared with traditional methods, ODKC kernel is composed of two different basic kernels, which can take the advantages of various kernels to better adapt to different target datasets. As can be seen from experiments, the performance of classification algorithm with ODKC kernel is better than those with basic kernel. Speaking specifically, ODKC can provide significant improvement of the classification performance in the case of complicated datasets such as Ionosphere and Sonar datasets, and a limited
120
F. Wang and H. Zhang
improvement in the case of simple datasets like Wisconsin Breast cancer, Pima Indians diabetes and Iris, compared with traditional method. Table 8. Comparison of classification accuracies (%) for training set and test set for various kernels in the case of Wisconsin Breast cancer data 80%training data Training Set Test Set SVM + ODKC SVM + ODKC ∗ SVM + ODKC SVM + K FDA + ODKC SVM + ODKC ∗ SVM + ODKC FDA + K 33 MSE + ODKC MSE + ODKC ∗ MSE + ODKC MSE + K 33
96.87 96.53 95.55 96.47 99.89 99.53 99.12 98.58 99.87 99.84 99.84 98.58
50%training data Training Set Test Set
95.77 94.43 95.63 93.57 97.02 95.46 96.91 96.77 97.56 97.72 97.88 96.70
95.76 95.44 94.34 95.56 97.57 96.23 98.34 97.71 99.45 98.54 99.71 97.42
20% training data Training Set Test Set
91.55 92.66 93.62 91.91 95.77 94.66 96.52 95.98 98.43 96.62 97.42 95.98
97.22 95.55 95.25 96.53 99.02 98.15 99.53 100.00 98.23 96.50 98.58 98.58
85.70 87.65 88.67 85.40 87.77 89.56 88.78 86.07 96.56 92.54 95.53 86.25
Table 9. Comparison of classification accuracies (%) for training set and test set for test data in the case of Wisconsin Pima Indians data 80%training data Training Set Test Set SVM + ODKC SVM + ODKC ∗ SVM + ODKC SVM + K FDA + ODKC FDA + ODKC ∗ FDA + ODKC FDA + K 33 MSE + ODKC MSE + ODKC ∗ MSE + ODKC MSE + K 33
85.45 87.56 86.12 85.62 86.46 85.45 88.76 83.16 90.32 88.54 89.66 89.43
50%training data Training Set Test Set
84.63 81.75 81.22 78.15 81.65 84.34 82.13 80.04 83.65 84.76 83.12 80.60
84.32 83.64 84.52 82.68 89.74 85.53 89.51 83.85 88.33 87.75 89.73 83.54
20%training data TrainingSet Test Set
76.88 77.87 77.60 76.82 77.34 77.54 77.41 76.56 79.45 82.33 79.14 78.64
83.57 84.27 83.66 84.67 91.03 90.45 90.02 83.65 85.45 86.65 88.11 83.58
71.47 73.54 70.50 72.64 77.76 76.65 75.12 74.95 77.07 76.66 75.04 76.50
Table 10. Comparison of classification accuracies (%) for KNN algorithm for various kernels in the case of Iris data in multi-class problem Testing Set (%) 20% 40%
1
96.00 94.65
∗
95.65 93.54
1
1
96.00 93.77
3
+
96.12 95.13
3
∗
94.55 96.45
3
96.12 95.13
KNN 1
95.12 93.69
KNN3
95.67 94.53
Comparing each form of kernel ODKC defined in Section 3.3, algorithm using and for classification kernel ∗ are better than that with kernel accuracies on Ionosphere and Sonar. The reason is for high-dimension data sets like Ionosphere and Sonar, there is lots of correlation information between attributions useful for classification, where kernel ∗ uses the information well. However, cannot perform better than and on some kernel ∗
low-dimension data sets like Wisconsin Breast cancer and Pima Indians diabetes, which contain little correlation information. As for the composite kernel with the deeper architecture, it can learn a more complex decision function, while its good performance depends on good linear separability of the data in the layer-2 feature space, as is the case for the Iris data set.
6 Conclusion

In this paper, three kinds of composite kernels are constructed under a unified framework by different definitions of φ2. We also compared classification algorithms when using a traditional kernel and these three kinds of ODKC kernels on five datasets. Finally, we gave an analysis of the experimental results. Only double-kernel combination is considered in this paper; actually, it is easy to extend the double-kernel case to multi-kernel combinations according to Section 3.2. So another line of future work is to analyze and study the advantages and disadvantages of optimal multi-kernel combinations.

Acknowledgments. The work on this paper was supported in part by the National Natural Science Foundation of China (NSFC) under grant 60775011.
References 1. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines. Cambridge Univ. Press, Cambridge (2000) 2. Burges, C.J.C., Scholkopf, B.: Improving the accuracy and speed of support vector learning machines. In: Mozer, M., Jordan, M., Pestsche, T. (eds.) Advance in neural information processing systems, vol. 9, pp. 375–381. MIT Press, Cambridge (1997) 3. Baudat, G., Anousr, F.: Generalized discriminant analysis using a kernel approach. Neural Computation 12(10), 2385–2404 (2000) 4. Ruiz, A., Lopez-de Teruel, P.E.: Nonlinear kernel-based statistical pattern analysis. IEEE Trans. Neural Network 12(1), 16–32 (2001) 5. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the Kernel Matrix with Semidefinite Programming. J. Machine Learning Research 5, 27–72 (2004) 6. Bach, F., Lanckriet, G.R.G., Jordan, M.I.: Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In: Proc. 21st Int’l Conf. Machine Learning (2004) 7. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., Kandola, J.: On kernel-target alignment. In: Advance in neural information processing systems. MIT Press, Cambridge (2001) 8. Nguye, C.H., Ho, T.B.: An efficient kernel matrix evaluation measure. Pattern Recognition 41(11), 3366–3372 (2008) 9. Xiong, H., Swamy, M.N.S.: Optimizing the kernel in the empirical feature space. IEEE transactions on neural networks 16(2), 460–474 (2005) 10. Chen, B., Liu, H., Bao, Z.: Optimizing the data-dependent kernel under unified kernel optimization framework. Pattern recognition 41(6), 2107–2119 (2007) 11. Bottou, L., Chapelle, O., DeCoste, D., Weston, J.: Large-Scale Kernel Machines. MIT Press, Cambridge (2007)
12. Lee, W.-J., Verzakov, S., Duin, R.P.W.: Kernel Combination Versus Classifier Combination. In: Haindl, M., Kittler, J., Roli, F. (eds.) MCS 2007. LNCS, vol. 4472, pp. 22–31. Springer, Heidelberg (2007) 13. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge University Press, Cambridge (2004) 14. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases (1998), http://www.ics.uci.edu/mlearn
Efficient AdaBoost Region Classification M. Moed and E.N. Smirnov Department of Knowledge Engineering, Maastricht University, The Netherlands
[email protected],
[email protected]
Abstract. The task of region classification is to construct class regions containing the correct classes of the objects being classified with an error probability ε ∈ [0, 1]. To turn a point classifier into a region classifier, the conformal framework is employed [11,14]. However, to apply the framework we need to design a non-conformity function. This function has to estimate the instance’s non-conformity for the point classifier used. This paper introduces a new non-conformity function for AdaBoost. The function has two main advantages over the only existing non-conformity function for AdaBoost. First, it reduces the time complexity of computing class regions with a factor equal to the size of the training data. Second, it results in statistically better class regions.
1 Introduction

Most of the research in data mining focuses on point classifiers: classifiers capable of estimating the correct class of an object being classified [2,5]. However, in critical applications with high misclassification costs, region classifiers are needed. We define region classifiers as those classifiers that are capable of providing a region of classes that contains the correct class of the object being classified with a probability of error ε ∈ [0, 1]. Point classifiers can be turned into region classifiers using the conformal framework [11,14]. This framework is valid if the data are drawn from the same unknown exchangeability distribution and if the learning process is online. Given an object space X, a class set Y, training data D over X × Y, and an object x ∈ X to be classified, the framework first computes for each class y ∈ Y the non-conformity scores of the instances in D and the labeled instance (x, y). The non-conformity score for an instance estimates how unusual this instance is in D ∪ {(x, y)} relative to the other instances in D ∪ {(x, y)}. The p-value py of class y for the object x is computed as the proportion of instances in D ∪ {(x, y)} of which the non-conformity scores are greater than or equal to that of the instance (x, y). Hence, the class y is added to the final class region for the object x if py ≥ ε. The main problem in applying the conformal framework for a point classifier is to design a non-conformity function. This non-conformity function has to estimate the instance's non-conformity scores for that classifier. So far, non-conformity functions have been proposed for support vector machines, k-nearest neighbor classifiers, linear-discriminant classifiers, the naive Bayes classifier, boosting classifiers, and kernel perceptrons [12,13,14].
This paper focuses on the problem of efficiently applying the conformal framework for AdaBoost point classifiers [10]. We first consider the only existing non-conformity function for AdaBoost, proposed in [8,14], called the general non-conformity function. We show that if we use this function, the time complexity to construct one class region is very high: it is equal to O(|Y||D|Tab), where Tab is the time complexity of the AdaBoost algorithm. To reduce this complexity we propose a new non-conformity function for AdaBoost that is based on the weights of the training instances calculated by the AdaBoost algorithm. Since the weights indicate the classification "difficulty" of the instances, they are used as non-conformity values. Thus, our function returns for each training instance the weight of this instance. This results in a time complexity to construct one class region equal to O(|Y|Tab), i.e. we reduce the time complexity by a factor of |D|. In addition, we analyze the class regions based on our weights-based non-conformity function. We show that they are statistically better than the class regions based on the general non-conformity function if the AdaBoost point classifiers are neither underfitted nor overfitted. The remainder of the paper is organized as follows. Section 2 formalizes the tasks of point classification and region classification. The conformal framework and AdaBoost are introduced in Sections 3 and 4, respectively. Section 5 considers the general non-conformity function and the weights-based non-conformity function. The experiments are presented and discussed in Section 6. Finally, Section 7 concludes the paper.
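To make the weight-based idea concrete, the sketch below runs a minimal two-class AdaBoost loop (labels in {-1, +1}) with decision stumps and keeps track of the instance weights; the final weights can then serve as non-conformity scores for the training instances. This is an illustrative sketch only, not the authors' exact function, which is defined later in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_instance_weights(X, y, n_rounds=50):
    """Run AdaBoost with stumps and return the final instance weights:
    a higher weight means a harder, i.e. less conforming, instance."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)      # up-weight misclassified instances
        w /= w.sum()
    return w
```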
2 Point and Region Classification

Let X be an object space and Y a class set. The instance space Z over X and Y is defined as X × Y. The training data Dn is a bag z1, z2, ..., zn of n instances zi ∈ Z drawn from the same unknown probability distribution Q. Given training data Dn and an object xn+1 ∈ X to be classified, the point-classification task is to provide an estimate ŷ of the real class yr ∈ Y of the object xn+1. To provide such a class estimation for any object x ∈ X we identify a point classifier h(Dn, x) in a space H of point classifiers h (h : Z(∗) × X → 2^R).¹ We assume that h(Dn, x) outputs for x a posterior distribution of scores {sy}y∈Y over all the classes in Y. The class y with the highest posterior score sy is the estimated class ŷ for x. The region-classification task is different from the point-classification task. Formally, given a significance level ε ∈ [0, 1], training data Dn, and an object xn+1 ∈ X to be classified, the region-classification task is to find a class region Γε(Dn, xn+1) ⊆ Y that contains the correct class y ∈ Y of xn+1 with probability at least 1 − ε.² In order to do so, first a posterior distribution of probabilities {py}y∈Y over Y is generated by normalizing the scores in the posterior distribution {sy}y∈Y. Then, the class region Γε(Dn, x) ⊆ Y is defined as a set consisting of those classes y ∈ Y of which the posterior probabilities py are greater than or equal to ε. Since in general the posterior probabilities of point classifiers are not calibrated [15], this approach fails to produce correct class regions.

¹ Z(∗) denotes the set of all bags defined in Z.
² A naive approach to the region-classification task is to employ a point classifier h(Dn, x).
In the next section we briefly sketch a framework for constructing classifiers for the region-classification task. For short they will be called region classifiers.
3 The Conformal Framework

This section briefly sketches the conformal framework for constructing region classifiers [11,14]. Subsection 3.1 describes the framework formally. Then, Subsection 3.2 discusses options when the framework is being applied. Finally, Subsection 3.3 introduces metrics for evaluating region classifiers based on the conformal framework.

3.1 Formal Description

The conformal framework was proposed for turning point classifiers into region classifiers [11,14]. The framework is proven to be valid when the training data and objects to be classified are drawn from the same unknown distribution under the exchangeability assumption. The exchangeability assumption holds when different orderings of training instances are equally likely. Applying the conformal framework is a two-stage process. Given a point classifier h(Dn, x), we first construct a non-conformity function for h(Dn, x) capable of measuring how unusual an instance looks relative to the other instances in the data. Then, we apply the conformal algorithm that employs the non-conformity function to compute the class regions. Formally, the non-conformity function is a function of type A : Z(∗) × Z → R ∪ {∞}. Given a bag Dn ∈ Z(∗) of training instances and an instance z ∈ Z, it returns a value α in R ∪ {∞} indicating how unusual the instance z is with respect to the instances in Dn. In general, the function A returns different scores for the instance z depending on whether z is in the bag Dn (added prediction) or not (deleted prediction): if z ∈ Dn, then the score is lower; otherwise it is higher. Since there is no consensus on this issue [14], care must be taken when choosing between added or deleted prediction. The general non-conformity function was defined in [14] for any point classifier h(Dn, x). Given a bag Dn ∈ Z(∗) of training instances and an instance z equal to (x, yr), the function outputs the sum Σ_{y∈Y, y≠yr} sy, where sy is the score for class y ∈ Y produced by h(Dn, x). The conformal algorithm is presented in Figure 1. Given a significance level ε ∈ [0, 1], training data Dn of n instances, an object xn+1 ∈ X to be classified, and the non-conformity function A for a point classifier h(Dn, x), the algorithm constructs a class region Γε(Dn, xn+1) ⊆ Y for the object xn+1. To decide whether to include a class y ∈ Y in the class region Γε(Dn, xn+1), the object xn+1 and the class y are first combined into an instance zn+1. The instance zn+1 is added to the training data, resulting in a data set of n + 1 instances denoted by Dn+1. Then, the algorithm computes the non-conformity score αi for each instance zi ∈ Dn+1 using the non-conformity function A.³ The non-conformity scores are used for computing the p-value py of the class y for the object xn+1. More precisely, py is computed as the proportion of instances in the training data Dn+1 of which the non-conformity scores αi are greater than or equal to that of the instance zn+1. Once py is set, the algorithm includes the class y in the class region Γε(Dn, xn+1) if py ≥ ε.
³ We note that depending on the option chosen (added/deleted prediction), the instance zi is included in or excluded from the data used when the non-conformity function A is called.
Input:  Significance level ε,
        Training data Dn equal to the bag z1, z2, ..., zn,
        Object xn+1 to be classified,
        Non-conformity function A for a point classifier h(Dn, x).
Output: Class region Γε(Dn, xn+1).
 1: Γε(Dn, xn+1) := ∅;
 2: for each class y ∈ Y do
 3:   zn+1 := (xn+1, y);
 4:   Dn+1 := Dn ∪ zn+1;
 5:   for i := 1 to n + 1 do
 6:     if using deleted prediction then
 7:       αi := A(Dn+1 \ zi, zi);
 8:     else if using added prediction then
 9:       αi := A(Dn+1, zi);
10:     end if
11:   end for
12:   py := #{i = 1, ..., n + 1 | αi ≥ αn+1} / (n + 1);
13:   Include y in Γε(Dn, xn+1) if py > ε;
14: end for
15: Output Γε(Dn, xn+1).
Fig. 1. The conformal algorithm
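To make the procedure of Figure 1 concrete, the following Python sketch implements the same loop for a generic non-conformity function. It is our own rendering, not the authors' implementation: the names conformal_region and non_conformity are assumptions, and the deleted/added prediction switch follows lines 6–10 of the figure.

```python
def conformal_region(train, x_new, classes, non_conformity, epsilon, deleted=True):
    """Return the class region for x_new at significance level epsilon.

    train is a list of (x, y) instances; non_conformity(bag, z) returns a
    real-valued non-conformity score for instance z relative to the bag."""
    region = []
    for y in classes:
        z_new = (x_new, y)
        extended = train + [z_new]                    # the bag D_{n+1}
        alphas = []
        for z in extended:
            if deleted:                               # deleted prediction
                bag = [w for w in extended if w is not z]
            else:                                     # added prediction
                bag = extended
            alphas.append(non_conformity(bag, z))
        alpha_new = alphas[-1]
        # p-value: proportion of scores at least as large as that of z_new
        p_y = sum(1 for a in alphas if a >= alpha_new) / len(alphas)
        if p_y > epsilon:
            region.append(y)
    return region
```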
More precisely, py is computed as the proportion of instances in the training data Dn+1 of which the non-conformity scores αi are greater than or equal to that of the instance zn+1. Once py is set, the algorithm includes the class y in the class region Γε(Dn, xn+1) if py ≥ ε.
The conformal algorithm was originally designed for the online learning setting. This setting assumes an initially empty data set Dn. Then, for each integer n from 0 to +∞, we first construct the class region Γε(Dn, xn+1) for the new object xn+1 being classified, and then add the instance (xn+1, yr) to Dn, where yr is the correct class of xn+1. In this context we note that the conformal algorithm is proven to be valid [11,14] (i.e. it constructs for any object xn+1 a class region Γε(Dn, xn+1) ⊆ Y containing the correct class y ∈ Y for xn+1 with a probability of at least 1 − ε) if:
– the data are drawn from the same unknown distribution under the exchangeability assumption;
– the learning setting is online.

3.2 Possible Options

To apply the conformal framework we have to make a set of choices concerning the non-conformity function used and the learning setting.
The conformal algorithm outputs valid class regions for any real-valued function used as non-conformity function [11]. However, the class regions will be efficient (small) only if the function estimates well the difference of any instance with respect to the training data. In this context we note that the general non-conformity function is not always the
most efficient. Therefore, one of the main issues when applying the conformal framework is how to design a specific non-conformity function for the point classifier used.
As stated above, the conformal algorithm is valid when the learning setting is online [11]. However, it was shown experimentally in [12,13] that the algorithm produces valid class regions in the offline (batch) learning setting⁴. Thus, the conformal algorithm can be used either in the online setting or in the offline setting. In this paper we apply the algorithm in the offline setting only.

3.3 Evaluation Metrics

Any class region Γε(B, xn+1) is valid if it contains the correct class y ∈ Y of the object xn+1 ∈ X being classified with probability of at least 1 − ε. To evaluate experimentally the validity of the class regions provided by the conformal algorithm we introduce the error metric E. E is defined as the proportion of the class regions that do not contain the correct class⁵. Thus, in order to show experimentally that the conformal algorithm is valid we have to show that for any significance level ε ∈ [0, 1] the error E is less than or equal to ε.
Any class region Γε(B, xn+1) is efficient if it is non-empty and small. Thus, to evaluate experimentally the efficiency of the class regions provided by the conformal algorithm, we introduce three metrics: the percentage Pe of empty-class regions, the percentage Ps of single-class regions, and the percentage Pm of multiple-class regions.
The empty-class regions, single-class regions, and multiple-class regions can be characterized by their own errors. The percentage Pe of empty-class regions is essentially an error, since the correct classes are not in the class regions. The error Es on single-class regions is defined as the proportion of the invalid single-class regions among all class regions. The error Em on multiple-class regions is defined as the proportion of invalid multiple-class regions among all the class regions.
The errors Pe, Es, and Em are components of the error E. More precisely, it is easy to prove that E = Pe + Es + Em. The error E has its own upper bound E^u, representing the worst case when we are not able to pick the correct classes from valid multiple-class regions. In this case we err on all the multiple-class regions and, thus, E^u is defined as Pe + Es + Pm. We note that for any significance level ε ∈ [0, 1] there is no guarantee that E^u is less than or equal to ε unless Pm = 0.
⁴ The offline learning setting is the traditional setting in machine learning. In this setting the classified instances are not added to the training data.
⁵ We note that computing the evaluation metrics introduced in this subsection depends on the learning setting used (online/offline).
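The metrics of Subsection 3.3 follow directly from the produced class regions. A minimal sketch, assuming the regions and their correct classes are available as a list of pairs (the function name and data layout are our own, not the paper's):

```python
def region_metrics(results):
    """results: list of (region, true_class) pairs, region being a list of classes."""
    n = len(results)
    P_e = sum(1 for r, _ in results if len(r) == 0) / n           # empty regions (always errors)
    P_s = sum(1 for r, _ in results if len(r) == 1) / n           # single-class regions
    P_m = sum(1 for r, _ in results if len(r) > 1) / n            # multiple-class regions
    E_s = sum(1 for r, y in results if len(r) == 1 and y not in r) / n
    E_m = sum(1 for r, y in results if len(r) > 1 and y not in r) / n
    E = P_e + E_s + E_m           # overall error
    E_u = P_e + E_s + P_m         # upper bound: every multi-class region counted as an error
    return {"E": E, "Pe": P_e, "Ps": P_s, "Pm": P_m, "Eu": E_u}
```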
4 Boosting

Boosting is a general approach to constructing point classifiers from weak point classifiers (defined as just better-than-random classifiers) [10]. Boosting operates by sequentially training the weak point classifiers on re-weighted versions of the training data. The final point classifiers are constructed as weighted-majority voters of the resulting weak classifiers.
Input:  Training data Dn equal to the bag z1, z2, ..., zn,
        Type h(Dn, x) of weak point classifier,
        Number of iterations T.
Output: AdaBoost point classifier hT(x).
 1: for i := 1 to n do
 2:   wi := 1/n;
 3: end for
 4: for t := 1 to T do
 5:   Train weak classifier ht(Dn, x) on Dn using weights wi;
 6:   Calculate the error ε of ht(Dn, x) on Dn using weights wi;
 7:   if ε = 0 or ε ≥ 1/2 then
 8:     T := T − 1;
 9:     Abort loop
10:   end if
11:   βt := −log2(ε / (1 − ε));
12:   for i := 1 to n do
13:     if ht(xi) = yi then
14:       wi := wi ∗ ε / (1 − ε);
15:     end if
16:   end for
17:   Normalize the weights wi;
18: end for
19: Output AdaBoost classifier hT(Dn, x) equal to the βt-weighted sum of the classifiers ht(Dn, x).
Fig. 2. The AdaBoost algorithm
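A Python sketch of the weighting scheme of Figure 2 is given below. The weak-learner interface (a train_weak function returning an object with a predict method) is our assumption; the βt and weight-update formulas follow the figure, and the final instance weights are also returned because Subsection 5.2 reuses them.

```python
import math

def adaboost(train, train_weak, T, classes):
    """train: list of (x, y); train_weak(train, w) returns a classifier with a
    predict(x) method.  Returns (class scoring function, final instance weights)."""
    n = len(train)
    w = [1.0 / n] * n
    ensemble = []                                   # list of (beta_t, weak classifier)
    for _ in range(T):
        h = train_weak(train, w)
        err = sum(wi for wi, (x, y) in zip(w, train) if h.predict(x) != y)
        if err == 0 or err >= 0.5:                  # discard the classifier and stop
            break
        beta = -math.log2(err / (1.0 - err))
        ensemble.append((beta, h))
        for i, (x, y) in enumerate(train):          # shrink weights of correct instances
            if h.predict(x) == y:
                w[i] *= err / (1.0 - err)
        total = sum(w)
        w = [wi / total for wi in w]                # normalize
    def scores(x):
        return {c: sum(b for b, h in ensemble if h.predict(x) == c) for c in classes}
    return scores, w
```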
AdaBoost (Adaptive Boosting) is the first efficient boosting algorithm [4]. Due to its straightforward implementation and good generalization performance, it is one of the most widely adopted boosting algorithms to date.
The AdaBoost algorithm is given in Figure 2. The input is formed by: training data Dn with n instances, the type h(Dn, x) of a weak point classifier, and the iteration number T. The output consists of an AdaBoost point classifier hT(B, x) based on weak classifiers of type h(Dn, x).
The AdaBoost algorithm considers the training data Dn as weighted data. This means that each instance zi ∈ Dn is associated with a weight wi indicating the classification “difficulty” of the instance. Since no classification information is provided a priori, the weights wi are initialized uniformly.
The AdaBoost algorithm itself is iterative. For each iteration t ∈ 1 ... T, AdaBoost first trains a weak point classifier ht(Dn, x) on the weighted training data Dn. Then, the algorithm calculates the weighted error ε of ht(Dn, x). If ε = 0 or ε ≥ 1/2, the current point classifier ht(Dn, x) is viewed as bad. In this case the algorithm deletes ht(Dn, x) and does not perform any more iterations. If ε ≠ 0 and ε < 1/2, the algorithm computes the weight βt of the weak point classifier ht(Dn, x), equal to −log2(ε / (1 − ε)). Then, it re-computes the weights wi of the training instances so that the weights of the correctly classified instances are decreased by a factor of ε / (1 − ε). The final weights wi are normalized so that they indicate the classification
“difficulty” of the instances in the training data Dn for the weak point classifiers from the previous t iterations.
When all T weak point classifiers have been generated, the algorithm outputs the AdaBoost point classifier hT(Dn, x). It is equal to the sum of the weak point classifiers ht(Dn, x) weighted according to their weights βt. In other words: given an object x ∈ X to be classified, the classifier hT(Dn, x) outputs a score sy for each class y ∈ Y equal to Σ_{t=1}^{T} βt st(y), where st(y) is the score for object x and class y produced by the t-th weak point classifier with weight βt.
The final class scores of the classifier hT(Dn, x) can be normalized to produce a distribution of probabilities {py}y∈Y over all the classes in Y. However, these probabilities are not calibrated [7]; i.e. the AdaBoost point classifier alone cannot be used for region classification.
5 AdaBoost Non-conformity Functions

As stated in Section 4, the AdaBoost classifiers are point classifiers. To turn these classifiers into region classifiers we first need to construct a non-conformity function, and then apply the conformal algorithm. In this section we focus on non-conformity functions for the AdaBoost classifiers. We first analyze the non-conformity function for AdaBoost proposed in [14]. We show that the time complexity of the conformal algorithm based on this function is very high. To avoid this problem we then propose our new non-conformity function for AdaBoost. We show that the time complexity of the conformal algorithm based on our function is reduced by a factor equal to the number of training instances. Later, in the experimental section, we show that our non-conformity function results in better class regions.

5.1 The General Non-conformity Function

The first non-conformity function proposed for AdaBoost is the general non-conformity function [8,14]. Given a bag Dn ∈ Z^(∗) of n training instances and an instance z equal to (x, yr), the non-conformity function equals A(Dn, z) = Σ_{y∈Y, y≠yr} sy, where sy is the score for class y ∈ Y produced by the AdaBoost point classifier hT(B, x). In this context we point out one property of hT(B, x): if the instance z = (x, yr) is included in the training data Dn, then the score syr increases with the number of AdaBoost iterations while the scores sy for the remaining classes y ∈ Y \ {yr} decrease. This implies that the function tends to produce non-conformity values close to 0; i.e. it cannot distinguish well the non-conformity of the instances. Thus, the general non-conformity function is applied for AdaBoost only if the instance z = (x, yr) for which the non-conformity score is being calculated is deleted from the training data Dn. This means that when the conformal algorithm is used with the general non-conformity function in combination with AdaBoost, deleted prediction has to be used (see Figure 1). We note that in this case, computing the p-value py for one class y ∈ Y requires |Dn| runs of the AdaBoost algorithm. Thus, the time complexity for constructing one class region becomes O(|Y||Dn|Tab), where Tab is the time complexity of the AdaBoost algorithm.
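Using the hypothetical adaboost sketch given after Figure 2, the general non-conformity function can be written as follows. Note that every call retrains AdaBoost on the bag it is given, so the conformal algorithm with deleted prediction needs |Dn| AdaBoost runs per candidate class; the function and argument names are our own.

```python
def general_non_conformity(bag, z, train_weak, T, classes):
    """A(bag, z): sum of the AdaBoost class scores other than the true class of z.
    Each call retrains AdaBoost on the bag it receives, which is where the
    |Dn| factor in the O(|Y||Dn|Tab) complexity comes from."""
    x, y_r = z
    scores, _ = adaboost(bag, train_weak, T, classes)
    s = scores(x)
    return sum(s[y] for y in classes if y != y_r)
```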
5.2 Weights-Based Non-conformity Function

To reduce the time complexity for constructing one class region, we propose a new non-conformity function for AdaBoost. This new function is called the weights-based non-conformity function. As the name suggests, it is based on the weights of the training instances calculated by the AdaBoost algorithm. Since they indicate the classification “difficulty” of the instances, we interpret them as non-conformity values.
Formally, the weights-based non-conformity function is defined as follows: given a bag Dn of training instances and an instance z equal to (x, yr) such that z ∈ Dn, the function returns the weight wT of (x, yr) calculated by the AdaBoost algorithm after T iterations, i.e. A(Dn, z) = wT. We note that since the AdaBoost algorithm computes weights only for training instances, the instance z has to belong to the data set Dn. This implies that when using the weights-based non-conformity function, added prediction has to be used (see Figure 1). We note that in this case computing the p-value py for one class y ∈ Y requires only one run of the AdaBoost algorithm. Thus, the time complexity for constructing one class region becomes O(|Y|Tab), where Tab is the time complexity of the AdaBoost algorithm.
Comparing the time complexities of the conformal algorithm for the general non-conformity function and for the weights-based non-conformity function, we conclude that the latter is |Dn| times faster than the former. To complete our study, we investigate the validity and efficiency of the class regions based on these two functions in Section 6.
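A sketch of the weights-based alternative under the same assumed interfaces: one AdaBoost run on the extended bag yields the non-conformity scores (the final instance weights) of all its instances at once, so only |Y| runs are needed per object.

```python
def weights_based_alphas(extended_bag, train_weak, T, classes):
    """Non-conformity score of every instance in the bag: its final AdaBoost
    weight after T iterations (added prediction, one run per bag)."""
    _, final_weights = adaboost(extended_bag, train_weak, T, classes)
    return final_weights          # alphas[i] is the score of extended_bag[i]

def weights_based_region(train, x_new, train_weak, T, classes, epsilon):
    region = []
    for y in classes:                                  # |Y| AdaBoost runs in total
        extended = train + [(x_new, y)]
        alphas = weights_based_alphas(extended, train_weak, T, classes)
        p_y = sum(1 for a in alphas if a >= alphas[-1]) / len(alphas)
        if p_y > epsilon:
            region.append(y)
    return region
```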
6 Experiments and Discussion

This section presents our experiments with the conformal algorithm. The algorithm was tested with the general non-conformity function and with the weights-based non-conformity function for AdaBoost. The performance of the algorithm for these two functions is expressed in terms of the validity and efficiency of the class regions.
We carried out two series of experiments. In the first one we experimentally compared the performance of the conformal algorithm for the general non-conformity function and for the weights-based non-conformity function. In the second series we investigated the influence of the model complexity of the AdaBoost point classifiers on the class regions based on these two functions.

6.1 Setup

The experiments were performed with the conformal algorithm in combination with AdaBoost. As a base classifier, AdaBoost employs the decision stump classifier. The class regions of the conformal algorithm were evaluated using five metrics (defined in Subsection 3.3): the error E, the upper-bound error E^u, the percentage Pe of empty-class regions, the percentage Ps of single-class regions, and the percentage Pm of multiple-class regions. The method of evaluation was 10-fold cross-validation. To compare the metrics corresponding to the non-conformity functions investigated we used a paired t-test at significance level 0.05.
6.2 Non-conformity Functions Comparison

The conformal algorithm with the general and weights-based non-conformity functions was run on ten UCI data sets [1]. For each data set, the number T of AdaBoost iterations was set equal to the number Tmax of iterations for which the 10-fold cross-validation accuracy of the AdaBoost classifier is maximized. The results are given in Table 1 below. The table presents the validity and efficiency of the class regions of the conformal algorithm for the general and weights-based non-conformity functions. Both functions have been evaluated for probabilities of error ε = 0.1 and ε = 0.05.
Table 1 shows that the class regions computed by the conformal algorithm for both non-conformity functions are valid. This is due to the fact that the error E is always lower than or equal to the probability of error ε. In addition, we observe that the error E is higher for the weights-based non-conformity function. This is due to the fact that for this function (1) the percentage Ps of single-class regions is higher (8 times significantly) and (2) the percentage Pm of multiple-class regions is lower (8 times significantly). In other words, the weights-based non-conformity function has a higher error E because we obtain many more desired single-class regions than undesired multiple-class regions. This also explains why the upper-bound error E^u is always lower for the weights-based non-conformity function (8 times significantly).
From this we conclude that if the AdaBoost classifiers are neither underfitted nor overfitted⁶, the class regions based on the weights-based non-conformity function are statistically better than those of the general non-conformity function in terms of the percentage Ps of single-class regions, the percentage Pm of multiple-class regions, and the upper-bound error E^u.
6.3 AdaBoost Complexity Influence

The number T of AdaBoost iterations controls the complexity of the final point classifier hT(B, x). In this subsection we investigate the influence of this complexity on the class regions based on the general and weights-based non-conformity functions.
We provide the experimental results for one data set for three values of the iteration number T: 1, Tmax, and 10 × Tmax⁷. These values bound the whole spectrum of complexity of the AdaBoost classifiers. When T = 1 the AdaBoost classifier is underfitted; i.e. it is rather simple since it consists of only one decision stump classifier. When T = Tmax the AdaBoost classifier is optimized: the 10-fold cross-validation accuracy is maximized. When T = 10 × Tmax the AdaBoost classifier is overfitted; i.e. it is rather complex.
Table 2 and Figure 3 present the results for the iteration number T equal to 1, Tmax, and 10 × Tmax for the hepatitis training data. The observations and explanations are as follows:
⁶ In our experiments the number of AdaBoost iterations maximizes the 10-fold cross-validation accuracy of the AdaBoost classifier.
⁷ We performed the same experiment on the remaining 9 UCI data sets. Since the results are similar, due to the length restriction we consider in detail only the experiment for the hepatitis data set.
Table 1. Paired t-test performance statistics of the class regions of the conformal algorithm based on the weights-based non-conformity function (Weights) and the general non-conformity function (General). The symbol ‘*’ means that the result is statistically better.
                                ε = 0.1                               ε = 0.05
Data set       Function   E     Pe    Ps     Pm     E^u        E     Pe    Ps     Pm     E^u
breast-cancer  Weights    0.04  0.00  0.23   0.77   0.81       0.03  0.00  0.16   0.84   0.86
               General    0.06  0.00  0.23   0.77   0.83       0.04  0.00  0.18   0.82   0.86
colic          Weights    0.09  0.00  0.72*  0.28*  0.37*      0.05  0.00  0.51*  0.49*  0.54*
               General    0.02* 0.00  0.19   0.81   0.83       0.01* 0.00  0.14   0.86   0.86
heart-c        Weights    0.08  0.00  0.74*  0.26*  0.34*      0.05  0.00  0.59*  0.41*  0.46*
               General    0.01* 0.00  0.22   0.78   0.79       0.00* 0.00  0.16   0.84   0.84
heart-h        Weights    0.09  0.00  0.75*  0.25*  0.33*      0.05  0.00  0.68*  0.32*  0.37*
               General    0.00* 0.00  0.22   0.78   0.78       0.00* 0.00  0.17   0.83   0.83
heart-statlog  Weights    0.09  0.00  0.72*  0.28*  0.37*      0.04  0.00  0.56*  0.44*  0.49*
               General    0.01* 0.00  0.21   0.79   0.79       0.01* 0.00  0.16   0.84   0.85
hepatitis      Weights    0.09  0.00  0.76*  0.24*  0.33*      0.05  0.00  0.61*  0.39*  0.44*
               General    0.04  0.00  0.23   0.77   0.82       0.03  0.00  0.16   0.84   0.87
iris           Weights    0.04  0.02  0.89*  0.09*  0.11*      0.01  0.00  0.87*  0.13*  0.14*
               General    0.00  0.00  0.31   0.69   0.69       0.00  0.00  0.31   0.69   0.69
labor          Weights    0.03  0.03  0.76*  0.21*  0.21*      0.02  0.02  0.69*  0.30*  0.30*
               General    0.00  0.00  0.32   0.68   0.68       0.00  0.00  0.24   0.76   0.76
lymph          Weights    0.04  0.00  0.07   0.93   0.94       0.02  0.00  0.03   0.97   0.98
               General    0.07  0.00  0.23   0.77   0.81       0.05  0.00  0.21*  0.79*  0.83*
sonar          Weights    0.08  0.02  0.70*  0.27*  0.33*      0.05  0.01  0.62*  0.37*  0.41*
               General    0.00* 0.00  0.23   0.77   0.77       0.00  0.00  0.18   0.82   0.82
Table 2. The hepatitis data set: paired t-test performance statistics of the class regions of the conformal algorithm based on the weights-based non-conformity function (Weights) and the general non-conformity function (General) for the number T of AdaBoost iterations equal to 1, Tmax , and 10 × Tmax . The symbol ‘*’ means that the result is statistically better.
                                ε = 0.1                               ε = 0.05
               Function   E     Pe    Ps     Pm     E^u        E     Pe    Ps     Pm     E^u
T = 1          Weights    0.00  0.00  0.01   0.99   0.99       0.00  0.00  0.01   0.99   0.99
               General    0.08  0.00  0.60*  0.40*  0.48*      0.05  0.00  0.48*  0.52*  0.57*
T = Tmax       Weights    0.09  0.00  0.76*  0.24*  0.33*      0.05  0.00  0.61*  0.39*  0.44*
               General    0.04  0.00  0.23   0.77   0.82       0.03  0.00  0.16   0.84   0.87
T = 10 × Tmax  Weights    0.06  0.00  0.42   0.58   0.64       0.03  0.00  0.15   0.85   0.88
               General    0.10  0.00  0.76*  0.24*  0.33*      0.06  0.00  0.54*  0.46*  0.51*
(1) when T = 1 the error E is lower for the weights-based non-conformity function. This is due to the fact that: (1) the percentage Ps of single-class regions is lower and (2) the percentage Pm of multiple-class regions is higher. Thus, the upper-bound
error E^u is higher for the weights-based non-conformity function. We explain these results with the AdaBoost weights. When T = 1 the instance weights take on only two possible values: one value for correctly classified instances and one value for incorrectly classified instances. Thus, the weights-based non-conformity function returns implausible non-conformity values. This contrasts with the general non-conformity function which, for T = 1, employs the class scores of the base classifier. Since the scores of the base classifier used (decision stump) are considered reasonable [6], the general non-conformity function outputs better non-conformity values.
(2) when T = Tmax the error E is higher for the weights-based non-conformity function. This is due to the fact that: (1) the percentage Ps of single-class regions is higher and (2) the percentage Pm of multiple-class regions is lower. Thus, the upper-bound error E^u is lower for the weights-based non-conformity function. We explain these good results with the AdaBoost weights. When the AdaBoost point classifier is neither underfitted nor overfitted, the weights plausibly represent the classification difficulty of the training instances [3], which we consider as non-conformity values. This contrasts with the general non-conformity function which, for T = Tmax, employs the class scores of the AdaBoost classifier. Since the class scores of the AdaBoost point classifier are known to be uncalibrated [7], this results in relatively worse non-conformity values.
(3) when T = 10 × Tmax the error E is lower for the weights-based non-conformity function. This is due to the fact that: (1) the percentage Ps of single-class regions is lower and (2) the percentage Pm of multiple-class regions is higher. Thus, the upper-bound error E^u is higher for the weights-based non-conformity function. These results are also due to the AdaBoost algorithm. When T = 10 × Tmax the algorithm concentrates too much on training instances that were consistently classified incorrectly. As a result these instances receive higher weights while the remaining instances receive lower weights. When the iteration number T becomes very high, most of the training instances receive a weight of 0 (this is demonstrated experimentally for the hepatitis data set in Figure 4, presenting the weight distributions for T = 10 and T = 10.000). Thus, the weights-based non-conformity function returns implausible non-conformity values. This contrasts with the general non-conformity function which, for a high number of iterations T, employs the class scores of the AdaBoost point classifier. These class scores result in better non-conformity values.
From (1), (2), and (3) we conclude again that the class regions based on the weights-based non-conformity function are statistically better than those of the general non-conformity function when the accuracy of the AdaBoost point classifier used is maximized; i.e. the classifier is neither underfitted nor overfitted. In this context we note that when the AdaBoost point classifier is either underfitted or overfitted, the general non-conformity function results in better class regions, but they are still worse than those of the weights-based non-conformity function when the AdaBoost point classifier has maximum accuracy.
[Figure 3 (caption below): six performance graphs plotting the measures Pm, E, Ps and Pe against the significance level ε, one panel per combination of T ∈ {1, Tmax, 10 × Tmax} (rows) and the general (left) or weights-based (right) non-conformity function; for the hepatitis data the bottom-right panel is labelled T = 200 (weights-based).]
Fig. 3. Performance graphs of the conformal algorithm based on the general non-conformity function and the weights-based non-conformity function on the hepatitis data set. Note that the diagonal line (0.0,0.0)-(1.0,1.0) is the calibration line. Ideally, the error E should be smaller than or equal to this calibration line for all ε ∈ [0, 1].
[Figure 4 (caption below): "Instance weights over several iterations" — weight (y-axis, 0 to 0.08) plotted against instance number (x-axis, 0 to 140) for 10 and 10.000 iterations.]
Fig. 4. Weight distribution for a single run of the AdaBoost algorithm on the hepatitis data set for 10 and 10.000 iterations. Note that the weights are sorted.
7 Conclusions

This paper introduced the weights-based non-conformity function for AdaBoost. This function reduces the time complexity of computing class regions by a factor equal to the size of the training data compared with the general non-conformity function. Using the weights-based non-conformity function results in statistically better class regions than those produced by the general non-conformity function if the AdaBoost classifiers employed are neither underfitted nor overfitted. Thus, we conclude that the weights-based non-conformity function allows the construction of better AdaBoost region classifiers that can be trained on larger data sets.
To increase its practical use, our future research will concentrate on the problem of adapting the weights-based non-conformity function so that the conformal framework for AdaBoost can be applied to strong classifiers. The key idea is to employ a soft-margin version of AdaBoost [9] that enables control of complexity by means of a regularization parameter C. In this way, our adapted function will again return weights, but this time controlled by C. Thus, we can employ strong classifiers for the AdaBoost algorithm and we can overcome the weight skewness that can be observed in Figure 4.
References
[1] Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
[2] Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
[3] Caprile, B., Furlanello, C., Merler, S.: Highlighting hard patterns via adaboost weights evolution. In: Roli, F., Kittler, J. (eds.) MCS 2002. LNCS (LNAI), vol. 2364, pp. 72–80. Springer, Heidelberg (2002)
[4] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996)
[5] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2002)
[6] Holte, R.C.: Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–91 (1993)
[7] Niculescu-Mizil, A., Caruana, R.: Obtaining calibrated probabilities from boosting. In: Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, pp. 413–418 (2005)
[8] Proedrou, K.: Rigorous Measures of Confidence for Pattern Recognition and Regression. PhD thesis, Royal Holloway College, University of London, UK (2003)
[9] Rätsch, G., Onoda, T., Müller, K.-R.: Soft margins for adaboost. Machine Learning 42(3), 287–320 (2001)
[10] Schapire, R.: The boosting approach to machine learning: An overview. In: MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA (March 2001)
[11] Shafer, G., Vovk, V.: A tutorial on conformal prediction. Journal of Machine Learning Research 9, 371–421 (2008)
[12] Smirnov, E.N., Vanderlooy, S., Sprinkhuizen-Kuyper, I.G.: Meta-typicalness approach to reliable classification. In: Proceedings of the 17th European Conference on Artificial Intelligence, Riva del Garda, Italy, August 28 - September 1, pp. 810–811. IOS Press, Amsterdam (2006)
[13] Vanderlooy, S., van der Maaten, L., Sprinkhuizen-Kuyper, I.: Off-line learning with transductive confidence machines: an empirical evaluation. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 310–323. Springer, Heidelberg (2007)
[14] Vovk, V., Gammerman, A., Shafer, G.: Algorithmic learning in a random world. Springer, Heidelberg (2005)
[15] Zadrozny, B., Elkan, C.: Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th International Conference on Machine Learning, pp. 609–616. Morgan Kaufmann, San Francisco (2001)
A Linear Classification Method in a Very High Dimensional Space Using Distributed Representation Takao Kobayashi and Ikuko Shimizu Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, 2-24-16 Nakacho, Koganei-shi, 184-8588 Japan {tkobaya, ikuko}@cc.tuat.ac.jp
Abstract. We have proposed a fast learning and classification method using a distributed representation of vectors. In this paper, we first show that our method provides faster and better performance than the 1-NN method by introducing a definition of a similarity related to the LSH scheme. Next we compare our method with Naive Bayes with respect to the number of dimensions of the features. While Naive Bayes requires a considerably high dimensional feature space, our method achieves higher performance even where the number of dimensions of its feature space is much smaller than that of Naive Bayes. We explain our method by formalizing it as a linear classifier in a very high dimensional space and show that it is a special case of the Naive Bayes model. Experimental results show that our method provides superior classification rates with small time complexity of learning and classification and is applicable to large data sets. Keywords: Locality sensitive hashing, 1-NN method, Naïve Bayes, Bagging.
1 Introduction

In recent years, huge amounts of data have been generated and stored in various areas of society. These data sets include transaction data, observation data, web pages and various contents such as texts, images, sounds and videos. Against this background, attention has turned to technologies for learning and classification on large data sets. Some typical methods take an enormous amount of time for learning, so that they fail to finish in practical time if a large number of samples are involved. Also, the classification problem for a large class set is not solved sufficiently. For example, generic object recognition [1] and handwritten Chinese character recognition involve more than several thousand classes. To classify patterns into a large number of classes we also have to manage huge amounts of samples in most cases. On the other hand, the memory size deployed in a computer is increasing every year, so new algorithms using large memory space become feasible. We have proposed and evaluated a learning and classification method using a large memory size [2]. The key idea of the method is that a feature vector is represented by a set of
vectors and a similarity function for classification is defined using it. We call this "distributed representation" and also call the classification method "DR classification (DRC)." In this paper, we present the strengths of DRC by clarifying its relationships to other well-known methods. Chapter 2 introduces a similarity by distributed representation and considers its relation to the locality sensitive hashing scheme, including a comparison with the 1-NN method. Chapter 3 shows that DRC is a special variant of Naïve Bayes classification, and also shows that Bagging applied to DR classifiers is almost the same as a Naïve Bayes classifier by DR. Chapter 4 discusses the meaning of linear classification in a very high dimensional space. Chapter 5 draws a conclusion.
2 Classification by Distributed Representation

2.1 Definition of a Similarity by Distributed Representation

Charikar introduced a definition of the hash family F as a locality sensitive hashing (LSH) scheme as follows [3]:
sim(x, y) = Pr_{H∈F} ( H(x) = H(y) )
(1)
where sim(x, y) is some similarity function between two points x and y. In other words, the locality sensitive hash family has the property that two points are nearer if the probability of collision of their hash codes is higher. Thus the distance between x and y can be estimated by calculating Σ_l δ(H_l(x), H_l(y)) using many H_l, where δ is the Kronecker delta.
On the other hand, we introduce a new definition in the reverse way. Namely, we prepare L hash functions H1, ..., HL in advance; then the similarity between two vectors is defined as follows:

sim(x, y) = Σ_{l=1}^{L} δ(H_l(x), H_l(y)) .          (2)
Therefore the distance between two vectors is self-evidently defined as
dist ( x, y ) = L − sim( x, y ) .
(3)
It is easily proved that this satisfies the axioms of a metric.
Next we describe concrete hash functions. One way to construct them is to divide the space into Voronoi cells using an appropriate point set. We give a redefinition of Eq. 3 with L discrete point sets Q1, ..., QL prepared in advance as follows:

sim_DR(x, y) = Σ_{l=1}^{L} |Q_l(x) ∩ Q_l(y)|          (4)
where we use the following notation: for a discrete point set Q, Q(x) is the set containing the nearest neighbour in Q of a vector x (i.e. Q(x) is a set having only one element). Thus a vector x is represented as (Q1(x), ..., QL(x)), and this representation can be used to measure a distance or to classify patterns. We call this form distributed representation (DR).
In this idea, we implicitly use mathematical properties of high dimensional spaces. Generally our approach is not good for low dimensions (e.g. fewer than 10). However, in the case of higher dimensions (e.g. more than 20), we can construct good algorithms by using proper data structures such as hash tables, although the required memory size tends to be large. Here we illustrate a property of high dimensional spaces with an instance of hash families. Let x and y be vectors in N dimensional spherical space (i.e. |x| = |y| = 1). For arbitrary N × N orthogonal matrices R1, R2, ..., the expected value of |{−1,+1}^N(R_l x) ∩ {−1,+1}^N(R_l y)| depends on the distance between x and y. Fig. 1 shows the relation between the distance and the expected value. As the number of dimensions becomes higher, the value becomes more locally sensitive. In this case, the number of elements |{−1,+1}^N| is 2^N, which becomes very large as N grows.
[Figure 1 (caption below): expected value (y-axis, 0.0–1.0) plotted against the distance in degrees (x-axis, 0–90) for Dimension = 5, 10, 20 and 40.]
Fig. 1. Relation between the distance between two vectors and the expected value of |Q(x) ∩ Q(y)|. In this case Q = {Rb | b ∈ {−1, +1}^N, R is an arbitrary orthogonal matrix} is adopted for the hash functions.
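As an illustration of the hash family used in Fig. 1, the sketch below (our own code) takes the nearest point of {−1,+1}^N to R_l x to be its coordinate-wise sign vector; generating the random orthogonal matrices by QR decomposition is our own implementation choice, not something prescribed by the paper.

```python
import numpy as np

def random_orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
    return q

def hash_family(n, L, seed=0):
    rng = np.random.default_rng(seed)
    return [random_orthogonal(n, rng) for _ in range(L)]

def Q(R, x):
    """Nearest point of {-1,+1}^N to Rx, i.e. the coordinate-wise sign vector."""
    return tuple(np.where(R @ x >= 0, 1, -1))

def sim(x, y, Rs):
    """Eq. (2): number of hash functions on which x and y collide."""
    return sum(Q(R, x) == Q(R, y) for R in Rs)
```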
2.2 Relationship between DR Classification and the 1-NN Method We have proposed a definition of a similarity between an unknown vector x and a sample set Y = {y(0), …, y(M-1)} belonging to a class, which is called DR classification (DRC), as follows [2]:
sim_DR(x, Y) = Σ_{l=1}^{L} | Q_l(x) ∩ ∪_{y^(j) ∈ Y} Q_l(y^(j)) | .          (5)
Thus classifying x into the most suitable class Ck* is formalized as Eq. 6, using sample sets Y1, ..., YK belonging to each class C1, ..., CK.
k* = arg max_k { sim_DR(x, Y_k) } .          (6)
Clearly,

∀k, sim_DR(x, Y_k) ≥ max_j sim_DR(x, y_k^(j)) .          (7)
The right hand side is exactly the 1-NN method. Therefore the performance of DRC may have some relation to that of the 1-NN method. In fact, we have confirmed that the classification rates of the DR classifier can exceed those of the 1-NN method under practical conditions on the numbers of dimensions and learning samples [4]. An experimental result is shown in the next section.

2.3 Experiment for Comparing Classification Rates

We present an experiment comparing DRC (Eq. 5) and the 1-NN method with the similarity of Eq. 4, employing the off-line handwritten numerical data set BIRDS [5], which includes more than 190,000 patterns. It has so many patterns that we can observe the relation between the number of learning samples and the classification rates. For this experiment we adopted the distributed representation described in [2], i.e. the feature space is Facto(N) = {x = (x1, ..., xN) | x1...xN is a permutation of the N integers 2i − (N − 1) (i = 0, ..., N − 1)}, and N − 1 sets are used as Q1, ..., Q_{N−1}, where Q_l = {a = (a1, ..., aN) | a_i ∈ {−1, +1}, Σ_i δ(a_i, +1) = l} (l = 1, ..., N − 1). Thus the element a such that Q_l(x) = {a} is determined as follows:
a = (a_i), x = (x_i), a_i = +1 if x_i ≥ N + 1 − 2l, and a_i = −1 otherwise (i = 1, ..., N).

For reference, the rates of the 1-NN method with the Euclidean metric in Facto(N) (sim_EUC) are also measured. The result is shown in Fig. 2. The rates of DRC are higher than those of 1-NN with sim_DR, while they are lower than those of 1-NN with sim_EUC. However, the 1-NN method has a critical drawback: its classification time becomes extremely large as the number of learning samples is increased, because no fast algorithm is known for classification with the 1-NN method. DRC, in contrast, has fast algorithms for both learning and classification. Regarding the time complexity of DRC, let M be the number of learning samples, N the number of dimensions of the feature space and K the number of classes; then there exists an algorithm using O(MN) learning time and O(NK) classification time, and the memory size of a classifier is lower than O(MN). In particular, the classification time of DRC is independent of the number of learning samples.
Therefore we can take a simple strategy: we only have to add sufficient learning samples in order to improve the classification rate.
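The following sketch (our own code and naming) implements the DR classifier of Eqs. (5) and (6) for the Facto(N) representation of this section: each class keeps, for every l, the set of codes Q_l(y) observed in its samples, so learning touches each sample once and classification time does not depend on the number of learning samples.

```python
from collections import defaultdict

def q_code(x, l):
    """Q_l(x) for x in Facto(N): +1 where x_i >= N + 1 - 2l, else -1."""
    n = len(x)
    return tuple(1 if xi >= n + 1 - 2 * l else -1 for xi in x)

class DRClassifier:
    def __init__(self, n_dims):
        self.L = n_dims - 1
        # tables[l][class] -> set of codes seen for that class (the union in Eq. 5)
        self.tables = [defaultdict(set) for _ in range(self.L)]

    def learn(self, samples):
        """samples: list of (x, label); one code per (instance, l) pair."""
        for x, label in samples:
            for l in range(1, self.L + 1):
                self.tables[l - 1][label].add(q_code(x, l))

    def classify(self, x, classes):
        """Return the class maximizing sim_DR; cost is independent of the sample count."""
        codes = [q_code(x, l) for l in range(1, self.L + 1)]
        sims = {c: sum(codes[l] in self.tables[l][c] for l in range(self.L))
                for c in classes}
        return max(sims, key=sims.get)
```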
[Figure 2 (caption below): two panels plotting classification rate (y-axis) against the number of learning samples (x-axis, 0–150000) for DRC, 1-NN (simDR) and 1-NN (Euclid), for Dimension = 24 (left, rates roughly 0.85–0.99) and Dimension = 35 (right, rates roughly 0.81–0.99).]
Fig. 2. Classification rates of DRC exceed those of 1-NN with simDR, although they are lower than those of 1-NN with the Euclidean metric at the same number of learning samples. The rates of all methods become gradually higher as the number of learning samples increases.
3 Relation between Naïve Bayes and DRC

In this chapter we introduce a new formulation of Naïve Bayes classification. Then we formalize both Naïve Bayes and DRC as high dimensional multi-class linear classifiers. A multi-class linear classifier is expressed as the dot products of an input vector with a weight vector corresponding to each class. The weight vectors are determined by the learning samples. We formalize the weight vectors of these classifiers in a unified style. We also show that Bagging of DRC becomes a linear classifier and is very similar to Naïve Bayes.

3.1 Orthodox Naïve Bayes Classifier
The conditional probability of a class Ck given the occurrence of an N dimensional vector x = (x1, ..., xN) can be written using Bayes' theorem:

Pr(Ck | x1, ..., xN) = Pr(x1, ..., xN | Ck) Pr(Ck) / Pr(x1, ..., xN) .          (8)
For simplicity, we assume that prior probabilities Pr(Ck) are the same for all k. Thus classification is formalized simply as
arg max_k { Pr(x1, ..., xN | Ck) } .          (9)
For the Naïve Bayes model, the assumption that each feature is conditionally independent is used. Then the probability is expressed as the product:
Pr(x1, ..., xN | Ck) = Π_{i=1}^{N} Pr(x_i | Ck) .          (10)
Each Pr(xi|Ck) can be estimated from a sample set. Finally, a Naïve Bayes classifier is formalized as Eq.11, where Laplace correction ε is added so as to avoid the case such that an estimated probability is zero.
arg max_k { Σ_{i=1}^{N} log( Pr(x_i | Ck) + ε ) } .          (11)
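For concreteness, a small sketch (ours, not the paper's) of the classifier of Eq. (11) for discrete attribute values, with the Laplace correction ε added inside the logarithm:

```python
import math
from collections import defaultdict, Counter

def train_counts(samples):
    """Estimate Pr(x_i | C_k) by relative frequencies from (x, label) samples."""
    counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    class_sizes = Counter()
    for x, label in samples:
        class_sizes[label] += 1
        for i, v in enumerate(x):
            counts[(label, i)][v] += 1
    return counts, class_sizes

def naive_bayes_classify(x, classes, counts, class_sizes, eps=1e-6):
    # Eq. (11): pick the class maximizing the sum of log(Pr(x_i | C_k) + eps)
    def log_score(k):
        return sum(math.log(counts[(k, i)][v] / class_sizes[k] + eps)
                   for i, v in enumerate(x))
    return max(classes, key=log_score)
```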
Here we try to apply the method of Eq. 11 to the classification of the vectors used in Section 2.3, where a feature vector x is in Facto(N) and the number of dimensions N ranges from 24 to 63. The experimental results are shown in Fig. 3. The classification rates do not reach 80%, so they are far from those of DRC. We will discuss good conditions for Naïve Bayes classification in Chapter 4.
[Figure 3 (caption below): classification rate (y-axis, roughly 0.68–0.80) plotted against the number of learning samples (x-axis, 0–150000) for dimension = 24, 35, 48 and 63.]
Fig. 3. Classification rates by the Naïve Bayes model are measured in the cases of dimensions 24, 35, 48 and 63. For this experiment the same data set and the same feature vectors as in Section 2.3 are used.
3.2 Naïve Bayes Using Distributed Representation
In this section we introduce a new formulation of Naïve Bayes classification by distributed representation. As shown in Eq. 5, a DR classifier uses L attributes (Q1(x), ..., QL(x)) derived from an input vector x. Here we assume the following equation:
Pr( (Q1(x), ..., QL(x)) | Ck ) = Π_{l=1}^{L} Pr( (l, Q_l(x)) | Ck ) .
Then a Naïve Bayes classifier is constructed as follows
arg max_k { Σ_{l=1}^{L} log( Pr( (l, Q_l(x)) | Ck ) + ε ) } .          (12)
Each attribute Q_l(x) (l = 1, ..., L) takes |Q_l| possible values. Therefore the expression inside the large brackets in Eq. 12 can be written as Eq. 13 with a bijective function F that maps (l, Q_l(x)) to {1, ..., Σ|Q_l|}. (See the Appendix.)
Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × log( Pr(u | Ck) + ε ) ) .          (13)
Each probability Pr(u|Ck) for all u and k is estimated by a learning sample set. Namely, it is calculated as

Pr(u | Ck) = ( Σ_{j=1}^{|Yk|} Σ_{l=1}^{L} δ( F(l, Q_l(y_k^(j))), u ) ) / |Yk| .
Note that |Yk| is a constant; then a similarity function between an input vector x and a learning sample set Yk can be defined as follows:

sim_NB(x, Yk) = Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × log( Σ_{j=1}^{|Yk|} Σ_{l=1}^{L} δ( F(l, Q_l(y_k^(j))), u ) + ε′ ) )          (14)
where the left side of the product, Σ_{l=1}^{L} δ(F(l, Q_l(x)), u), is a component of a Σ|Q_l| dimensional sparse binary vector determined by the input vector, and the right side, log( Σ_{j=1}^{|Yk|} Σ_{l=1}^{L} δ(F(l, Q_l(y_k^(j))), u) + ε′ ), is a component of a Σ|Q_l| dimensional real-valued vector determined by the learning samples. It can be regarded as a weight vector. The similarity sim_NB is obtained by calculating the dot product of these two vectors. In other words, this operation is linear classification in a Σ|Q_l| dimensional space. To construct a good classifier, |Q_l| must be big, because Q_l must have a high resolution in the feature space. So Σ|Q_l| necessarily becomes very big. For example, in the experiments in Sections 2.3 and 3.4, Σ|Q_l| is in the range of 2^24 to 2^48. DRC as given by Eq. 5 is also represented in the form of linear classification as
sim_DR(x, Yk) = Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × w(u, k) )          (15)
where w(u, k) = min{ Σ_{j=1}^{|Yk|} Σ_{l=1}^{L} δ( F(l, Q_l(y_k^(j))), u ), 1 }.
The difference between sim_NB (Eq. 14) and sim_DR (Eq. 15) is the value of the components of the weight vectors. The weight vectors of sim_DR are binary vectors. Thus sim_DR is
simpler than sim_NB to implement. An experiment measuring their performance is presented in Section 3.4.

3.3 Bagging of DR Classifiers
In this section we derive a form similar to Eq. 14 in a different way, as described in [7]. A form of Bagging [6] with T DR classifiers, each learned with m samples for each class Ck, is formalized as follows:
sim_Bagging(x, Yk) = (1/T) Σ_{t=1}^{T} sim_DR(x, Yk^(t))
                   = (1/T) Σ_{t=1}^{T} Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × w_t(u, k) )
                   = Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × (1/T) Σ_{t=1}^{T} w_t(u, k) )          (16)
where the t-th learning sample set belonging to class Ck is Yk^(t) (t = 1, ..., T), with |Yk^(t)| = m, and
w_t(u, k) = min{ Σ_{j=1}^{|Yk^(t)|} Σ_{l=1}^{L} δ( F(l, Q_l(y_k^(t,j))), u ), 1 } .
We assume that learning samples are available inexhaustibly and let T→∞, then
(1/T) Σ_{t=1}^{T} w_t(u, Ck) → E[w(u, k)] .          (17)
This is the average weight of classifiers learned with m randomly taken samples. Suppose p is the average value of w(u, k) learned with one sample; then the average value learned with m samples is 1 − (1 − p)^m. Therefore we can estimate E[w(u, k)] from the actual whole learning sample set Yk and derive the following equation:
sim_Bagging(x, Yk) = Σ_{u=1}^{Σ|Q_l|} ( Σ_{l=1}^{L} δ( F(l, Q_l(x)), u ) × ( 1 − ( 1 − (1/|Yk|) Σ_{j=1}^{|Yk|} Σ_{l=1}^{L} δ( F(l, Q_l(y_k^(j))), u ) )^m ) )          (18)
This equation shows that creating a Bagging classifier results in creating only one linear classifier with adjusted weights. The optimal value of m in Eq.18 can be determined by cross-validation. The weight vector of Eq.18 seems to be similar to that of Eq.14.
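The similarities of Eqs. (14), (15) and (18) therefore differ only in how the weight of a code u is derived from its count in the class's training samples. The sketch below (our own code; q_codes is an assumed function returning the codes Q_1(x), ..., Q_L(x) of an instance) uses a sparse dictionary in place of the Σ|Q_l|-dimensional weight vector:

```python
import math

def code_counts(samples_of_class, q_codes):
    """count[u] = number of (instance, l) pairs of this class hitting code u,
    where u = (l, Q_l(y)) plays the role of F(l, Q_l(y))."""
    counts = {}
    for y in samples_of_class:
        for l, code in enumerate(q_codes(y)):
            u = (l, code)
            counts[u] = counts.get(u, 0) + 1
    return counts

def weight(count, n_samples, scheme, m=10, eps=1e-6):
    if scheme == "DR":                  # Eq. (15): binary weight min{count, 1}
        return 1.0 if count > 0 else 0.0
    if scheme == "NB":                  # Eq. (14): logarithmic weight
        return math.log(count + eps)
    if scheme == "Bagging":             # Eq. (18): expected weight of m-sample classifiers
        return 1.0 - (1.0 - count / n_samples) ** m
    raise ValueError(scheme)

def similarity(x, counts, n_samples, q_codes, scheme):
    """Dot product of the sparse binary code vector of x with the class weights."""
    return sum(weight(counts.get((l, code), 0), n_samples, scheme)
               for l, code in enumerate(q_codes(x)))
```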
3.4 Experimental Results
We conduct an experiment to measure the classification rates of DRC (sim_DR; Eq. 5 or Eq. 15), Naïve Bayes by DR (sim_NB; Eq. 14) and Bagging of DR classifiers (sim_Bagging; Eq. 18). The same dataset as that in section 2.2 is used for this experiment. Fig. 4 shows the results. In the case of dimension 24, sim_NB and sim_Bagging become higher than sim_DR when the number of learning samples is increased, while in the cases of 35 and 48 there are only slight differences between sim_DR and the others. It is natural that sim_Bagging is higher than sim_DR, because sim_Bagging is an ensemble of sim_DR classifiers. It is also observed that the rates of sim_NB and sim_Bagging are almost the same. This result is acceptable because the forms of the weight expressions in Eq. 14 and Eq. 18 are similar.
[Figure 4 (caption below): three panels plotting classification rate (y-axis, roughly 0.90–0.98) against the number of learning samples (x-axis, 0–150000) for simDR, simNB and simBagging, for Dimension = 24, 35 and 48.]
Fig. 4. Comparison of classification rates of DRC (simDR), Naïve Bayes by DR (simNB) and Bagging of DR classifiers (simBagging). It is observed that all have strong correlations.
4 Discussion

Owing to the well-known phenomenon called the curse of dimensionality, classification becomes harder as the number of dimensions of a feature space becomes larger for a limited
number of learning samples, because the distribution of feature vectors becomes extremely sparse in the space. However, as we observed in Chapters 2 and 3, classification by distributed representation is a method in which the feature vector space is mapped to a very high dimensional space where linear classification is performed, and its performance is excellent.
Linear classification in a very high dimensional space reminds us of SVMs using kernels. An SVM finds a hyperplane that separates the learning sample sets of the classes by an iterative method. Its learning process is so time-consuming that learning from a large number of samples is not practical. Further, applying an SVM to a large class set is also impractical. On the other hand, a DRC learner creates weight vectors for each class which classify all learning samples directly. Therefore DRC can be applied to classification problems involving very large data sets.
Naïve Bayes classification by DR is similar to DRC because they differ only in their weights, and the experimental results show that their classification rates are close. DRC can therefore be regarded as a variant of Naïve Bayes classification. Further, Bagging of DR classifiers is very similar to Naïve Bayes by DR, so we can say conversely that they are variants of DRC.
Note that the conventional Naïve Bayes model is implemented as linear classification in the original feature vector space. In this case, in fact, a relatively high dimensional space (e.g. several hundred to several tens of thousands of dimensions) is used. For example, in the case of document classification, a Naïve Bayes model using feature vectors whose attributes are the frequencies of each word is applied. In [8], 38500 dimensional feature vectors are used for 20,000 learning samples and 20 classes. The reason why Naïve Bayes classification works well seems to be that a high dimensional space is used.
5 Conclusion

We formalized the learning and classification method using distributed representation and considered its relation to the LSH scheme, 1-NN, Naïve Bayes and Bagging. We showed that the distributed representation model is a variant of the 1-NN method using a similarity defined by hash functions, and confirmed that its classification rates are higher than those of the 1-NN method in an experiment using a real dataset. We also formalized DRC as a linear classifier in a very high dimensional space and considered its relations to Naïve Bayes and Bagging. These results will accelerate further progress concerning these methods. The DR learner and classifier work so fast that we can expect to solve various classification problems involving huge data sets.
References
1. Yanai, K.: Current state and future directions on generic object recognition. IPSJ Transaction on Computer Vision and Image Media 48(SIG) (CVIM19), 1–24 (2007) (in Japanese)
2. Kobayashi, T., Nakagawa, M.: A pattern classification method of linear-time learning and constant-time classification. Transactions of IEICE J89-A(11), 981–992 (2006) (in Japanese)
3. Charikar, M.S.: Similarity Estimation Techniques from Rounding Algorithms. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002 (2002)
4. Kobayashi, T., Shimizu, I., Nakagawa, M.: Theoretical studies of the Power Space Similarity method: a fast learning and classification algorithm. In: Proceedings of the 3rd Korea-Japan Joint Workshop on Pattern Recognition, November 2008, pp. 29–30 (2008)
5. http://www.geocities.jp/onex_lab/birdsdb/birdsdb.html
6. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
7. Kobayashi, T., Nakagawa, M.: Pattern recognition by distributed coding: test and analysis of the power space similarity method. In: Proc. 9th IWFHR, October 2004, pp. 389–394 (2004)
8. Joachims, T.: A probabilistic analysis of the rocchio algorithm with TD.IDF for text categorization. Technical Report CMU-CS-96-118, Carnegie-Mellon Institute (1996)
Appendix: Bijective Function F

For any x, Q_l(x) for each l (l = 1, ..., L) takes |Q_l| possible values. Consequently (l, Q_l(x)), over all l, takes Σ|Q_l| possible values. A bijective function

F : (l, Q_l(x)) → { 1, ..., Σ_{l=1}^{L} |Q_l| }
can be defined in the following manner, for instance:
F : { (1, {q}) | q ∈ Q_1 } → { 1, ..., |Q_1| }
F : { (2, {q}) | q ∈ Q_2 } → { |Q_1| + 1, ..., |Q_1| + |Q_2| }
...
F : { (l, {q}) | q ∈ Q_l } → { Σ_{i=1}^{l−1} |Q_i| + 1, ..., Σ_{i=1}^{l} |Q_i| }
...
F : { (L, {q}) | q ∈ Q_L } → { Σ_{i=1}^{L−1} |Q_i| + 1, ..., Σ_{i=1}^{L} |Q_i| }
PMCRI: A Parallel Modular Classification Rule Induction Framework Frederic Stahl, Max Bramer, and Mo Adda University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth PO1 3HE, United Kingdom {Frederic.Stahl, Max.Bramer, Mo.Adda}@port.ac.uk
Abstract. In a world where massive amounts of data are recorded on a large scale we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However alternative technologies have been derived that often produce better rules but do not scale well on large datasets. Such an alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale up results of a first prototype implementation.
1 Introduction

The growing interest and importance of commercial knowledge discovery and data mining techniques has led to a growing interest in the area of classification rule induction from data samples to enable the classification of previously unseen data. Research in classification rule induction can be traced back at least to the 1960s [1]. A very widely used method to induce classification rules is TDIDT [2], which has given rise to a variety of implementations such as C5.0. However, alternative algorithms exist, such as the Prism algorithm [3]. Prism produces more generalised rules than TDIDT and thus tends to perform better on noisy datasets. As a result Prism has been used in areas where datasets are naturally noisy, such as image recognition [4] or text classification. Prism is also the basis for further data mining algorithms such as PrismTCS [5] and N-Prism [6]. A free implementation of Prism can be found in the WEKA package [7] and also in the Inducer workbench [8]. The increase in performance of computer hardware such as CPU power and disc storage, and of sensors to record data such as CCTV cameras, enables companies and researchers to generate and store larger and larger datasets to which they still wish to apply classification rule induction algorithms. This has led to the exploration of a new niche in data mining: parallel and distributed data mining. So far, work on distributed and parallel classification rule induction has been focused on the well-established TDIDT approach. Notable developments are the SLIQ [9] and its successor the SPRINT [10] algorithm. The latter
achieves an almost linear scale-up with respect to the number of CPUs and the sample size. However, very little work has been done on scaling up alternative algorithms such as Prism. One approach to scaling a data mining algorithm is to sample the data before the algorithm is applied. Catlett's work [11] showed that sampling of data results in a loss of accuracy in the induced classifier. However, Catlett's research was conducted 17 years ago and the datasets he used were fairly small compared with those used today. Frey and Fisher found in 1999 that the rate of increase of accuracy slows down with the increase of the sample size [12]. This resulted in the search for optimized methods for sampling massive datasets, such as progressive sampling [13]. Whereas sampling might be an option for predictive modelling, scaling up data mining algorithms is still desirable in applications that are concerned with the discovery of new knowledge. Chan and Stolfo considered a way to scale up classification rule induction by dividing the data into subsets that fit in a single computer's memory and then generating a classifier on each data subset in parallel on several machines [14, 15]. The different classifiers generated are then combined by using various algorithms in order to achieve a final classifier. Despite the significant reduction of the run times of the classification rule induction process, Chan and Stolfo's studies also showed that this approach does not achieve the accuracy of a single classifier induced on the same training data. In order to meet the need for a well scaling, more generalised and thus noise-tolerant classifier, we investigate and improve PrismTCS's scaling behaviour and derive an approach to inducing classification rules in parallel for algorithms of the Prism family. We present a framework that induces modular classification rules in parallel based on the PrismTCS algorithm and evaluate its scaling behaviour.
2 Inducing Modular Classification Rules

The main drawback of the TDIDT approach, also often called the divide-and-conquer approach, lies in the intermediate representation of its classification rules in the form of a decision tree. Rules such as:

IF a = 1 AND b = 1 THEN class = 1
IF c = 1 AND d = 1 THEN class = 0

which have no attribute in common, could not be induced directly using the TDIDT approach. In such cases, TDIDT will first need to introduce additional tests that are logically redundant simply to force the rules into a form suitable for combining into a tree structure. This will inevitably lead to unnecessarily large and confusing decision trees. Cendrowska designed the original Prism algorithm to induce directly sets of 'modular' rules that generally will not fit conveniently into a tree structure, thus avoiding the redundant terms that result when using the TDIDT approach. Prism generally induces rule sets that tend to overfit less compared with TDIDT, especially if it is applied to noisy datasets or datasets with missing values [6]. Cendrowska's Prism algorithm follows the separate-and-conquer approach, which learns a rule that explains a certain part of the training data. It then separates the data explained by the rule induced and induces the next rule using the remaining data. Thus it recursively
“conquers” until no training data is left. This strategy can be traced back to the AQ learning system [16]. The basic separate and conquer algorithm can be described as follows:

Rule_Set = [];
While Stopping Criterion not satisfied {
    Rule = Learn_Rule;
    Remove all data instances covered by Rule;
}

The algorithm-specific procedure Learn_Rule learns the best rule for the current training data subset. After each rule is induced, all data instances covered by that rule are deleted and the next rule is learned from the remaining data instances. This is done until a Stopping Criterion is fulfilled. The Stopping Criterion is also algorithm specific and differs from algorithm to algorithm. PrismTCS (Prism with Target Class, Smallest first), a version of Prism that attempts to scale up Prism to larger datasets, has been developed by one of the present authors [5]. Whereas in PrismTCS the separate-and-conquer approach is applied only once, in the original Prism algorithm it is applied for each class in turn. PrismTCS has a comparable predictive accuracy to that of Prism [5]. Our implementation of PrismTCS for continuous data only is summarised in the following pseudo code:

(a) working dataset W = restore Dataset;
    delete all records that match the rules that have been derived so far and select the
    target class i = class that covers the fewest instances in W;
(b) For each attribute A in W
    - sort the data according to A;
    - for each possible split value v of attribute A calculate the probability that the
      class is i for both subsets A < v and A ≥ v;
(c) Select the attribute that has the subset S with the overall highest probability;
(d) build a rule term describing S;
(e) W = S;
(f) Repeat b to e until the dataset contains only records of class i.
    The induced rule is then the conjunction of all the rule terms built at step d;
(g) restore Dataset = restore Dataset – W;
Repeat a to f until W only contains instances of class i or is empty;
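To make the search in steps (b)–(d) concrete, the following Python sketch shows one possible rendering of a single rule-term induction step for continuous attributes. It is only an illustration of the technique: the function and variable names (find_best_rule_term, data, target_class) are ours and do not come from the PrismTCS implementation.

# Minimal sketch of steps (b)-(d) of PrismTCS for continuous attributes.
# 'data' is a list of (attribute_values, class_label) pairs; names are illustrative only.

def find_best_rule_term(data, target_class, num_attributes):
    """Return (probability, attribute index, split value, direction, covered subset)
    of the rule term whose subset has the highest probability of the target class."""
    best = None
    for a in range(num_attributes):
        values = sorted({row[0][a] for row in data})          # candidate split values
        for v in values:
            for direction in ('<', '>='):
                subset = [row for row in data
                          if (row[0][a] < v) == (direction == '<')]
                if not subset:
                    continue
                p = sum(1 for row in subset if row[1] == target_class) / len(subset)
                if best is None or p > best[0]:
                    best = (p, a, v, direction, subset)
    return best

# The calling loop (steps (b)-(f)) keeps appending rule terms and narrowing the
# working dataset W to the covered subset S until W only contains target-class records.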
The following approaches and the parallel classification rule induction algorithm presented in this paper are explained in the context of PrismTCS. However, our approaches can be applied to Prism and all its descendants analogously.
2.1 Speeding Up PrismTCS

We identified two major overheads in Prism and PrismTCS that lower their computational efficiency considerably. The overheads comprise sorting for continuous attributes in step b of the algorithm and the frequent deletion of data instances and resetting of the training dataset in steps a, e and g of the algorithm. With respect to the sorting overhead, we removed the innermost loop of sorting in step b by employing a pre-sorting strategy. The training dataset is pre-sorted by building attribute lists
similar to those used in the SPRINT algorithm [10, 17]. These attribute lists can be sorted before the first iteration of the Prism algorithm and remain sorted for the whole duration of the algorithm. With respect to the frequent restoring of the training dataset, an efficient data compression algorithm has been developed. When we talk about compression we mean an efficient way to delete and restore instances of the training dataset, while maintaining the sorted nature of the attributes, which is needed frequently in the algorithms of the Prism family. For example, regarding the PrismTCS pseudo code, data instances are deleted in steps e and g and restored in step a. The challenge here is to find a way of efficient data compression that takes account of the pre-sorted attribute lists. One way to implement this would be to keep two copies of each attribute list in memory, one list for resetting purposes and one list to work with, analogously to the "working dataset W" and the "restore Dataset" in the PrismTCS pseudo code in Section 2. Attribute lists would be restored by replacing the working set of lists with the restore set of lists. However, this approach involves a considerable overhead of memory usage by keeping two copies of the whole training dataset in the form of attribute lists in memory. A further overhead in processing time is caused by frequently creating deep copies of attribute list records. We derived a more memory- and time-efficient algorithm for deleting and restoring data which involves storing the dataset only once in memory. We do that by only working with the record ids of each attribute list record, which are stored in an integer array. This array is used to reference the attribute values and class values, which are stored in separate double precision and character arrays. Thus, when pre-sorting an attribute list we only need to sort the integer array of record ids. Also, when deleting list records we only need to delete the references in the record ids array. Thus the attribute values array and the class values array are left untouched during the whole duration of the Prism algorithm. However, we also need to avoid expensive resizing of the record ids array due to deletion and resetting of ids. We do that by replacing each id that needs to be deleted in the array by the next id that does not need to be deleted, and thus the size of the actual array stays the same. If the number of ids that are not deleted is n then PrismTCS will only take record ids stored between indices 0 and n-1 into account and ignore the rest. Thus PrismTCS is required to update n whenever ids are deleted or reset. The pseudo code below shows the basic compression algorithm:

int numbRelevant;   // number of relevant ids in the array
boolean[] remove;   // each index in the array corresponds to an actual
                    // id value that needs to be deleted

removal(numbRelevant, remove){
    int i, j;
    j = 0;
    // The loop body below is reconstructed from the description above; only its first
    // line survives in the source. 'ids' is the record-id array described in the text.
    // Surviving ids are shifted towards the front so that the array never needs resizing.
    FOR(i = 0; i < numbRelevant; i++){
        IF(remove[ids[i]] == false){
            ids[j] = ids[i];
            j++;
        }
    }
    numbRelevant = j;   // only indices 0 to numbRelevant-1 are considered from now on
}
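To illustrate the data layout described above, here is a small NumPy sketch (ours, not the authors' Java implementation) of one attribute list held as three parallel arrays, where only the record-id array is ever pre-sorted and compacted:

import numpy as np

# One attribute list = three parallel arrays: record ids, attribute values, class values.
attr_values  = np.array([5.1, 3.3, 7.8, 4.4, 6.0])        # double precision values
class_values = np.array(['a', 'b', 'a', 'a', 'b'])         # class per record
ids          = np.argsort(attr_values)                     # pre-sorted record ids

num_relevant = len(ids)                                     # ids[0:num_relevant] are "live"

def remove_ids(ids, num_relevant, to_remove):
    """In-place compaction: keep only ids not marked for deletion, preserving
    their sorted order; the array itself is never resized."""
    j = 0
    for i in range(num_relevant):
        if ids[i] not in to_remove:
            ids[j] = ids[i]
            j += 1
    return j                                                # new value of num_relevant

num_relevant = remove_ids(ids, num_relevant, to_remove={1, 4})
live = ids[:num_relevant]                                   # still sorted by attribute value
print(attr_values[live], class_values[live])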
Fig. 1. Scale up of PrismTCS (regression curves shown in the figure: y = 0.002x² + 9.036x, R² = 1; y = 0.002x² + 1.158x, R² = 1; y = 3.620x, R² = 0.998)
3 PMCRI: A Parallel Modular Classification Rule Induction Framework

There have been several attempts to scale up classification rule induction via parallelisation. In the area of TDIDT we have already mentioned the SPRINT [10] algorithm. We can distinguish two types of parallel processing in data mining: fine-grained parallelisation and loosely coupled distributed data mining [19]. Whereas fine-grained parallelisation makes use of "shared memory multiprocessor" machines (SMP), loosely coupled distributed data mining makes use of "shared nothing" or "massively parallel processors" (MPP) [20]. We will focus here on parallelising modular classification rule induction using an MPP approach. Our reasoning is that an MPP can be represented by a network of workstations and thus is a cheap way of running parallel algorithms for organisations with limited budgets. We want to use an MPP infrastructure to parallelise the modular classification rule induction of Prism by using the Cooperating Data Mining model (CDM) [19]. The CDM model is illustrated in Figure 2. It is partitioned into different sections which we call layers. The first layer of the CDM model is the sample selection procedure, which partitions the data sample S into n sub-samples, where n is the number of workstations available. In the second layer there are n learning algorithms L1,…,Ln that run on the corresponding subsets and generate concept descriptions C1,…,Cn. In the third layer these concept descriptions are then merged by a combining procedure into a final concept description Cfinal. The final concept description in the case of classification rule induction would be a set of classification rules. We developed a parallel modular classification rule induction framework for the Prism family, the PMCRI (Parallel Modular Classification Rule Induction) framework [21], which follows the CDM model, and tested it on PrismTCS. Parallelisation in the first layer is achieved by distributing all attribute lists evenly over n processors and processing them locally by algorithms L1 to Ln, which induce rule terms. To implement the second layer in the CDM model we used a distributed blackboard architecture based on the DARBS distributed blackboard system [22].
Fig. 2. Cooperating Data Mining
Fig. 3. The architecture of the PMCRI framework using a distributed blackboard system in order to parallelise the induction of modular rule terms. It is divided into two logical partitions: one to submit local rule term information and one to retrieve global information about the algorithm’s status.
A blackboard system can be imagined as a physical blackboard which is observed by several experts with different knowledge domains, having a common problem to solve. Each expert will use its knowledge domain plus the knowledge written on the blackboard in order to infer new knowledge about the problem and advertise it to the other experts by writing it on the blackboard. In the software model such a blackboard system can be represented by a client–server architecture. Figure 3 shows the basic communication pattern of PMCRI. The expertise of each expert machine is determined by the attribute lists it holds. Thus, loosely speaking, each expert can induce the "locally best rule term". The experts then use the "Local Rule Term Partition" on the blackboard to exchange information in order to find the "globally best rule term". The winning expert will then communicate the ids of the instances that are uncovered by this rule term to the other waiting experts using the "Global Information Partition" on the blackboard system. Now the next rule term can be induced in the same way. In
Fig. 4. The combining procedure of the CDM is realised by the moderator program, which will assemble the complete rules after each expert submits its globally best rule terms together with information about the rule for which they were induced
PMCRI the attribute lists decrease in size at the same rate, thus the workload on each expert machine stays in the same proportion for all remaining experts. The following steps describe how PMCRI induces one rule in the context of PrismTCS [23]:

Step 1: The moderator (PrismTCS) writes on the "Global Information Partition" the command to induce the locally best rule terms.
Step 2: All experts induce their locally best rule term and write the rule term plus its covering probability and the number of list records covered on the "Local Rule Term Partition".
Step 3: The moderator (PrismTCS) compares all rule terms written on the "Local Rule Term Partition", adds the best term to the current rule, and writes the name of the expert that induced the best rule term on the "Global Information Partition".
Step 4: Each expert retrieves the name of the winning expert.
    IF the expert is the winning expert {
        derive the ids left uncovered by the last induced rule term, write them on the
        "Global Information Partition" and delete the uncovered list records
    } ELSE {
        wait until the ids left uncovered by the best rule term are available on the
        "Global Information Partition", download them and delete the list records
        matching the retrieved ids
    }
In order to induce the next rule term, PMCRI would loop back to step one. For PMCRI to know when to stop the rule it needs to know when the remaining list records on the expert machines are either empty or consist only of instances of the current target class. This information is communicated between the winning expert and the moderator program using the Global Information Partition. It is possible to implement all the descendants of the original Prism algorithm simply by adapting the learner algorithm within this framework. In layer 3 at the end of the PMCRI execution each expert machine will hold a set of terms for each rule. The implementation of the combining procedure in layer three in the CDM model is realised by communicating all the rule terms locally stored at the expert machines to the blackboard. Each rule term is associated with information about the rule and the class for which the terms were induced. The moderator program then simply appends each rule term to its corresponding rule as illustrated in figure 4.
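The following single-process Python sketch mimics the moderator–expert loop of steps 1–4, with plain method calls standing in for the blackboard partitions. It is meant only to illustrate the control flow; the class and method names are ours, and the real PMCRI implementation is distributed over the DARBS blackboard rather than running in one process.

from dataclasses import dataclass

@dataclass
class Term:
    description: str
    probability: float          # covering probability of the target class
    covered_ids: set            # ids covered by this term

class Expert:
    """Owns a subset of attribute lists; reduced here to a list of candidate terms."""
    def __init__(self, candidate_terms, all_ids):
        self.candidate_terms = candidate_terms
        self.live_ids = set(all_ids)

    def best_local_term(self):
        return max(self.candidate_terms, key=lambda t: t.probability)

    def ids_uncovered_by(self, term):
        return self.live_ids - term.covered_ids

    def delete_ids(self, uncovered):
        self.live_ids -= uncovered

def induce_rule(experts, stop):
    """Moderator loop: steps 1-4 repeated until the stopping test holds.
    'stop' stands for the moderator's check that the remaining records are
    empty or consist only of target-class instances."""
    rule = []
    while not stop(experts):
        proposals = {name: e.best_local_term() for name, e in experts.items()}   # steps 1-2
        winner = max(proposals, key=lambda n: proposals[n].probability)          # step 3
        rule.append(proposals[winner])
        uncovered = experts[winner].ids_uncovered_by(proposals[winner])          # step 4
        for e in experts.values():
            e.delete_ids(uncovered)
    return rule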
4 Performance of PMCRI

To evaluate PMCRI we used the yeast dataset from the UCI repository [18]. To create a larger dataset and thus a higher (more challenging) workload for our system, we first appended it to itself in a horizontal direction until the dataset comprised a total number of 50 attributes. We used this base dataset to evaluate the system's performance. We then appended the data to itself in a vertical direction in order to increase the number of instances and thus increase the system's workload. Please note that the learner algorithms of PMCRI are based on PrismTCS and produce exactly the same rules as PrismTCS would induce. There is therefore no need to concern ourselves with issues concerning the comparative quality of rules generated by the different algorithms. Also, please note that all datasets used were based on the yeast dataset, thus the classifiers induced in each experiment were identical and issues relating to differing numbers of rules and rule terms do not arise. This enables us to focus on issues of workload only. We made scale up experiments to evaluate PMCRI's performance with respect to its number of processors, speed up experiments to evaluate its performance with respect to the number of processors together with the processors' workload, and size up experiments in order to evaluate its performance concerning the system's total workload. The hardware we used comprised ten identical machines, each with one 2.8 GHz Pentium processor and one GB of memory. We ran a lightweight xUbuntu Linux system on each machine.

4.1 Scale Up

Scale up is used to observe the system's ability to be enlarged. For examining the scale up of PMCRI we observe how the response time changes if the number of
Fig. 5. Scale up of PMCRI
processors is increased while the workload per machine stays constant. In the ideal case the response time of PMCRI would stay constant, since the total workload of each processor stays the same. We studied three scale up experiments with a workload corresponding to 130,000, 300,000 and 850,000 instances per processor. Please note that, due to the distribution of the data via attribute lists, each processor only holds a part of each data instance; for example, the 130k-instance workload per processor in the case of two processors actually corresponds to 260k instances with half the attributes of each instance. In order to simplify matters we just refer to 130k training instances. The results of the scale up experiments are shown in Figure 5. The results show a good scale up. The drop in scale up when adding more processors can be explained by the additional communication overhead in the LAN: as more processors need to synchronise by communicating information about covered and uncovered list ids, while the amount of data per processor stays constant, the number of ids that need to be communicated increases with an increasing total number of instances. However, as Figure 5 also shows, the effect of the communication overhead can be lowered by increasing the workload per processor. Loosely speaking, the higher the overall workload the more the system profits from using additional processors.

4.2 Speed Up

Speed up is used to compare how much faster a parallel algorithm is than its serial version. The workload we can impose on the serial version of PrismTCS is limited by the size of the memory of the computer used. However, as can be seen in Section 4.3, all parallel versions of PrismTCS using PMCRI are faster than the serial version, thus an absolute speed up compared with the serial PrismTCS would be positive. We are, however, able to determine the relative speed up of PMCRI in the context of PrismTCS by basing it on a two-processor configuration. By examining the speed up characteristics of PMCRI we observe how the response time changes with the number of processors while the total workload stays constant. We studied three speed up experiments with a total number of 600k, 1,000k and 2,000k data instances for configurations of 2, 5 and 10 processors. The results of the speed up experiments are shown in Figure 6.
Fig. 6. Speed up of PMCRI
As expected, the speed up increases with the size of the dataset. The ideal speed up would be that if the number of processors is doubled then the response time is halved. However, this behaviour is counteracted, for the same reason as the scale up behaviour, by a communication overhead. Especially for small datasets, the additional communication overhead reduces the benefit of more processors considerably, but for larger datasets the benefit of having more CPU power strongly outweighs the communication overhead. Again we observe that PrismTCS parallelised using PMCRI is faster the more processors we use, especially for higher data workloads.

4.3 Size Up

In size up experiments we examine how PMCRI performs on a fixed processor configuration. We do that by increasing the size of the data while leaving the number of processors constant. Figure 7 shows the size up for three different processor configurations of PMCRI and for our serial version of PrismTCS. We increased the dataset size from 17k up to 8,000k training instances.
Fig. 7. Size up of PMCRI
Fig. 8. Size up using relative response time
Generally we observe a linear size up for PrismTCS and all PMCRI configurations; thus the processing time is a linear function of the size of the dataset. The equations of a linear regression that confirm the linear behaviour are shown below, where x is the number of training instances and y the runtime in ms:

PrismTCS:                  y = 3.840x (R² = 0.994)
PMCRI with 2 processors:   y = 2.019x (R² = 0.997)
PMCRI with 5 processors:   y = 1.065x (R² = 0.995)
PMCRI with 10 processors:  y = 0.775x (R² = 0.997)
Please note that all experiments in Figure 7 were sized up to their maximum number of training instances. The maximum number of data instances is limited by the total amount of memory in the system. Buffering attribute lists to the hard disc would be possible in order to overcome memory limitations; however, frequent I/O operations are unwanted because they are expensive in time. We can clearly see that if we use double the amount of memory we can hold roughly double the amount of training instances in memory. Figure 8 shows the size up using the relative response time for the same configurations as in Figure 7; this time we also added the ideal size up to the figure. The ideal size up would be that if we have double the amount of training instances on the same number of processors then we need double the amount of time to train our classifier. We can observe that the serial version of PrismTCS clearly scales worse than the ideal runtime, whereas for all parallel versions we observe a better size up behaviour than the ideal one; in particular, the more processors we use the better the size up result compared with the ideal size up behaviour. Thus we can say that PMCRI shows a superior size up performance.
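For readers who want to reproduce such figures from raw measurements, the ratios used in Sections 4.1–4.3 reduce to simple arithmetic. The helper below is our own illustration of that arithmetic (the function names and the choice of base configuration are assumptions, not code from the paper).

def scale_up(runtime_base, runtime_p_procs):
    """Response-time ratio when processors and total workload grow together
    (workload per processor constant); the ideal value is 1.0."""
    return runtime_base / runtime_p_procs

def relative_speed_up(runtime_base, runtime_p_procs):
    """Speed up relative to a base configuration (here the 2-processor one)
    at constant total workload."""
    return runtime_base / runtime_p_procs

def size_up(runtime_small, runtime_large):
    """Runtime growth factor on a fixed processor configuration; compared
    against the growth factor of the dataset itself (the 'ideal' size up)."""
    return runtime_large / runtime_small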
5 Conclusions

We presented the work and first results of the PMCRI framework, a Parallel Modular Classification Rule Induction algorithm. PMCRI harvests the computational power of
a network in order to make modular classification rule induction scale better on large datasets. We started the paper by discussing the limitations of the more popular TDIDT algorithm and then discussed the Prism algorithm family as an alternative that naturally tries to overcome TDIDT's limitations. However, Prism's downside is that it does not scale well on large datasets. We identified the frequent deletion and restoring of data records and the frequent sorting operations for continuous attributes as bottlenecks of Prism. The latter bottleneck has been tackled by using attribute lists in order to keep the training data sorted during the whole duration of the algorithm. We addressed the bottleneck of the frequent deletion and restoring of data records by proposing an algorithm that efficiently deletes attribute list instances without having to resize the underlying data structure. By implementing these two approaches we achieve a linear scaling behaviour of PrismTCS. We then described our work on a parallel classification rule induction algorithm based on the same rule generalisation method as PrismTCS, the PMCRI algorithm. The parallelisation in PMCRI is achieved by distributing attribute lists evenly over the machines in the network. Each machine independently induces the rule terms that are locally the best. A global view of the algorithm is achieved on each machine by exchanging information about the local status of each machine using a distributed blackboard architecture. As both PMCRI and PrismTCS employ the same rule generalisation strategy, the rule sets produced by both on the same training data are identical. We then experimentally analysed PMCRI's performance. With respect to its scale up behaviour we observed that the more processors we use, the higher the synchronisation overhead due to communicating local information between processors. However, the synchronisation overhead is counteracted by the processors' workload; thus the higher the workload, the closer PMCRI scales up to its ideal behaviour. With respect to PMCRI's speed up we again observed a speed up below the ideal one, which again can be explained by a communication overhead between processors; however, we could observe that the higher the workload, the closer the actual speed up performance is to its ideal. With respect to PMCRI's size up we observed a linear behaviour on a fixed processor configuration with respect to its workload. Generally, for PMCRI we observed a size up behaviour superior to the ideal one. We also stretched the boundaries in order to find the maximum workload of the system. The maximum workload was limited by the total memory available, thus the more machines we used, the more data we were able to use in order to train our classifier. However, we can generally say that the benefit of using more processors in PMCRI diminishes due to an additional communication overhead per processor. This communication overhead can be outweighed by adjusting the total workload of the system. Thus, loosely speaking, the user of the PMCRI system should balance the number of processors used with the total workload: for a smaller workload a configuration with fewer processors, or even the serial version, might be more beneficial concerning the system's runtime, whereas for large workloads a configuration with more processors is more likely to be beneficial.
We are currently developing a more intelligent workload balancing strategy for PMCRI that takes into account that computers might have different CPU speeds and memory sizes available.
References
1. Hunt, E.B., Marin, J., Stone, P.J.: Experiments in Induction. Academic Press, London (1966)
2. Quinlan, J.R.: Induction of decision trees. Machine Learning 1, 81–106 (1986)
3. Cendrowska, J.: PRISM: an Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies 27, 349–370 (1987)
4. Shu-Ching, C., Mei-Ling, S., Schengcui, Z.: Detection of Soccer Goal Shots Using Joint Multimedia Features and Classification Rules. In: Fourth International Workshop on Multimedia Data Mining, Washington, DC, USA, pp. 36–44 (2003)
5. Bramer, M.: An Information-Theoretic Approach to the Pre-pruning of Classification Rules. In: Proceedings of the IFIP Seventeenth World Computer Congress - TC12 Stream on Intelligent Information Processing, pp. 201–212. Kluwer, B.V., Dordrecht (2002)
6. Bramer, M.: Automatic Induction of Classification Rules from Examples Using N-Prism. In: Research and Development in Intelligent Systems XVI (2000)
7. Garner, S.: Weka: The Waikato Environment for Knowledge Analysis. In: New Zealand Computer Science Research Students Conference, pp. 57–64 (1995)
8. Bramer, M.: Inducer: a public domain workbench for data mining. International Journal of Systems Science 36(14), 909–919 (2005)
9. Metha, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057. Springer, Heidelberg (1996)
10. Shafer, J.C., Agrawal, R., Mehta, M.: SPRINT: A Scalable Parallel Classifier for Data Mining. In: Twenty-second International Conference on Very Large Data Bases (1996)
11. Catlett, J.: Megainduction: Machine Learning on Very Large Databases. University of Technology, Sydney (1991)
12. Frey, L.J., Fisher, D.H.: Modelling Decision Tree Performance with the Power Law. In: Eleventh International Workshop on Artificial Intelligence and Statistics (1999)
13. Provost, F., Jensen, D., Oates, T.: Efficient Progressive Sampling. In: Geoffrey, I. (ed.) Knowledge Discovery and Data Mining, pp. 23–32 (1999)
14. Chan, P.K., Stolfo, S.J.: Experiments on Multistrategy Learning by Meta Learning. In: Second International Conference on Information and Knowledge Management, pp. 314–323 (1993)
15. Chan, P.K., Stolfo, S.J.: Meta-Learning for Multistrategy and Parallel Learning. In: Second International Workshop on Multistrategy Learning, pp. 150–165 (1993)
16. Michalski, R.S.: On the quasi-minimal solution of the general covering problem. In: Proceedings of the Fifth International Symposium on Information Processing, Bled, Yugoslavia, pp. 125–128 (1969)
17. Zaki, M.J., Ho, C.T., Agrawal, R.: Parallel Classification for Data Mining on Shared Memory Multiprocessors. In: Fifteenth International Conference on Data Mining (1999)
18. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Sciences (1998)
19. Provost, F.: Distributed Data Mining: Scaling Up and Beyond. In: Kargupta, H., Chan, P. (eds.) Advances in Distributed and Parallel Knowledge Discovery. AAAI Press / The MIT Press (2000)
20. Kamath, C., Musik, R.: Scalable Data Mining through Fine-Grained Parallelism. In: Kargupta, H., Chan, P. (eds.) Advances in Distributed and Parallel Knowledge Discovery. AAAI Press / The MIT Press (2000)
21. Stahl, F., Bramer, M.: P-Prism: A Computationally Efficient Approach to Scaling up Classification Rule Induction. In: IFIP International Conference on Artificial Intelligence. Springer, Milan (2008)
22. Nolle, L., Wong, K.C.P., Hopgood, A.: DARBS: A Distributed Blackboard System. In: Twenty-first SGES International Conference on Knowledge Based Systems (2001)
23. Stahl, F., Bramer, M.: Parallel Induction of Modular Classification Rules. In: Twenty-eighth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence. Springer, Cambridge (2008)
Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method Roberto Tronci, Giorgio Giacinto, and Fabio Roli DIEE Department of Electrical and Electronic Engineering, University of Cagliari, 09123 Cagliari, Italy {roberto.tronci, giacinto, roli}@diee.unica.it
Abstract. In two-class score-based problems the combination of scores from an ensemble of experts is generally used to obtain distributions for positive and negative patterns that exhibit a larger degree of separation than those of the scores to be combined. Typically, combination is carried out by a "static" linear combination of scores, where the weights are computed by maximising a performance function. These weights are equal for all the patterns, as one weight is assigned to each of the experts to be combined. In this paper we propose a "dynamic" formulation where the weights are computed individually for each pattern. Reported results on a biometric dataset show the effectiveness of the proposed combination methodology with respect to "static" linear combinations and trained combination rules.
1 Introduction
The combination of experts is largely used in the literature to improve the reliability and the performance of a recognition system [1,2]. The combination can be made at different levels and in different ways depending on the classification problem and the data of interest. In the case of two-class problems, the classes are usually denoted with the terms positive class and negative class (in some specific domains such as biometry the terms are genuine and impostor, respectively). Usually, a two-class expert computes for each pattern a similarity (dissimilarity) score with respect to prototypes of the positive class. The decision is taken by setting a decision threshold on the score produced, so that the outcome is positive if the score is greater than a decision threshold, while it is negative otherwise. If the scores produced are not easily separable through a decision threshold the patterns can be subdivided into "true" and "false" patterns within the positive and negative classes. The aim of a two-class problem is to maximise the true positive rate (i.e. the number of positive patterns classified as positive over the total number of positive patterns), while minimising the false positive rate (i.e. the number of negative patterns classified as positive over the total number of negative patterns). Typically this tradeoff is measured through the use of the Receiver Operating Characteristic curve (ROC), which plots the true positive rate against the false positive rate as the decision threshold th varies along the score range [3,4]. The Area Under
the ROC Curve (AUC) and the Equal Error Rate (EER) are two performance measures derived from the ROC that are typically used to compare the performance of different experts. The AUC is the area under the ROC curve: the larger the AUC, the better the ROC. The EER represents the point where the false positive rate and the false negative rate are equal, and a good system should keep this value as small as possible. The ROC, AUC, and EER are illustrated in Figure 1.
Fig. 1. An example of a ROC curve, its AUC and its EER
As said previously, different methodologies to combine experts have been developed. In the case of two-class problems the combination is performed at the score level. The combination of scores aims to obtain combined scores whose distributions for positive and negative patterns exhibit a larger degree of separation than those produced by the individual experts. In the literature different score combination methods have been proposed [5,6,7,8,9]; the vast majority of them are based on a linear combination of scores where the weights of the combination are computed through the maximisation of a performance function. These weights are "static", that is, they are computed once from a training set, and they are equal for all the patterns of the application we are dealing with. In this paper we propose a formulation of linear score combination that is "dynamic" in the sense that the computed weights for the combination are not fixed for all the patterns, but vary according to the pattern to be classified. The proposed formulation allows devising a number of combination techniques, which can be both unsupervised and supervised. In addition, this formulation is general enough so that typical "static" combination rules can be considered as special cases.
The paper is organised as follows: Section 2 describes the proposed formulation, while some considerations concerning its generality are reported in Section 3. Different techniques based on supervised and unsupervised estimation of the parameters are presented in Section 4. Section 5 reports the experimental phase aimed at assessing and comparing the performances of the proposed methods. Finally, conclusions are outlined in Section 6.
2 Dynamic Score Combination
Let E = {E1, E2, ..., Ej, ..., EN} be a set of N experts and X = {xi} be the set of patterns. Let also fj(·) be the function associated with expert Ej that produces a score for each pattern xi, sij = fj(xi). In order to improve the performance of a pattern recognition system, a linear combination of the scores produced by the N experts for each pattern xi can be employed [5]:

    s_i^* = \sum_{j=1}^{N} \alpha_j \, s_{ij}    (1)
where the values of the weights αj are usually computed by maximising some performance measure (e.g. AUC, error rates, etc.) [5,6,8]. To this end, a training set is used, and the resulting weights are fixed for all the patterns to be classified. This approach can be considered "suboptimal" in the sense that the weights are computed so that the performance is increased on average. On the other hand, an "optimal" solution should produce a set of weights for each pattern. This formulation leads to the following equation:

    s_i^* = \sum_{j=1}^{N} \alpha_{ij} \, s_{ij}    (2)
if we were able to compute the weights αij, the result of the combination would better fit the pattern xi. Thus, ideally, the weights αij can be adapted both to the characteristics of the pattern and to the performance of the experts for that pattern ("dynamic" combination), instead of having a set of weights αj based on the global performance of the experts on the training samples ("static" combination). It is easy to see that in the case of Equation 1 we face the problem of estimating the value of N variables, while in the case of Equation 2 we face the problem of estimating N · M variables, where M is the number of patterns to be classified. As a consequence, the high number of variables in the latter formulation makes it hard to devise a solution to the weights' estimation problem. In the following, we propose an equivalent formulation of the problem, with a reduced number of variables to estimate. Let us consider the "dynamic" linear combination of experts in Equation 2, with N experts E, and let the scores for pattern xi be sorted from the smallest to the largest:

    s_{i1} \leq s_{i2} \leq \cdots \leq s_{iN}
The formula in Equation 2 can be built in an incremental way. If we consider the first three experts, we have the expression

    \alpha_{i1} s_{i1} + \alpha_{i2} s_{i2} + \alpha_{i3} s_{i3}

which is equivalent to

    \alpha_{i1} s_{i1} + s_{i2} + \alpha_{i3} s_{i3}

and, taking into account that s_{i2} is a constant, we have the equivalent expression

    \alpha_{i1} s_{i1} + \alpha_{i3} s_{i3}

If we consider four experts we have

    \alpha_{i1} s_{i1} + \alpha_{i2} s_{i2} + \alpha_{i3} s_{i3} + \alpha_{i4} s_{i4}

and, according to the previous reasoning, this is equivalent to

    \alpha_{i1} s_{i1} + \alpha_{i2} s_{i2} + \alpha_{i4} s_{i4}

and, if we iterate the reasoning, we obtain that the above expression is equivalent to

    \alpha_{i1} s_{i1} + \alpha_{i4} s_{i4}

It can be seen that if we apply the same reasoning for all the N experts we obtain that the dynamic combination is equivalent to

    \beta_{i1} \cdot \min_j(s_{ij}) + \beta_{i2} \cdot \max_j(s_{ij}), \quad \forall x_i

Consequently, Equation 2 can be rewritten as

    s_i^* = \beta_{i1} \cdot \min_j(s_{ij}) + \beta_{i2} \cdot \max_j(s_{ij})    (3)

If we further add the constraint that the weights should sum to one, then Equation 3 can be rewritten in the following form:

    s_i^* = (1 - \beta_i) \cdot \min_j(s_{ij}) + \beta_i \cdot \max_j(s_{ij}), \quad \forall x_i    (4)
This constraint is inherited from linear combination. A consequence of this constraint is that the result of the combination lies in the range [min_j(s_{ij}), max_j(s_{ij})]. On the other hand, if this constraint is not added, the result of the combination can be a generic real value. Let us consider Equation 4 from now on. As stated before, the aim of combination is to find the ideal combined score value s*i for each pattern xi to be classified. Then, taking into account Equations 2 and 4, the following relations hold:

    \sum_{j=1}^{N} \alpha_{ij} \, s_{ij} = s_i^*

    (1 - \beta_i) \cdot \min_j(s_{ij}) + \beta_i \cdot \max_j(s_{ij}) = s_i^*

    \beta_i = \frac{\sum_{j=1}^{N} \alpha_{ij} \, s_{ij} - \min_j(s_{ij})}{\max_j(s_{ij}) - \min_j(s_{ij})}, \quad \forall x_i    (5)

The last relation implies that there exists an optimal value of the weight βi that allows obtaining the ideal combined score value s*i.
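As a quick sanity check of Equation 5 (using arbitrary numbers of our own choosing), the βi it yields indeed reproduces the weighted sum of Equation 2 when plugged into Equation 4:

# Illustrative check of Equations 2, 4 and 5 on made-up numbers.
scores  = [0.2, 0.7, 0.5]          # s_i1, s_i2, s_i3 for one pattern x_i
weights = [0.1, 0.6, 0.3]          # alpha_i1..alpha_i3 of a dynamic linear combination

s_star = sum(a * s for a, s in zip(weights, scores))          # Equation 2: 0.59
lo, hi = min(scores), max(scores)
beta   = (s_star - lo) / (hi - lo)                            # Equation 5: 0.78
print(abs(((1 - beta) * lo + beta * hi) - s_star) < 1e-12)    # Equation 4 gives s_star back -> True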
3 General Nature of the Dynamic Score Combination
In this section we focus on showing the generality of Equation 4, while in the next section we present some methods to compute the value of βi. We named the formulation expressed in Equation 4 the Dynamic Score Combination (DSC),

    s_i^* = (1 - \beta_i) \cdot \min_j(s_{ij}) + \beta_i \cdot \max_j(s_{ij}), \quad \forall x_i
as the weights are computed dynamically for each pattern. This formulation can be considered a general "linear" score combination function. It is easy to see that this formulation embeds the following types of combination:

– the "static" linear score combination: the weights can be estimated either in a supervised or in an unsupervised way, and it is a particular case of the "dynamic" linear score combination. This can be easily proven from the expressions reported in the previous section,

    s_i^* = \sum_{j=1}^{N} \alpha_j \, s_{ij}

  and for each pattern xi the βi have to be computed as

    \beta_i = \frac{\sum_{j=1}^{N} \alpha_j \, s_{ij} - \min_j(s_{ij})}{\max_j(s_{ij}) - \min_j(s_{ij})}, \quad \forall x_i

– the Sum/Mean rule: this is a particular case of a "static" linear combination, where the weights are equal to a constant value,

    s_i^* = \sum_{j=1}^{N} s_{ij}  \quad \text{or} \quad  s_i^* = \frac{1}{N} \sum_{j=1}^{N} s_{ij}

  and for each pattern xi the value of βi is given by the following expressions:

    \beta_i^{sum} = \frac{\sum_{j=1}^{N} s_{ij} - \min_j(s_{ij})}{\max_j(s_{ij}) - \min_j(s_{ij})}, \qquad
    \beta_i^{mean} = \frac{\frac{1}{N}\sum_{j=1}^{N} s_{ij} - \min_j(s_{ij})}{\max_j(s_{ij}) - \min_j(s_{ij})}

– the Max rule: when βi = 1, the maximum score is used, s_i^* = \max_j\{s_{ij}\}

– the Min rule: when βi = 0, the minimum score is used, s_i^* = \min_j\{s_{ij}\}

– the Dynamic Score Selection (DSS): the selection of one of the available scores from the N experts for each pattern xi is ideally performed by choosing the maximum or the minimum score for a positive or a negative pattern, respectively [7]. This choice can be made using an estimate of the probability of the pattern belonging to the positive class. As an alternative, to account for the uncertainty in the estimation of the probability, the following expression can be used to obtain a score in the range [min_j(s_{ij}), max_j(s_{ij})] [10]:

    s_i^* = P(pos|x_i) \cdot \max_j(s_{ij}) + P(neg|x_i) \cdot \min_j(s_{ij})    (6)

Moreover, if we do not limit the value of the combined score to the range [min_j(s_{ij}), max_j(s_{ij})], we can use a different interval, such as the "ideal" values of the scores for a positive and a negative pattern (e.g. 1 and 0, respectively). Consequently, Equation 4 can be modified by using the ideal values for positive and negative patterns in place of the maximum and the minimum scores, respectively. Finally, Equation 3 can be used to obtain combined scores that take values in ℝ.
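As a small illustration of the Sum/Mean special case listed above (again with made-up scores and our own variable names), βi^mean plugged into Equation 4 reproduces exactly the mean of the scores:

scores = [0.25, 0.75, 0.5]                              # scores of N = 3 experts for one pattern
lo, hi = min(scores), max(scores)
mean = sum(scores) / len(scores)                        # output of the Mean rule: 0.5
beta_mean = (mean - lo) / (hi - lo)                     # beta of the Mean special case: 0.5
print(abs((1 - beta_mean) * lo + beta_mean * hi - mean) < 1e-12)   # Equation 4 gives the mean -> True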
4 Ways to Estimate βi
The main problem with Equation 4 is to estimate the value of βi for each pattern xi. As was shown in the previous section, Equation 4 embeds different combination methods that suggest different possible strategies for estimating βi. One way to estimate βi is to use the Dynamic Score Selection (DSS) approach illustrated in Equation 6, thus attaining a supervised Dynamic Score Combination. It is worth noting that βi = P(pos|xi) in this case. As illustrated in [10], a classifier C is trained on a training set made up of patterns represented in the space of the scores produced by the N experts. Afterwards, this classifier is used to estimate P(pos|xi) for each pattern xi, and finally the combined score is computed according to Equation 6. We will refer to this formulation both as Dynamic Score Selection and as supervised Dynamic Score Combination. The previous procedure requires that a training set is available. Unfortunately, such a training set is not always available, or a classifier cannot be used due to requirements of low computational time and complexity. Another way to estimate βi is to obtain P(pos|xi) (or βi) in an unsupervised way. According to [1], the P(pos|xi) of a combined system can be computed using the Mean rule, the Max rule, or the Min rule. When dealing with similarity scores, these rules are usually applied directly to the scores instead of to the probability estimates. Thus, we propose to compute the βi for each pattern xi using the following rules:
– the Mean rule:  \beta_i = \frac{1}{N} \sum_{j=1}^{N} s_{ij}

– the Max rule:  \beta_i = \max_j\{s_{ij}\}

– the Min rule:  \beta_i = \min_j\{s_{ij}\}

In this way the values of βi are "dynamic", because they are computed for each pattern xi, and unsupervised, because they do not require a training set to be estimated and can be computed on the fly.
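Putting the three unsupervised rules together with Equation 4, a compact NumPy sketch of the Dynamic Score Combination might look as follows; the supervised variant simply replaces βi with a posterior estimate P(pos|xi) from any probabilistic classifier. The function name and interface are ours, not from the paper.

import numpy as np

def dsc(scores, rule="mean", posteriors=None):
    """Dynamic Score Combination (Equation 4).

    scores     : array of shape (n_patterns, n_experts), Min-Max normalised to [0, 1]
    rule       : 'mean', 'max' or 'min' -> unsupervised beta_i
                 'posterior'            -> supervised beta_i = P(pos|x_i)
    posteriors : array of shape (n_patterns,), required for rule='posterior'
    """
    scores = np.asarray(scores, dtype=float)
    s_min, s_max = scores.min(axis=1), scores.max(axis=1)
    if rule == "mean":
        beta = scores.mean(axis=1)
    elif rule == "max":
        beta = s_max
    elif rule == "min":
        beta = s_min
    elif rule == "posterior":
        beta = np.asarray(posteriors, dtype=float)
    else:
        raise ValueError("unknown rule")
    return (1.0 - beta) * s_min + beta * s_max

# Example: two patterns scored by three experts
combined = dsc([[0.2, 0.8, 0.5],
                [0.9, 0.7, 0.6]], rule="mean")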
5 Experiments
The aim of the following experiments is to show that the proposed formulation of the Dynamic Score Combination allows attaining good performance with respect to other combination techniques. When dealing with scores, one of the most relevant applications of two-class classifiers is biometric authentication. Thus, the experiments have been performed using a multi-modal biometric authentication dataset. The dataset is the Biometric Scores Set Release 1 of the National Institute of Standards and Technology (NIST)¹. This dataset contains similarity scores from 4 experts related to face and fingerprint recognition systems. The scores were normalised using the Min-Max rule. In the experiments we compared the Dynamic Score Combination (DSC) based on the Mean rule, the Max rule, and the Min rule with the related "fixed" rules, to evaluate the performance of the proposed formulation when the value of βi is estimated in an unsupervised way. Moreover, results are also compared with the ideal score selector [7], the optimal "static" linear combiner, a "static" linear combination whose weights are computed by Linear Discriminant Analysis (LDA) [11,12], and the Dynamic Score Selection (DSS) [7,10], where the value of βi is estimated using a classifier. The classifiers used for the DSS are the k-Nearest Neighbour (k-NN), the Linear Discriminant Classifier (LDC), the Quadratic Discriminant Classifier (QDC) and the Support Vector Machine (SVM). The performance of the optimal "static" linear combiner is estimated by performing an exhaustive search on the value of the combination weights αj. The dataset was subdivided into 4 subsets: one subset at a time was used as the training set (for the classifiers and LDA), and the other three as the test set. In Tables 1 and 3 the performance is assessed in terms of average and standard deviation of the Area Under the ROC Curve (AUC), the Equal Error Rate (EER), and d'. The AUC is computed using the Wilcoxon-Mann-Whitney (WMW) statistic [13]:

    AUC = \frac{\sum_{p=1}^{n_+} \sum_{q=1}^{n_-} I(s^{pos}_{p,j}, s^{neg}_{q,j})}{n_+ \cdot n_-},
    \qquad
    I(s^{pos}_{p,j}, s^{neg}_{q,j}) =
    \begin{cases}
      1 & s^{pos}_{p,j} > s^{neg}_{q,j} \\
      0 & s^{pos}_{p,j} < s^{neg}_{q,j}
    \end{cases}
¹ Web site: http://www.itl.nist.gov/iad/894.03/biometricscores/
where n_+ is the number of positive patterns and n_- is the number of negative patterns. The EER is the point of the ROC where the false positive rate and the false negative rate are equal. This measure is largely used in the biometric field: the smaller the EER, the better the performance. The d' is a measure of the separation of the distributions of positive and negative scores: the larger the d', the higher the separation [12,14]:

    d' = \frac{|\mu_{pos} - \mu_{neg}|}{\sqrt{\frac{\sigma^2_{pos}}{2} + \frac{\sigma^2_{neg}}{2}}}

In Tables 2 and 4 the performance is assessed in terms of the False Matching Rate (FMR, i.e. the false positive rate) and the False Not Matching Rate (FNMR, i.e. the false negative rate), where

    FMR_j(th) = P(s_{ij} > th \mid x_i \in negative)
    FNMR_j(th) = P(s_{ij} \leq th \mid x_i \in positive)

In particular, the tables report the FMR 1% (FNMR 1%) values, that is, the FNMR (FMR) values when FMR = 1% (FNMR = 1%), and the FMR 0% (FNMR 0%) values, that is, the FNMR (FMR) values when FMR = 0% (FNMR = 0%).
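The measures above can be checked with a few lines of NumPy; the functions below are our own reference rendering of the definitions, not the evaluation scripts used to produce the tables.

import numpy as np

def auc_wmw(pos, neg):
    """Area under the ROC curve via the Wilcoxon-Mann-Whitney statistic."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    wins = (pos[:, None] > neg[None, :]).sum()          # I(s_pos, s_neg) summed over all pairs
    return wins / (len(pos) * len(neg))

def fmr_fnmr(pos, neg, th):
    """False Matching Rate and False Not Matching Rate at threshold th."""
    fmr = np.mean(np.asarray(neg) > th)                 # negatives accepted
    fnmr = np.mean(np.asarray(pos) <= th)               # positives rejected
    return fmr, fnmr

def eer(pos, neg):
    """Equal Error Rate: value where FMR and FNMR (approximately) coincide."""
    ths = np.unique(np.concatenate([pos, neg]))
    rates = [fmr_fnmr(pos, neg, t) for t in ths]
    return min((abs(f - fn), (f + fn) / 2) for f, fn in rates)[1]

def d_prime(pos, neg):
    """Separation of the positive and negative score distributions."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    return abs(pos.mean() - neg.mean()) / np.sqrt(pos.var() / 2 + neg.var() / 2)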
Table 1. Average and standard deviation of AUC, EER and d' for the combination methods in the Finger-Face dataset using all possible score combinations of 3 experts

Method                    AUC               EER               d'
Ideal Score Selector      1.0000(±0.0000)   0.0000(±0.0000)   25.4451(±8.7120)
Optimal Linear Combiner   0.9997(±0.0004)   0.0050(±0.0031)   3.1231(±0.2321)
Mean                      0.9982(±0.0013)   0.0096(±0.0059)   3.6272(±0.4850)
Max                       0.9892(±0.0022)   0.0450(±0.0048)   3.0608(±0.3803)
Min                       0.9708(±0.0085)   0.0694(±0.0148)   2.0068(±0.1636)
DSC Mean                  0.9986(±0.0011)   0.0064(±0.0030)   3.8300(±0.5049)
DSC Max                   0.9960(±0.0015)   0.0214(±0.0065)   3.8799(±0.2613)
DSC Min                   0.9769(±0.0085)   0.0634(±0.0158)   2.3664(±0.2371)
LDA                       0.9945(±0.0040)   0.0296(±0.0123)   2.3802(±0.2036)
DSS k-NN                  0.9987(±0.0016)   0.0104(±0.0053)   6.9911(±0.9653)
DSS ldc                   0.9741(±0.0087)   0.0642(±0.0149)   2.7654(±0.2782)
DSS qdc                   0.9964(±0.0039)   0.0147(±0.0092)   9.1452(±3.1002)
DSS svm                   0.9996(±0.0004)   0.0048(±0.0026)   4.8972(±0.4911)
Tables 1 and 2 show the average performance attained for all the possible combinations of three experts, and for all the 4 subdivisions of the dataset. The Tables show that the proposed DSC formulation, where the values of βi are estimated by the Mean rule (DSC Mean), outperforms the Mean rule in terms of the AUC, the EER, the d’, the FMR 1%, and the FNMR 1%. The DSC
Table 2. Average and standard deviation of FMR 1%, FNMR 1%, FMR 0%, and FNMR 0% for the combination methods in the Finger-Face dataset using all possible score combinations of 3 experts

Method                    FMR 1%            FNMR 1%           FMR 0%            FNMR 0%
Ideal Score Selector      0.0000(±0.0000)   0.0000(±0.0000)   0.0000(±0.0000)   0.0000(±0.0000)
Optimal Linear Combiner   0.0027(±0.0047)   0.0025(±0.0037)   0.0741(±0.0328)   0.0920(±0.1232)
Mean                      0.0090(±0.0075)   0.0175(±0.0245)   0.0530(±0.0210)   0.4899(±0.4020)
Max                       0.0851(±0.0118)   0.2304(±0.0841)   0.5055(±0.0839)   0.8172(±0.2128)
Min                       0.1044(±0.0229)   0.6994(±0.1220)   0.2471(±0.0643)   0.9964(±0.0093)
DSC Mean                  0.0048(±0.0035)   0.0028(±0.0036)   0.0553(±0.0249)   0.5002(±0.4168)
DSC Max                   0.0311(±0.0102)   0.0746(±0.0680)   0.3516(±0.0884)   0.7133(±0.3363)
DSC Min                   0.0909(±0.0232)   0.5159(±0.1603)   0.2261(±0.0613)   0.9725(±0.0533)
LDA                       0.0483(±0.0270)   0.1300(±0.1026)   0.1575(±0.0867)   0.5325(±0.3085)
DSS k-NN                  0.0100(±0.0067)   0.0354(±0.0563)   0.0791(±0.0391)   0.2481(±0.3227)
DSS ldc                   0.0969(±0.0225)   0.6461(±0.1195)   0.2079(±0.0561)   0.9788(±0.0346)
DSS qdc                   0.0160(±0.0129)   0.1140(±0.1760)   0.3906(±0.1591)   0.3800(±0.3001)
DSS svm                   0.0034(±0.0034)   0.0019(±0.0023)   0.0551(±0.0210)   0.1142(±0.1217)
where the values of βi are estimated by the Max and the Min rule outperforms the Max and the Min rule, respectively, in terms of all the considered measures of performance. In particular, the largest improvement is exhibited by the DSC formulation of the Max rule. It is also worth noting that the DSC Mean and the DSC Max generally obtain better performance than that obtained by LDA, which is a trained linear combiner. Thus the proposed unsupervised Dynamic Score Combination mechanism allows outperforming supervised linear combination mechanisms.
Table 3. Average and standard deviation of AUC, EER and d' for the combination methods in the Finger-Face dataset using all possible score combinations of 4 experts

Method                    AUC               EER               d'
Ideal Score Selector      1.0000(±0.0000)   0.0000(±0.0000)   31.1087(±0.6621)
Optimal Linear Combiner   1.0000(±0.0000)   0.0017(±0.0010)   3.3325(±0.3414)
Mean                      0.9996(±0.0003)   0.0045(±0.0013)   3.9444(±0.0599)
Max                       0.9898(±0.0026)   0.0427(±0.0035)   3.0807(±0.1952)
Min                       0.9700(±0.0054)   0.0742(±0.0080)   1.9082(±0.0421)
DSC Mean                  0.9996(±0.0002)   0.0052(±0.0019)   3.9402(±0.0763)
DSC Max                   0.9957(±0.0016)   0.0207(±0.0030)   3.7414(±0.1634)
DSC Min                   0.9739(±0.0048)   0.0716(±0.0060)   2.1957(±0.0564)
LDA                       0.9977(±0.0019)   0.0200(±0.0068)   2.5846(±0.0686)
DSS k-NN                  0.9998(±0.0001)   0.0071(±0.0025)   8.3496(±1.0404)
DSS ldc                   0.9760(±0.0051)   0.0632(±0.0098)   2.9149(±0.2305)
DSS qdc                   0.9991(±0.0008)   0.0097(±0.0044)   11.4145(±2.9062)
DSS svm                   0.9998(±0.0002)   0.0040(±0.0024)   5.5261(±0.3032)
Table 4. Average and standard deviation of FMR 1%, FNMR 1%, FMR 0%, and FNMR 0% for the combination methods in the Finger-Face dataset using all possible score combinations of 4 experts

Method                    FMR 1%            FNMR 1%           FMR 0%            FNMR 0%
Ideal Score Selector      0.0000(±0.0000)   0.0000(±0.0000)   0.0000(±0.0000)   0.0000(±0.0000)
Optimal Linear Combiner   0.0000(±0.0000)   0.0003(±0.0003)   0.0354(±0.0226)   0.0014(±0.0007)
Mean                      0.0039(±0.0026)   0.0001(±0.0001)   0.0251(±0.0044)   0.1525(±0.0985)
Max                       0.0800(±0.0122)   0.2096(±0.1180)   0.5442(±0.0509)   0.9060(±0.1821)
Min                       0.1122(±0.0121)   0.6942(±0.0925)   0.2186(±0.0107)   0.9981(±0.0038)
DSC Mean                  0.0039(±0.0015)   0.0002(±0.0002)   0.0348(±0.0065)   0.1424(±0.0820)
DSC Max                   0.0322(±0.0074)   0.0463(±0.0177)   0.4198(±0.0672)   0.8435(±0.3075)
DSC Min                   0.1044(±0.0119)   0.5359(±0.1129)   0.2070(±0.0167)   0.9779(±0.0442)
LDA                       0.0251(±0.0108)   0.0743(±0.0578)   0.0825(±0.0075)   0.2342(±0.2240)
DSS k-NN                  0.0058(±0.0039)   0.0010(±0.0011)   0.0554(±0.0194)   0.0426(±0.0188)
DSS ldc                   0.0980(±0.0160)   0.6042(±0.0686)   0.1941(±0.0352)   0.9923(±0.0003)
DSS qdc                   0.0097(±0.0061)   0.0061(±0.0069)   0.4011(±0.1851)   0.1759(±0.1204)
DSS svm                   0.0032(±0.0025)   0.0001(±0.0000)   0.0264(±0.0013)   0.0745(±0.0626)
Tables 3 and 4 show the results obtained using all the possible combinations of the four experts at a time. The average and standard deviation of the results are related to the 4 subdivisions of the dataset. The behaviour of the DSC Max and the DSC Min is the same as that obtained in the previous experiments. On the other hand, in this case the performance of the DSC Mean and that of the Mean rule are nearly the same. In Figure 2 a graphical comparison between the DSC based on the Mean, Max, and Min rules and the Mean, Max, and Min rules themselves is reported. Results are related to all the possible combinations of 3 and 4 experts among the 4 subdivisions. In Figure 2a the AUC obtained using the DSC formulation is compared to the AUC obtained using the fixed rules. The graph reports the number of times that the AUC obtained with an unsupervised DSC is greater than the AUC obtained with the corresponding "fixed" rule. In Figure 2b an analogous graph compares the EER obtained using the DSC to the EER obtained using the fixed rules. These figures clearly show that the Dynamic Score Combination generally performs better than the "fixed" rules. Figure 3 shows the same comparison as in Figure 2, where the performance is evaluated in terms of FMR 1% (Fig. 3a), FNMR 1% (Fig. 3b), FMR 0% (Fig. 3c), and FNMR 0% (Fig. 3d). These graphs clearly show that the DSC formulations based on the Max and the Min rule allow obtaining better performance than that of the "fixed" rules also when these performance measures are considered. This statement also holds in the case of the Mean rule when the FMR 1% and FMR 0% are taken into account (Figures 3a and 3c), while when the FNMR 1% and FNMR 0% are considered the performance is nearly the same. Figures 4 and 5 show the comparison between the DSC based on the Mean rule (the best unsupervised DSC) and the Dynamic Score Selection based on k-NN, LDC, QDC, and SVM (the supervised version of the DSC). As for the
Fig. 2. Comparison between the DSC based on “fixed” rules and the “fixed” rules themselves in terms of AUC (a), and the EER (b). In black are reported the number of times that the AUC, or the EER, obtained with the DSC, outperform those obtained using the related “fixed” rule, while in white is reported the opposite case.
previous figures, the graphs show the number of times where the DSC Mean outperforms the DSS. In Figure 4a the AUC obtained using the DSC Mean is
compared to the AUC obtained using the DSS. The graph shows that the DSC Mean is essentially equivalent to the DSS k-NN and the DSS QDC. Moreover, the DSC Mean outperforms the DSS LDC, while it is outperformed by the DSS SVM. In Figure 4b the comparison is shown in terms of EER. In this case the DSC Mean outperforms all the DSS methods but the one based on SVM. In Figure 5 the DSC Mean outperforms the DSS based on k-NN, LDC, and QDC, but not the DSS based on SVM. By analysing Tables 1-4 and Figures 4 and 5 it is clear that the supervised version of the Dynamic Score Combination (the DSS) does not always outperform the unsupervised ones. In particular, if the DSC Mean is considered, the DSS LDC always exhibits lower performance, the DSS k-NN and DSS QDC provide similar performance, while the SVM outperforms all other methods. These results allow us to conclude that the advantages of going supervised strongly depend on the classifier chosen.
Fig. 3. Comparison between the DSC based on “fixed” rules and the “fixed” rules themselves in terms of FMR 1% (a), FNMR 1% (b), FMR 0% (c), and FNMR 0% (d). In black are reported the number of times that the DSC outperforms the related “fixed” rule, while in white is reported the opposite case.
Fig. 4. Comparison between the DSC based on the Mean rule and the DSS in terms of AUC (a), and the EER (b). In black are reported the number of times that the AUC, or the EER, obtained with the DSC Mean is better than that obtained using the DSS, while in white is reported the opposite case.
Fig. 5. Comparison between the DSC based on the Mean rule and the DSS in terms of FMR 1% (a), FNMR 1% (b), FMR 0% (c), and FNMR 0% (d). In black are reported the number of times that the DSC Mean outperforms the DSS, while in white is reported the opposite case.
6 Conclusions
In this paper we propose a general formulation of the linear combination of scores called Dynamic Score Combination. This formulation is general because it embeds different formulations: the "static" linear combination, the "dynamic" linear combination, and the Dynamic Score Selection. Moreover, this formulation can be used in both supervised and unsupervised mode. The experiments show that the unsupervised use is generally preferable with respect to "fixed" combination rules of similar computational complexity. Moreover, when the supervised techniques are considered, their performance strongly depends on the classifier used, and in many cases it is very close to that provided by the best unsupervised Dynamic Score Combination.
In conclusion, the proposed formulation can be exploited in both supervised and unsupervised mode. The unsupervised DSC based on the Mean rule has the advantage of low computational time and complexity (comparable to those of a "fixed" combination rule). The supervised DSC (the DSS), depending on the classifier used, can allow obtaining better performance, but with an increase in computational time and complexity.
References
1. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
2. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, Inc., Chichester (2004)
3. Bradley, A.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
4. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
5. Marrocco, C., Duin, R., Tortorella, F.: Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition 41(6), 1961–1974 (2008)
6. Ross, A., Nandakumar, K., Jain, A.: Handbook of Multibiometrics. Springer, Heidelberg (2006)
7. Tronci, R., Giacinto, G., Roli, F.: Dynamic Score Selection for Fusion of Multiple Biometric Matchers. In: ICIAP 2007, pp. 15–20 (2007)
8. Nandakumar, K., Chen, Y., Dass, S., Jain, A.: Likelihood ratio-based biometric score fusion. IEEE Trans. on PAMI 30(2), 342–347 (2008)
9. Marcialis, G., Roli, F.: Fusion of multiple fingerprint matchers by single-layer perceptron with class-separation loss function. Pattern Recognition Letters 26(12), 1830–1839 (2005)
10. Tronci, R., Giacinto, G., Roli, F.: Dynamic score combination of binary experts. In: 19th International Conference on Pattern Recognition (ICPR 2008), pp. 1–4. IEEE Computer Society, Los Alamitos (2008)
11. Ross, A., Jain, A.K.: Information fusion in biometrics. Pattern Recognition Letters 24, 2115–2125 (2003)
12. Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley & Sons, Inc., Chichester (2001)
13. Hanley, J., McNeil, B.: The meaning and the use of the area under a receiver operating characteristic curve. Radiology 143, 29–36 (1982)
14. Tronci, R., Giacinto, G., Roli, F.: Selection of experts for the design of multiple biometric systems. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 795–809. Springer, Heidelberg (2007)
ODDboost: Incorporating Posterior Estimates into AdaBoost

Olga Barinova and Dmitry Vetrov

Moscow State University, Dept. of Computational Mathematics and Cybernetics, 119992 Moscow, Russia
[email protected], [email protected]
Abstract. Boosting methods, while being among the best classification methods developed so far, are known to degrade in performance in the case of noisy data and overlapping classes. In this paper we propose a new upper generalization bound for weighted averages of hypotheses, which uses posterior estimates for the training objects and is based on a reduction of the binary classification problem with overlapping classes to a deterministic problem. Given accurate posterior estimates, the proposed bound is lower than the existing bound by Schapire et al. [25]. We design an AdaBoost-like algorithm which optimizes the proposed generalization bound and show that, incorporated with good posterior estimates, it performs better than standard AdaBoost on real-world data sets.
1 Introduction

Boosting is one of the best-performing classification algorithms developed so far; in the experimental comparison of classifiers performed by Caruana et al. [6] it demonstrated the best result. There are many different views on the nature of boosting, but the most theoretically solid explanation of its generalization capabilities is margin theory [25]. Margin theory provides an upper generalization bound that is independent of the number of iterations made by boosting. This bound suggests that boosting may not overfit even if run for many rounds. However, different researchers have reported that boosting is very sensitive to noise and suffers from overfitting after a considerable number of iterations, although much less than one would expect from such a general model [9][16]. It is still an open question why boosting actually overfits in practice even with a constant-complexity base learner (for example, a stump). In this paper we consider the problem of the degrading performance of boosting in binary classification problems with overlapping classes. By overlapping classes we mean that for a considerable number of instances x both objects (x, +1)
and (x, -1) can occur with positive probabilities. We propose a method for the reduction of the overlapping-classes problem to a deterministic problem, where each instance x is associated with a label c(x). We also introduce a new upper generalization bound for weighted averages of hypotheses based on this reduction, which is independent of the number of iterations made by boosting. The proposed bound has a form similar to the one provided by margin theory [25], but uses estimates of posterior probabilities for
training samples as an input. When incorporated with accurate posterior estimates, it has a lower extra term compared to the upper bound provided by margin theory. We then propose an AdaBoost-like algorithm which is proven to minimize the introduced upper bound. We incorporate the developed algorithm with the posterior estimation method proposed by Vezhnevets et al. [29] and perform experiments on real-world datasets from the UCI repository [3]. This paper is organized as follows. In Section 2 we consider related work; in Section 3 new upper bounds for weighted averages of hypotheses are introduced. Section 4 describes an AdaBoost-like algorithm designed specifically for minimizing the proposed generalization bound. In Section 5 we describe an algorithm for the estimation of posterior probabilities for training instances which uses only the training set. Section 6 shows experimental results comparing the proposed algorithm with standard AdaBoost on real-world datasets. The last section is left for the conclusion.
2 Related Work

A large body of research addresses the problem of explaining and avoiding the phenomenon of boosting overfitting. Domingos [10] showed the relation between the margin-theory explanation of boosting and the bias-variance explanation. He also made an interesting statement that reducing variance (increasing margins) is beneficial only for unbiased samples, while for biased samples it is preferable to have high variance (lower margin). A biased sample is a sample for which the optimal prediction, for a given loss and family of classifiers, differs from its current label.

Many works were devoted to explaining and avoiding the degradation of boosting performance on noisy datasets. Several authors reported that boosting tends to increase the weights of a few hard-to-learn samples, which leads to overfitting. Several modifications of the reweighting scheme were proposed that make the weights change more smoothly. Domingo et al. [11] propose a modification of AdaBoost in which the weights of the examples are kept bounded by their initial value; however, the authors admit no significant difference between AdaBoost and its modification in experiments on noisy data. Friedman [15] suggests that shrinking the weights of base hypotheses would increase boosting generalization capability.

The property of concentrating on a few hard-to-learn patterns can be interpreted in terms of margin maximization; this view leads to regularized modifications. Ratsch et al. [22] view boosting as minimization of a cost functional through an approximate gradient descent with respect to a margin. The authors propose regularization methods to achieve "soft margins" for AdaBoost, which should keep boosting from concentrating on misclassified samples with large negative margins. Friedman [15] also considers regularization through introducing proportional shrinkage into gradient boosting.

Other authors see the unboundedness of the loss function as the major source of overfitting. Viewing boosting as a gradient-descent search for a good fit in function space allows modifying the loss function. Mason et al. [19] view boosting as gradient descent on an appropriate cost functional in a suitable inner product space. Their modification of the boosting algorithm with a normalized sigmoid cost function was reported to outperform
AdaBoost when boosting decision stumps, particularly in the presence of label noise. Friedman [15] considers the Huber loss function. Rosset [24] proposes an approach of weight decay for observation weights, which is equivalent to "robustifying" the underlying loss function. However, the author admits that in experiments on real-world data there is no consistent winner between the non-decayed and the decayed versions of the boosting algorithm.

A large body of research addresses learning in the presence of classification noise. Krause & Singer [18] suggest employing the EM algorithm and changing the loss function to make boosting tolerant to a known level of noise. Takenouchi & Eguchi [27] develop a modification of AdaBoost for the classification noise case. The authors modify the loss function and show that the proposed method moderates the overweighting of outliers using a uniform weight distribution.

Other researchers see the main source of boosting overfitting not in the inappropriateness of a particular loss function, but in the general concept of average loss minimization, and illustrate its inappropriateness for the case of overlapping classes [29]. The authors use posterior estimates for eliminating from the training set those samples which belong to the opposite class with greater probability than to their own class. They show that this approach can lead to better boosting generalization performance and lower test error. In this work we extend their approach by estimating the probabilities of the objects to belong to each class. We then use these probabilities for recomputing the weights of the training objects in order to make the boosting procedure pay more attention to objects whose classification is confident and less attention to objects whose classification is unclear (i.e., whose probabilities of belonging to each class are close to 0.5).

The closest work to ours is the work by Schapire et al. [27]. The authors describe a modification of the boosting algorithm that permits the incorporation of prior human knowledge as a means of compensating for a shortage of training data. They build on a boosting-style algorithm by Collins et al. [7] and modify the target distribution in order to capture both the class labels of the training samples and their posterior estimates. They show that this modification can lead to better classification performance.

In fact there are many cases when class probabilities for each object of the training set are known. Imagine a situation when class labels are assigned to training objects by more than one expert. Then the rate of experts who voted that a given object belongs to the first class can be considered as a probability that the object belongs to the first class. Taking into account this additional information may potentially improve boosting performance by weakening the influence of the objects on which the experts have different opinions. In this paper we develop a framework for incorporating this knowledge into the standard Discrete AdaBoost algorithm [12]. We show that such an extension leads to new generalization bounds which are tighter than the ones derived in the classical case. Furthermore, we suggest how the proposed approach can be used even when we do not have any prior knowledge about the probabilities of belonging to each of the classes. Our experiments show that in general such boosting, incorporated with posterior estimates, tends to outperform AdaBoost on real-world problems.
3 Proposed Generalization Bounds

Let $T = \{(x_i, y_i)\}$, $i = 1,\dots,n$ be the training set, where $x_i \in X$ is the vector of attributes and $y_i \in \{-1,+1\}$ is the class label. We consider binary classification problems. We take the assumption of the statistical learning framework and assume that the pairs $(x, y)$ are random variables distributed according to an unknown distribution $P(x, y)$. Later we will use the notion of the margin of an object $(x, y)$ relative to a classifier $f(x)$, which is by definition the value of $y f(x)$.

The support vector machine (SVM) was the first classifier which exploited the large-margin approach [2]. Boosting was shown to have large-margin properties only afterwards. The application of margin theory to boosting introduced by Schapire et al. in [25] still remains the most theoretically solid explanation of boosting generalization capabilities. Below we recall the most important results of margin theory. Let us denote $P_S[\chi(x,y)] = \frac{1}{m}\sum_{i=1}^{m} I[\chi(x_i, y_i)]$ and $E_S[z(x,y)] = \frac{1}{m}\sum_{i=1}^{m} z(x_i, y_i)$.

Theorem (Schapire et al.): upper generalization bound based on the cumulative distribution of margins. Let $P$ be a distribution over $X \times \{+1,-1\}$ and $S$ be a sample of $m$ examples chosen independently at random according to $P$. Assume that the base-classifier space $H$ is finite and let $\delta > 0$. Then with probability at least $1-\delta$ over the random choice of the training set $S$, every weighted average of hypotheses $f$ whose weights sum up to one satisfies the following bound for all $\theta > 0$:
$$P[yf(x) \le 0] \le P_S[yf(x) \le \theta] + \varepsilon(m, |H|, \theta, \delta),$$
where
$$\varepsilon(m, |H|, \theta, \delta) = O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \,\log |H|}{\theta^2} + \log(1/\delta)\right)^{1/2}\right).$$
This theorem states that the classification error of any weighted average of hypotheses can be upper-bounded by the value of the cumulative margin distribution on the training set plus an extra term, which depends on the number of training objects, the number of weak hypotheses and two user-defined constants. The most important property of this extra term is its independence of the number of averaged hypotheses and their coefficients. If we are given a fixed training set for calculating this upper bound, the extra term is constant. Thus we can govern the value of the bound by choosing hypotheses and coefficients in the weighted average so as to minimize the first term. The AdaBoost algorithm [25] was proven to do exactly the right thing.

Theorem (Schapire et al.): large margin properties of AdaBoost. Suppose the base learning algorithm, when called by AdaBoost, generates classifiers with weighted training errors $\varepsilon_1, \dots, \varepsilon_T$. Then for any $\theta$, we have that
$$P_S[yf(x) \le \theta] \le 2^T \prod_{t=1}^{T} \varepsilon_t^{\,1-\theta}\,(1-\varepsilon_t)^{\,1+\theta}.$$
In this work, following the logic of margin theory, we introduce a new generalization bound for weighted averages and propose an AdaBoost-like algorithm for the minimization of this new bound. Below we sketch the considerations leading to the proposed generalization bound. The first work on boosting [12] operated within the PAC-learning framework [17] for the binary classification problem. Unlike statistical learning, where both the -1 and +1 class labels can be associated with an instance $x$ (both $p(+1\,|\,x)$ and $p(-1\,|\,x)$ can be positive), PAC learning assumes that the label of each instance $x$ is defined by a target function $c(x)$ and is fixed. Freund & Schapire [13], in their discussion of Friedman's paper, suggest that for overlapping classes (when the Bayesian error is not zero) "AdaBoost is not the optimal method in this case". It should be noted that in real-world applications it is a rare case that the Bayesian error is zero and the classes are perfectly separable, due to imperfect feature-vector representation of objects, limitations of measuring equipment, and noise. Our main idea is to bring boosting back from the statistical learning framework to the PAC-learning framework. Further we will need the following notation:
$$p_{\max}(x) = \max\big(p(+1\,|\,x),\, p(-1\,|\,x)\big), \qquad p_{\min}(x) = \min\big(p(+1\,|\,x),\, p(-1\,|\,x)\big),$$
$$E_y\big(z(x,y)\big) = p(+1\,|\,x)\, z(x,+1) + p(-1\,|\,x)\, z(x,-1).$$
The next statement shows how any statistical learning problem can be reduced to an equivalent PAC-learning problem. Here we operate with true distributions.

Theorem 1. Let $P$ be a distribution over $X \times \{+1,-1\}$. Consider the distribution
$$\tilde{P}(x) = \big(p_{\max}(x) - p_{\min}(x)\big)\, P(x) \,/\, (1 - 2B)$$
and the target function $c(x) = \arg\max_y \big(p(+1\,|\,x),\, p(-1\,|\,x)\big)$. Consider a classifier space $H$. Then for any classifier $f(x)$ the following inequality holds:
$$P[yf(x) \le 0] \le B + (1-2B)\,\tilde{P}[c(x) f(x) \le 0],$$
where $B$ stands for the error of the Bayesian classifier. Thus the solutions of the statistical learning problem with distribution $P$ and of the PAC-learning problem with distribution $\tilde{P}(x)$ and target function $c(x)$ coincide:
$$\mathrm{Arg\,min}_H\, \tilde{P}\big(h(x) \ne c(x)\big) = \mathrm{Arg\,min}_H\, P\big(h(x) \ne y\big).$$
Proof: the proof of this and the following theorems can be found in [1].

Using this fact, the previous theorem and the generalization bound from Schapire et al., we can upper-bound the classification error of any weighted average using a sample $\tilde{S}$ from the modified distribution $\tilde{P}$.

Theorem 2. Let $P$ be a distribution over $X \times \{+1,-1\}$ and let $\tilde{P}$ stand for the corresponding distribution and $c(x)$ for the corresponding target function in the PAC-learning framework. Let $\tilde{S}$
be a sample of $m$ examples chosen independently at random according to $\tilde{P}$, where labels are chosen according to $c(x)$. Assume that the base-classifier space $H$ is finite and let $\delta > 0$. Then with probability at least $1-\delta$ over the random choice of the training set $\tilde{S}$, every weighted average of hypotheses $f$ whose weights sum up to one satisfies the following bound for all $\theta > 0$:
$$P[yf(x) \le 0] \le B + (1-2B)\, P_{\tilde{S}}[c(x) f(x) \le \theta] + (1-2B)\,\varepsilon(m, |H|, \theta, \delta),$$
where $B$ is the error of the Bayesian classifier over the distribution $P$ and $\varepsilon(m, |H|, \theta, \delta)$ is taken from Schapire's theorem.

This bound looks much like Schapire's one, but the extra term here is smaller. The problem is that instead of a sample $\tilde{S}$ from the modified distribution $\tilde{P}$ we are given only a sample $S$ from the distribution $P$. Still, we can approximate the cumulative distribution of margins on the sample $\tilde{S}$ by a weighted cumulative distribution of margins on the sample $S$. This approximation causes some increase of the extra term, but it remains smaller than the one from Schapire's theorem. Noticing also that the condition $c(x)f(x) < \theta$ implies that $E_y(yf(x)) < \theta$, we obtain Theorem 3.

Theorem 3. Let $P$ be a distribution over $X \times \{+1,-1\}$ and let $S$ be a sample of $m$ examples chosen independently at random according to $P$. Assume that the base-classifier space $H$ is finite and let $\delta > 0$. Then with probability at least $1-\delta$ over the random choice of the training set $S$, every weighted average of hypotheses $f$ whose weights sum up to one satisfies the following bound for all $\theta > 0$:
$$P[yf(x) \le 0] \le B + (1-2\hat{B})\, E_S\!\left(\frac{p_{\max}(x) - p_{\min}(x)}{1-2\hat{B}}\; I_{[c(x) f(x) \le \theta]}\right) + \tilde{\varepsilon}(m, |H|, \theta, \delta),$$
where $B$ stands for the error of the Bayesian classifier, $\hat{B} = \frac{1}{m}\sum_{i=1}^{m} p_{\min}(x_i)$,
and $\tilde{\varepsilon}(m, |H|, \theta, \delta) = \varepsilon(m, |H|, \theta, \delta) - 2B\,\frac{\log|H|}{m}$, compared to the term in Schapire's theorem.

This upper bound can be calculated only when the exact values of the posteriors for every training example are known. This limits the practical importance of Theorem 3. The following theorem states the upper bound on the classification error for the case when only rough estimates of the posteriors are given.

Theorem 4. Let $P$ be a distribution over $X \times \{+1,-1\}$ and let $S$ be a sample of $m$ examples chosen independently at random according to $P$. Assume that the base-classifier space $H$ is finite and let $\delta > 0$. Suppose we are given estimates of the posteriors for all training samples that obey the
condition $|\hat{p}_{\max}(x_i) - p_{\max}(x_i)| < \alpha$, $i = 1, \dots, m$.
Then with probability at least $1-\delta$ over the random choice of the training set $S$, every weighted average of hypotheses $f$ whose weights sum up to one satisfies the following bound for all $\theta > 0$:
$$P[yf(x) < 0] \le B + (1-2\bar{B})\, E_S\!\left(\frac{\hat{p}_{\max} - \hat{p}_{\min}}{1-2\bar{B}}\; I_{[\hat{E}_y\, yf(x) < \theta + \beta(\alpha)]}\right) + \frac{2\alpha}{1-2\bar{B}} + \tilde{\varepsilon}\big(m, |H|, \theta - \beta(\alpha), \delta\big),$$
where $B$ stands for the error of the Bayesian classifier, $\bar{B} = \frac{1}{m}\sum_{i=1}^{m} \hat{p}_{\min}(x_i)$,
$\beta(\alpha)$ is a continuous function of $\alpha$, and $\tilde{\varepsilon}(m, |H|, \theta, \delta) = \varepsilon(m, |H|, \theta, \delta) - 2B\,\frac{\log|H|}{m}$, compared to the term in
Schapire's theorem.

Theorems 3 and 4 provide new upper generalization bounds for weighted averages of hypotheses. These bounds are closer to the true error than the bound provided by margin theory, although not significantly. It is known that lower values of generalization bounds usually correspond to better classifier performance even if those bounds are loose [4]. This fact allows us to expect better classifier performance when minimizing the proposed generalization bounds instead of the conventional bound provided by margin theory. The bounds from Theorems 3 and 4 cannot be calculated directly because they require knowledge of the Bayesian error on the distribution P. However, Theorem 4 allows deriving a new AdaBoost-like algorithm which reduces the generalization upper bound by first estimating the posteriors and then minimizing the second term in the upper bound while keeping all other terms unchanged. The corresponding algorithm is described in the next section.
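Before moving on, a small numerical illustration of the reweighting behind these bounds (our example, not from the paper): an instance with $p(+1\,|\,x) = 0.9$ has $p_{\max}(x) - p_{\min}(x) = 0.8$ and $c(x) = +1$, so it carries weight $0.8\,P(x)/(1-2B)$ under $\tilde{P}$, while a heavily overlapping instance with $p(+1\,|\,x) = 0.55$ contributes only $0.1\,P(x)/(1-2B)$. The normalizing constant follows from
$$\int \big(p_{\max}(x) - p_{\min}(x)\big) P(x)\, dx = \int \big(1 - 2\,p_{\min}(x)\big) P(x)\, dx = 1 - 2B.$$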
4 ODDboost Algorithm

Following the line of margin theory, we designed an algorithm that takes a training sample and posteriors for each sample and minimizes the value of the bound from Theorem 3 (if exact posteriors are known) or the bound from Theorem 4 (if we know only rough estimates of the posteriors). The following theorem shows that this algorithm minimizes the upper generalization bounds from Theorems 3 and 4.

Theorem 5. Suppose that the base learning algorithm, when called by ODDboost, generates classifiers with weighted training errors $\varepsilon_1, \dots, \varepsilon_T$, where $\varepsilon_t = \sum_{i=1}^{m} D_t(i)\, p(y \ne h_t(x_i)\,|\,x_i)$. Then for any $\theta$, we have that
$$E_S\!\left(\frac{\hat{p}_{\max}(x) - \hat{p}_{\min}(x)}{1-2\bar{B}}\; I_{[E_y(yf(x)) \le \theta]}\right) \le 2^T \prod_{t=1}^{T} \varepsilon_t^{\,1-\theta}\,(1-\varepsilon_t)^{\,1+\theta}.$$
ODDboost algorithm
1) Estimate the posteriors $\hat{p}(+1\,|\,x_i)$, $i = 1, \dots, m$, for each training instance using the training set.
2) Assign weights to the training instances: $D_1(i) = \dfrac{\hat{p}_{\max}(x_i) - \hat{p}_{\min}(x_i)}{1 - 2\bar{B}}$, $i = 1, \dots, m$.
3) For $t = 1, \dots, T$:
   a) Obtain a weak hypothesis $h_t(x) \in \{-1,+1\}$, using the weights $D_t(i)$ and the posterior estimates $\hat{p}(+1\,|\,x_i)$ and $\hat{p}(-1\,|\,x_i)$, by minimizing $\varepsilon_t = \sum_{i=1}^{m} D_t(i)\, p(y \ne h_t(x_i)\,|\,x_i)$.
   b) Calculate $e_t = \sum_{i=1}^{m} D_t(i)\, p(y \ne h_t(x_i)\,|\,x_i)$.
   c) Calculate the weight of the current weak learner: $\alpha_t = \frac{1}{2}\ln\big((1-e_t)/e_t\big)$.
   d) Update the weights of the training instances: $D_{t+1}(i) = \dfrac{D_t(i)\, E_y\big(\exp(-y\,\alpha_t\, h_t(x_i))\big)}{Z_t}$, where $Z_t = \sum_{i=1}^{m} D_{t+1}(i)$ is a normalization constant.
4) Output the final classifier $\mathrm{sign}\sum_{t=1}^{T} \alpha_t h_t(x)$.
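The following Python sketch illustrates the training loop above. It is an illustration only, not the authors' implementation: the posterior estimates and the routine fit_weak_learner (which returns a hypothesis whose predictions are in {-1, +1}) are assumed to be supplied by the user.

import numpy as np

def oddboost(X, posteriors, fit_weak_learner, T=100):
    """Illustrative sketch of the ODDboost loop.

    X                : (m, d) array of training instances
    posteriors       : (m,) array of estimates of p(+1 | x_i)
    fit_weak_learner : callable(X, D, posteriors) -> h, where h(X) returns +/-1
    """
    p_plus = np.asarray(posteriors, dtype=float)
    p_minus = 1.0 - p_plus
    p_max, p_min = np.maximum(p_plus, p_minus), np.minimum(p_plus, p_minus)
    B_bar = p_min.mean()                           # empirical Bayes-error estimate

    # step 2: initial weights proportional to the posterior margin
    D = (p_max - p_min) / (1.0 - 2.0 * B_bar)
    D /= D.sum()

    hypotheses, alphas = [], []
    for _ in range(T):
        h = fit_weak_learner(X, D, p_plus)         # step 3a
        pred = h(X)
        # expected weighted error, using posteriors instead of hard labels (3a/3b)
        p_wrong = np.where(pred == 1, p_minus, p_plus)
        e_t = float(np.sum(D * p_wrong))
        if e_t <= 0.0 or e_t >= 0.5:
            break
        alpha = 0.5 * np.log((1.0 - e_t) / e_t)    # step 3c
        # step 3d: expectation over y of exp(-y * alpha * h(x))
        Ey = p_plus * np.exp(-alpha * pred) + p_minus * np.exp(alpha * pred)
        D = D * Ey
        D /= D.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def final_classifier(Xnew):
        score = sum(a * h(Xnew) for a, h in zip(alphas, hypotheses))
        return np.sign(score)
    return final_classifier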
The ODDboost algorithm has much in common with standard AdaBoost; the main differences are the following. While in the original AdaBoost the initial weights of the training examples are equal, in the proposed algorithm they are taken proportional to the value $\hat{p}_{\max}(x) - \hat{p}_{\min}(x)$; this allows concentrating on must-learn examples, which belong to the class $\arg\max_y \hat{p}(y\,|\,x)$ with high confidence.

Second, in the proposed algorithm the weak learner is trained using posterior estimates, and $\varepsilon_t = \sum_{i=1}^{m} D_t(i)\, p(y \ne h_t(x_i)\,|\,x_i)$ is minimized instead of simply minimizing the training error $\varepsilon_t = \sum_{i=1}^{m} D_t(i)\, I_{[y_i \ne h_t(x_i)]}$ as done by AdaBoost. This property can be useful because incorporating posterior estimates in this step can help to choose better weak hypotheses.

The last difference between ODDboost and AdaBoost is in the formula for updating the weights of the training instances. AdaBoost multiplies the weight of a training instance by $\exp(-y_i\,\alpha_t\, h_t(x_i))$ and thus uses the label of the training object from the training set; ODDboost
updates the training weights by multiplying them by $E_y\big(\exp(-y\,\alpha_t\, h_t(x_i))\big)$ and thus takes into account the fact that a training instance can belong to each class with known probabilities. This property helps ODDboost to avoid concentrating on hard-to-learn patterns, which AdaBoost is known to do. It is worth mentioning that AdaBoost and ODDboost have the same computational complexity.

In the approach presented by Schapire et al. [27] a new training set with weights over example-label pairs is created. This new training set contains three copies of each training instance with different weights and different labels. Each training instance $x_i$ is replicated twice, as $(x_i, +1)$ and as $(x_i, -1)$. In their work each replicate of a training example has a weight $w_l = \eta\, \hat{p}(l\,|\,x_i)$, where $l$ is the assigned class label of the new object, $\eta$ is a user-defined parameter of the algorithm, and $\hat{p}(l\,|\,x_i)$ is the estimated conditional probability of class $l$ given the feature vector $x_i$. We could view our algorithm in a similar way and say that the ODDboost algorithm includes two copies of each training instance in the new training set and sets the initial weights of the training instances equal to
$$w_l = \hat{p}(l\,|\,x_i)\,\big(\hat{p}_{\max}(x_i) - \hat{p}_{\min}(x_i)\big)\,/\,(1 - 2\bar{B}),$$
where $\bar{B} = \frac{1}{m}\sum_{i=1}^{m} \hat{p}_{\min}(x_i)$. So our method differs from the one proposed by Schapire et al. in the way of initializing the training example weights. Another difference is that we modify the formula for the weight update, while in [27] this step is kept unchanged.
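Written out with the paper's definition of $E_y$, the update factor used by ODDboost in step 3d is a posterior-weighted mixture of the two possible AdaBoost factors:
$$E_y\big(e^{-y\,\alpha_t h_t(x_i)}\big) = \hat{p}(+1\,|\,x_i)\, e^{-\alpha_t h_t(x_i)} + \hat{p}(-1\,|\,x_i)\, e^{+\alpha_t h_t(x_i)},$$
so a confidently labelled instance is reweighted almost exactly as in AdaBoost, while an instance with $\hat{p}(+1\,|\,x_i) \approx 0.5$ receives a nearly label-independent factor.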
5 Posteriors Estimation

Our goal is to roughly estimate the conditional probabilities of the training set samples' labels and to incorporate them into the proposed boosting procedure. We used the method for posterior estimation proposed in [29]. Its general idea is based on the use of Platt scaling [20] to obtain calibrated probabilities from boosting. Platt's scaling method fits a sigmoid to the raw classifier output, thus approximating the posterior as:
$$P(y\,|\,x) \approx \frac{1}{1 + \exp(A \cdot F(x) + B)},$$
where the constants $A$ and $B$ are set to maximize the likelihood on a special calibration set.

To estimate the posteriors we construct an iterative process; during the iterations we randomly split the data into three parts. We train a boosted committee on the first part, calibrate probabilities on the second, and estimate the posteriors for the samples' labels in the third part (using the built classifier and the estimated calibration parameters). We repeat this procedure several times. At each step we acquire an estimate of the class probabilities for training instances. The posterior estimates for every sample are averaged. In [29] the authors experimented with the Real AdaBoost algorithm [26] and used posterior estimates for eliminating noisy samples from the data. In this work we modify Discrete AdaBoost [12] in order to incorporate posterior estimates into the algorithm.

Averaging of estimates was extensively studied in the context of regression tasks [28]. Perrone [21] proved that averaging estimates always generates an improved estimate in
the sense of any convex optimization measure, even without an independence assumption on the estimates. Breiman [5] experimented with averaging class probability estimates obtained by learning decision trees on bootstrap replicates of the training set. He showed that averaging decreases the error of class probability estimation.
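The Python sketch below illustrates this posterior estimation procedure under stated assumptions: scikit-learn's AdaBoostClassifier stands in for the boosted committee, the labels are assumed to be in {-1, +1}, and the split sizes and number of repetitions are illustrative choices rather than the authors' settings.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def estimate_posteriors(X, y, n_repeats=100, seed=0):
    """Averaged Platt-calibrated estimates of p(y = +1 | x) for the training set.

    X : (m, d) numpy array, y : (m,) numpy array with labels in {-1, +1}.
    """
    rng = np.random.RandomState(seed)
    m = len(X)
    sums, counts = np.zeros(m), np.zeros(m)
    idx = np.arange(m)
    for _ in range(n_repeats):
        # random split into three parts: train / calibration / estimation
        i_train, i_rest = train_test_split(idx, test_size=2 / 3, random_state=rng)
        i_calib, i_est = train_test_split(i_rest, test_size=0.5, random_state=rng)

        booster = AdaBoostClassifier(n_estimators=100, random_state=rng)
        booster.fit(X[i_train], y[i_train])

        # Platt scaling: fit a sigmoid on the raw committee scores of the
        # calibration part (logistic regression on a single feature)
        platt = LogisticRegression()
        platt.fit(booster.decision_function(X[i_calib]).reshape(-1, 1), y[i_calib])

        # estimate posteriors for the third part with the calibrated sigmoid
        scores = booster.decision_function(X[i_est]).reshape(-1, 1)
        sums[i_est] += platt.predict_proba(scores)[:, 1]
        counts[i_est] += 1
    # average over repetitions (samples never drawn into the estimation part
    # keep a zero estimate; with enough repetitions this does not happen)
    return sums / np.maximum(counts, 1)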
6 Experiments

In order to study the behavior of the proposed ODDboost algorithm we conducted a set of experiments on real-world data. We compare the performance of AdaBoost and ODDboost with stumps as the base learner. Stumps were chosen as base learners to avoid possible issues with base learner complexity [23].

Table 1. Test error (%) of AdaBoost and ODDboost trained on various data sets
Dataset      1 iteration            100 iterations         1000 iterations
             AdaBoost   ODDboost    AdaBoost   ODDboost    AdaBoost   ODDboost
AUSTRA       13.2±0.5   14.2±0.2    12.1±0.1   16.0±0.1    13.1±0.1   15.0±0.5
AUSTRALIA    15.0±0.5   14.5±0.2    13.0±0.1   17.4±0.1    13.0±0.1   13.2±0.5
BREAST        8.5±0.3    7.5±0.3     4.5±0.1    4.2±0.0     5.4±0.0    4.4±0.1
CHECKER9     47.0±0.4   46.0±0.4    28.8±0.4   27.5±0.2    29.5±0.4   27.0±0.2
GERMAN       39.0±0.3   38.0±0.3    23.7±0.2   26.6±0.2    25.7±0.3   25.0±0.1
MADELON      37.5±0.3   37.6±0.3    40.1±0.5   38.7±0.2    41.5±0.3   38.0±0.1
MAGIC04      26.4±0.2   26.4±0.2    17.0±0.5   17.6±0.2    17.9±0.1   16.4±0.1
PAGE          6.3±0.1    6.4±0.1     3.4±0.2    3.6±0.1     3.3±0.1    2.9±0.1
PIMA         26.3±0.3   25.2±0.3    25.5±0.5   24.2±0.2    26.2±0.1   24.7±0.2
LETTER T     31.9±0.3   31.8±0.3    15.0±0.2   14.1±0.1    16.2±0.2   14.6±0.0
VOTE          4.4±0.2    4.4±0.2     3.7±0.2    4.3±0.2     4.6±0.2    4.3±0.2
Total:       2          1           6          12          1          14
We used the following experimental strategy. First we estimated the posterior probabilities by running 100 iterations of the procedure described in Section 5. We used an AdaBoost classifier as the initial classifier for posterior estimation. The obtained posteriors were then used as inputs for the ODDboost algorithm. We measured its 5x2 cross-validation error [8] after 1, 100, and 1000 iterations, respectively. In order to measure the performance of our algorithm on real-world data we selected 11 datasets from the UCI repository [3]. In our experiments we split the dataset
in two subsets of equal size and use one subset for training and the other for testing, and vice versa. We measured the test error. This procedure was repeated 5 times for a total of 10 runs (5x2 cross-validation). To compare our approach with standard boosting we ran the AdaBoost algorithm and estimated its error rate in a similar way.
Fig. 1. Cross-validation error depending on the number of iterations made by boosting. From top left to bottom right: test error on the Australian, pima-indians-diabetes, madelon and breast cancer datasets. Errors of standard AdaBoost are shown in red and errors of the proposed ODDboost are shown in blue.
Table 1 presents the results of our experiments. The best overall result is shown in bold, the second-best is shown in italics. The last line shows overall scores: if an algorithm performed best, 2 is added to its score; a second-best result adds 1 to the score. These results allow us to make the following conclusions: ODDboost generally performs better than AdaBoost both after 100 and 1000 iterations. As the number of learning iterations increases, AdaBoost tends to overfit on most datasets, while ODDboost still keeps high generalization ability. The typical test errors w.r.t. the number of iterations for AdaBoost and ODDboost are shown in Figure 1. It can be seen that ODDboost tends to be more stable. We suppose that this happens because we deliberately reduce the weights of noisy objects, thus concentrating mostly on reliable objects which do not lead to overfitting. The reweighting of training objects can be considered as a special kind of regularization which can potentially be used for any classification algorithm.
On the other hand, the obvious drawback of such an approach is its requirement of posterior estimates at training time. Posterior estimates are not always available for training samples, so a special procedure for the estimation of posteriors is needed. Not only is this a difficult problem by itself, but it also requires more computational effort.
7 Conclusion

We considered the problem of the poor performance of boosting in binary classification problems with overlapping classes. We introduced a new upper generalization bound for weighted averages of hypotheses based on a reduction of the overlapping-classes problem to a deterministic problem. The introduced bound does not depend on the number of iterations made by boosting. The proposed bound uses estimates of posterior probabilities for the training samples as an input and has a form similar to the upper bound provided by margin theory. When incorporated with accurate posterior estimates, the extra term of the proposed bound is lower than the one from margin theory. Based on the new upper bound we proposed the ODDboost algorithm, which is proven to minimize the second term in the bound while keeping all other terms unchanged. We incorporated the developed algorithm with the posterior estimation method proposed in [29] and performed experiments on real-world datasets from the UCI repository, which confirm that the proposed algorithm generally performs better than standard AdaBoost.
Acknowledgments. The authors would like to thank D. Kropotov for useful discussions. The work was partly supported by grants of the Russian Foundation for Basic Research (projects 08-01-00883 and 08-01-00405).
References
1. Barinova, O., Vetrov, D.: ODDboost: Incorporating posterior estimates into AdaBoost. Technical Report 1422 (2009), http://graphics.cs.msu.ru/en/publications/text/oddboost.pdf
2. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Fifth Annual Workshop on Computational Learning Theory. ACM Press, Pittsburgh (1992)
3. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
4. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998)
5. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
6. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006) (2006)
7. Collins, M., Schapire, R., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Machine Learning 48(1/2/3) (2002)
8. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998)
9. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2) (1999)
10. Domingos, P.: A unified bias-variance decomposition for zero-one and squared loss. In: Proc. of the 17th National Conference on Artificial Intelligence (2000)
11. Domingo, C., Watanabe, O.: MadaBoost: A modification of AdaBoost. In: 13th Annual Conference on Comp. Learning Theory (2000)
12. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1997)
13. Freund, Y., Schapire, R.: Discussion of the paper "Additive logistic regression: a statistical view of boosting" by J. Friedman, T. Hastie and R. Tibshirani. The Annals of Statistics 38(2), 391–393 (2000)
14. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28(2), 337–407 (2000)
15. Friedman, J.: Greedy function approximation: A gradient boosting machine. Annals of Statistics 29(5) (2001)
16. Grove, A.J., Schuurmans, D.: Boosting in the limit: Maximizing the margin of learned ensembles. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (1998)
17. Kearns, M., Vazirani, U.: An introduction to computational learning theory. MIT Press, Cambridge (1994)
18. Krause, N., Singer, Y.: Leveraging the margin more carefully. In: ACM International Conference Proceeding Series, vol. 69 (2004)
19. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. In: Neural Information Processing Systems, vol. 12, pp. 512–518. MIT Press, Cambridge (2000)
20. Niculescu-Mizil, A., Caruana, R.: Obtaining calibrated probabilities from boosting. In: Proc. of 21st Conference on Uncertainty in Artificial Intelligence (2005)
21. Perrone, M.: Improving regression estimation: Averaging methods for variance reduction with extension to general convex measure optimization. Ph.D. Thesis, Brown University (1993)
22. Ratsch, G.: Robust boosting and convex optimization. Doctoral dissertation, University of Potsdam (2001)
23. Reyzin, L., Schapire, R.: How boosting the margin can also boost classifier complexity. In: Proceedings of the 23rd International Conference on Machine Learning (2006)
24. Rosset, S.: Robust boosting and its relation to bagging. In: KDD 2005 (2005)
25. Schapire, R., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. In: Machine Learning: Proceedings of the Fourteenth International Conference (1997)
26. Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
27. Schapire, R., Rochery, M., Rahim, M., Gupta, N.: Incorporating prior knowledge into boosting. In: Machine Learning: Proceedings of the Nineteenth International Conference, pp. 538–545 (2002); Takenouchi, T., Eguchi, S.: Robustifying AdaBoost by adding the naive error rate. Neural Computation 16 (2004)
28. Taniguchi, M., Tresp, V.: Averaging regularized estimators. Neural Computation (1997)
29. Vezhnevets, A., Barinova, O.: Avoiding boosting overfitting by removing confusing samples. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS, vol. 4701, pp. 430–441. Springer, Heidelberg (2007)
Ensemble Learning: A Study on Different Variants of the Dynamic Selection Approach

João Mendes-Moreira (1,2), Alipio Mario Jorge (2,3), Carlos Soares (2,4), and Jorge Freire de Sousa (5)
(1) Faculdade de Engenharia, Universidade do Porto, DEI, Portugal
(2) LIAAD-INESC Porto L.A., Portugal
(3) Faculdade de Ciências, Universidade do Porto, Portugal
(4) Faculdade de Economia, Universidade do Porto, Portugal
(5) Faculdade de Engenharia, Universidade do Porto, DEIG, Portugal
[email protected], [email protected], [email protected], [email protected]
Abstract. Integration methods for ensemble learning can use two different approaches: combination or selection. The combination approach (also called fusion) consists of combining the predictions obtained by the different models in the ensemble to obtain the final ensemble prediction. The selection approach selects one (or more) models from the ensemble according to the prediction performance of these models on similar data from the validation set. Usually, the method used to select similar data is k-nearest neighbors with the Euclidean distance. In this paper we discuss other approaches to obtain similar data for the regression problem. We show that using similarity measures computed according to the target values improves results. We also show that dynamically selecting several models for the prediction task increases prediction accuracy compared to selecting just one model.
1 Introduction
Ensemble learning is an important research field in different research communities. It consists in the use of a set of prediction models, instead of just one, to accomplish the prediction task. The interest in ensemble learning is mainly due to the improvement of accuracy and robustness when compared to the use of just one model [10]. It has two main phases: ensemble generation and ensemble integration. When the same induction algorithm is used to generate all the models of the ensemble, the ensemble generation is said to be homogeneous; otherwise it is called heterogeneous. Ensemble integration can use two different approaches [16]: combination (also called fusion) or selection. The former consists in the combination of the predictions obtained by the different models in the ensemble to obtain the final prediction. The selection approach, in particular its dynamic version, selects one (or more) models from the ensemble according to the prediction performance of these models on similar data from the validation set. In this paper we focus on the dynamic selection approach, which is relatively less popular. Well-known ensemble methods such as bagging [3], boosting [9],
random forests [4], among others, use the combination approach. However, there are some indications that dynamic selection can obtain better results [29,27] than combination using the above mentioned ensemble methods. In this paper we present a study on dynamic selection, testing some alternatives to the usual selection methods described in the literature. We test algorithms (never previously used in the dynamic selection framework) that measure similarity between data according to the outputs. We also compare different integration functions in order to investigate whether there are benefits from selecting more than one predictor from the ensemble comparing to the selection of only one predictor. We start by reviewing the related work on the selection approach. Then, we propose alternative methods and present the experiments done, followed by a discussion on the results. Our results show that using similarity measures according to the target values improves results as well as using several models for the prediction task instead of just one.
2 Dynamic Selection
The assumption behind the use of dynamic selection is that each model/predictor in the ensemble has different prediction abilities on different subsets of the input space. What follows assumes that base models are already available as well as the estimates of the errors of these models on the validation data as described by Merz [19]. Dynamic selection (Figure 1) consists of the following tasks:
Fig. 1. The dynamic selection process
1. given an input value x, select similar data from the validation set;
2. select the model(s) from the ensemble according to their performance on the selected similar data;
3. predict $\hat{f}_i(x)$ for the given input value, for each selected model i;
4. obtain the ensemble prediction $\hat{f}_f$. This is straightforward if just one model is selected; otherwise, it is necessary to combine results.
$F = \{\hat{f}_i,\ i = 1, \dots, k\}$ represents the ensemble of $k$ predictors, and $k_1$ represents the number of selected predictors from the ensemble; $vd$ and $sd$ are, respectively, the validation data and the selected data. Dynamic selection typically uses three different data sets: the training set to obtain the $k$ base predictors of the ensemble $F$; the validation set to assess the base predictors' generalization error; and the test set to assess the ensemble predictor's generalization error. While task (3) is straightforward, the others are not. In this section we describe each one of the remaining three tasks, with particular emphasis on the methods used to obtain similar data.
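As an illustration of tasks (1)-(4), the following Python sketch implements the simplest variant discussed below (Euclidean k-nearest-neighbour search and selection of a single, locally best model); the fitted base models and the validation data are assumed to be given.

import numpy as np

def dynamic_selection_predict(x, models, X_val, y_val, k=10):
    """Predict for a single query x by dynamically selecting the locally best model.

    models       : list of fitted regressors with a .predict method
    X_val, y_val : validation set used to assess local performance
    """
    # task 1: find the k validation examples closest to x (Euclidean distance)
    d = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(d)[:k]

    # task 2: pick the model with the smallest squared error on the neighbours
    errors = [np.mean((m.predict(X_val[nn]) - y_val[nn]) ** 2) for m in models]
    best = int(np.argmin(errors))

    # tasks 3 and 4: with a single selected model, the ensemble prediction
    # is simply that model's prediction for x
    return models[best].predict(x.reshape(1, -1))[0]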
2.1 Obtaining Similar Data
The standard method for obtaining similar data (task 1) in the context of ensemble learning is the well-known k-nearest neighbors algorithm with the Euclidean distance [30]. One limitation of this method is that it weighs all input variables equally, even if the input variables have different levels of relevance in explaining the target variable. If the data set has many variables with a small impact on the target variable and a small subset of dominant input variables, this problem can significantly affect the results. Didaci & Giacinto propose the use of DANN - Discriminant Adaptive Nearest Neighbor [12], which locally reshapes the nearest neighborhood according to the target values. In practice, some of the explanatory variables are discarded, reducing the dimensionality of the problem. Experiments by Didaci & Giacinto show that this approach, as well as a dynamic choice of the number k of neighbors, can meaningfully improve the results when compared with the standard Euclidean distance [7]. In the context of random forests [4] there are also solutions that measure similarity according to the target values [28]. The k-nearest examples are the k examples with the highest proportion of trees from the forest in which the example appears in the same leaf node as the test example. All these approaches use the predictions of the base models for each example in the training set, as fully described in [19].

2.2 Selecting the Models
The simplest method of selecting models (task 2) is to pick the one with the best performance according to a given metric [30,11]. However, the dynamic selection approach can use more than one model [19]. The dynamic voting with selection method [29] and its regression version, called dynamic weighting with selection [24], use half of the models from the ensemble, selecting the ones which are locally (i.e., on the selected examples) more accurate.

2.3 Combining the Predictions of Base Models
When more than one model is selected, their results are combined (task 4). The dynamic voting (weighting) methods [22,29,24] obtain the final prediction by a weighted average of the base models' predictions, where the weights are proportional to the accuracy on similar data. The dynamic voting with selection gives better results than simple dynamic voting, for both regression and classification problems. Many other methods can be used to combine results [20].

Table 1. Data sets

Data set         Nr of examples   Nr of input vars
Diabetes          43                2
Pyrimidines       74               27
Breast cancer    194               32
Machine CPU      209                6
Boston housing   506               13
Our Study
In this section we study different approaches on ensemble selection in order to identify major advantages/ disadvantages of each approach. We only consider the regression problem. In all the experiments, we use the same set of base predictors, the same regression data sets obtained from the repository of Torgo [26] (Table 1), and the same experimental setup. All the variables in the data sets are numerical. All the experiments were carried out using the R-project [25] statistical package. The ensemble F (see Figure 1) has four models: – – – –
the linear model (function lm); CART (function rpart), using the default parameters; 1-nearest neighbor (function nn); 20-nearest neighbor (function nn).
For the construction of the ensemble, the only concern was to select models that are expected to make diverse errors. This motivated the use of 1 and 20 nearest neighbors. While with 1 nearest neighbor the prediction is unstable (i.e., small changes in the data may cause significant changes to the predictions), with 20 neighbors it is much smoother [32]. In our experimental setup we use two cross-validation cycles, illustrated in Figure 2. In the inner cycle, which is part of the training phase, we estimate the performance of the base-level algorithms and generate the ensemble. In the outer cycle, which is part of the testing phase, we estimate the performance of the ensemble. Firstly, the data is randomly split in 10 parts of approximately the same size. Then, a 10-fold cross-validation cycle (i.e., the outer cycle) is carried out, using nine subsets (training set ) to build the ensemble and the other subset (test set ) to estimate the performance of the ensemble. The generation of the ensemble consists of a 9-fold cross-validation cycle (i.e., the inner cycle) on the
Ensemble Learning: A Study on Different Variants
Fig. 2. Experimental setup (each fold is used in turn as the test set)
nine subsets that are provided as training set. Out of these nine subsets, eight are used to obtain the base models and the other subset (validation set) to estimate their performance. The process is repeated by using each of the nine subsets as validation set. Finally, the models that are part of the ensemble are obtained by applying the base algorithms to the whole training set and the ensemble is evaluated on the test set. The outer cycle is repeated by using each of the 10 subsets as test set. The reason for using 10-fold cross-validation is the small number of examples in some of the data sets used, which would make the performance estimates obtained using a simpler method (e.g., simple hold-out) unreliable [8]. The main advantage is to train the base predictors with more examples (which can be critical for small data sets), but it has the disadvantage of increasing the computational cost. The process can be repeated several times (we do it ten times) to obtain different sample values for the evaluation criterion. We evaluate the performance of the models using the variation index, calculated as the ratio between the square root of the mean squared error (mse) and the average of the corresponding target values:
$$mse = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}(x_i) - f(x_i)\big)^2, \qquad (1)$$
$$\text{Variation Index} = \frac{\sqrt{mse}}{\bar{y}}. \qquad (2)$$
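In code, Eqs. (1)-(2) amount to the following small helper (Python, for illustration):

import numpy as np

def variation_index(y_true, y_pred):
    """Root mean squared error expressed as a fraction of the mean target value."""
    mse = np.mean((y_pred - y_true) ** 2)
    return np.sqrt(mse) / np.mean(y_true)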
Its main advantage is to express the dispersion metric as a percentage of the average, which simplifies interpretation. Using the described experimental setup, three different experiments were executed each one of them with a different purpose:
– The first experiment compares different similarity searching approaches;
– The second one tests the effect of the number of similar examples;
– And the last experiment compares different approaches to accomplish tasks 2 and 4, i.e., different methods for both model selection (task 2) and model combination (task 4).
3.1 Testing Methods to Obtain Similar Data
The first experiment tests different methods for the selection of examples that are similar to the test case (task 1). We tested four methods; the last three are tested in the dynamic selection framework for the first time. The first three use the k-nearest neighbor (knn) algorithm (ten nearest neighbors were used in all three methods), but with different distance functions. The four methods we have tested are:
– Knn using the Euclidean distance (knn-euc): the Euclidean distance is obtained from the function dist of the R [25] statistical package.
– Knn using the kd-tree (knn-kdt): the kd-tree [2] is a generalization of the binary search tree. Just one input variable is tested at each split. It is particularly suited for high-dimensional data sets. Its main advantage is the low computational cost, since it reduces the search space with little accuracy loss. We use the function nn from the library knnFinder [13] of the R-project [25]. This is an implementation of the approximate nearest neighbor algorithm [1]. This algorithm differs from the original kd-tree because, instead of returning the k nearest neighbors, it returns an approximate number of k, more precisely k1 nearest neighbors where k1 ≤ k. This approach meaningfully reduces the computational cost with a low loss in accuracy.
– Knn using the RReliefF algorithm (knn-RRF): the RReliefF algorithm [23] (see details in appendix A) weighs each one of the input variables. These weights are used to obtain a weighted distance attribute. The array of weights W returned by the RReliefF algorithm (one weight for each input variable) is used to calculate the distance between the test example and each one of the examples in the training set using a weighted average. The disadvantage of RReliefF is its high computational cost when compared with the other two distance functions.
– CART's leaf node examples (CART-ln): the fourth method to obtain similar data uses the well-known CART algorithm [5]. After obtaining a tree with the CART algorithm on the training set, we apply the tree to the test example. The selected data are all the examples from the leaf node where the test example falls. We have used, as a CART implementation, the rpart function in R [25]. The pseudo-code for this method is:

program leaf.node.approach(trainingSet, testExample)
begin
  tree := rpart(trainingSet)
  leafNode := findleafnode(tree, testExample)
  members := getmembers(tree, leafNode)
  return(members)
end.

An important limitation of this method is that there is no upper bound on the number of similar examples; it is possible to define a minimum number of examples in the leaf nodes, but not a maximum. This problem can be solved, but it implies additional effort, such as using a distance measure to select k examples from the leaf node, or controlling the pruning process during the construction of the tree in a different way.

In the first method, knn-euc, the distance between the examples only takes into account the values of the independent variables, ignoring the values of the target. In the remaining three methods, the distance between examples is also affected by the value of the target variable in the training examples. For instance, RReliefF weighs the input variables according to how much they explain the target values for the selected neighbors.

For model selection (task 2) and model combination (task 4), we tested the following alternatives:
– Selection of the model which obtains the best performance on the selected examples. In this case, only a single model is selected, so there is no need for model combination. The prediction of the ensemble is simply the prediction of the selected model.
– Forward selection with replacement (FSwR) [6]. This method iteratively selects the model that, when included in the ensemble, maximizes the reduction in the error on the selected examples. It stops when the performance of the ensemble does not improve further on the validation set. It is important to note that, although the predictions of the models are combined using a simple average (task 4), in practice it is not a simple average, because the same model can be selected more than once, so the weight of each model depends on the number of times this model is selected [6].

Results on Methods to Obtain Similar Data. Each one of the methods for model selection is tested with each one of the methods to obtain the final prediction. Results obtained by the base predictors (LM, CART, nn-1, nn-20) and the simple average of the base predictions (mean) are also presented for baseline comparison (Table 2). The ratio of the time each method takes to obtain similar data when compared with knn-euc is presented in Table 3. Several conclusions can be taken from the results:
– Contrary to previous results [29,28], ours indicate that dynamic selection does not improve results when compared to a simple combination method of base models, such as the mean prediction. However, our results do show that dynamic selection obtains better overall results than any of the individual base models.
Table 2. The variation index (Eq. 2) using different methods to obtain similar data

           Base models                               Dynamic selection
           LM     CART   nn-1   nn-20  mean          CART-ln  knn-euc  knn-kdt  knn-RRF
Diabetes   0.133  0.133  0.177  0.136  0.131   Best  0.140    0.139    0.133    0.133
                                               FSwR  0.139    0.137    0.133    0.131
Pyrim      0.261  0.166  0.143  0.162  0.149   Best  0.171    0.170    0.169    0.166
                                               FSwR  0.164    0.168    0.168    0.161
R wpbc     0.728  0.824  0.971  0.706  0.717   Best  0.743    0.756    0.756    0.749
                                               FSwR  0.731    0.731    0.731    0.736
M-CPU      0.654  0.932  0.587  0.592  0.591   Best  0.745    0.743    0.706    0.725
                                               FSwR  0.706    0.705    0.697    0.678
Housing    0.217  0.215  0.318  0.316  0.206   Best  0.300    0.292    0.289    0.284
                                               FSwR  0.294    0.280    0.279    0.274
Table 3. Relative time compared to the knn with the Euclidean distance

           CART-ln  knn-euc  knn-kdt  knn-RRF
Diabetes   1.4      1.0      1.6        660.3
Pyrim      1.4      1.0      1.7      54072.6
R wpbc     1.5      1.0      1.9      27072.2
M-CPU      2.5      1.0      2.1       1297.7
Housing    0.6      1.0      1.3       1106.6
– By comparing the dynamic selection methods among themselves, we observe that although there is no absolute winner, in general, knn-RRF obtains the best results and CART-ln obtains the worst ones. We also observe that knn-kdt and knn-RRF obtain comparable or better results than knn-euc, although with some extra computational cost, particularly in the latter case.
– By comparing the combination methods, these results indicate that the use of forward selection with replacement is systematically better than selecting the best model, although the differences are not very large.
– Additionally, the simple combination method (mean) is the best overall method, although in some cases another algorithm may obtain slightly better results. It gives the best results when all the base predictors are reasonably accurate. This observation provides partial support to the ambiguity decomposition, where it is stated that the base predictors must be accurate and have uncorrelated errors [15].

3.2 The Effect of the Number of Similar Examples
The aim of the following experiment is to study the effect of the neighborhood size on the final accuracy (Tables 4 and 5¹). We used just one method to select the similar data for the dynamic selection approach (task 1): the k-nearest neighbors algorithm with the Euclidean distance.
¹ Tables 2 (left part), 5 and 7 contain results of the same set of methods on the same data sets; the values are slightly different because each experiment uses a different seed for pseudo-random number generation.
Table 4. The variation index (Eq. 2) using different sizes of the similar data
                Dynamic Selection
                2-nn   4-nn   6-nn   8-nn   10-nn  12-nn  14-nn  16-nn  18-nn  20-nn  25-nn  30-nn
Diabetes  Best  0.147  0.145  0.140  0.139  0.136  0.136  0.133  0.133  0.135  0.136  0.135  0.135
          FSwR  0.145  0.141  0.139  0.137  0.134  0.134  0.133  0.135  0.135  0.136  0.134  0.135
Pyrim     Best  0.174  0.168  0.163  0.179  0.161  0.161  0.161  0.160  0.161  0.162  0.163  0.163
          FSwR  0.168  0.165  0.160  0.162  0.161  0.160  0.158  0.158  0.160  0.161  0.161  0.162
R wpbc    Best  0.800  0.780  0.763  0.757  0.748  0.745  0.740  0.725  0.719  0.719  0.716  0.714
          FSwR  0.787  0.764  0.752  0.742  0.734  0.732  0.723  0.720  0.717  0.712  0.712  0.710
M-CPU     Best  0.735  0.712  0.734  0.788  0.794  0.781  0.797  0.814  0.817  0.829  0.837  0.862
          FSwR  0.718  0.691  0.722  0.755  0.753  0.732  0.746  0.752  0.753  0.737  0.756  0.770
Housing   Best  0.285  0.284  0.289  0.290  0.292  0.293  0.295  0.296  0.295  0.296  0.303  0.305
          FSwR  0.279  0.277  0.280  0.282  0.282  0.282  0.282  0.286  0.287  0.288  0.289  0.293
Table 5. The variation index (Eq. 2) using different sizes of the similar data (continuation)
           Base models
           LM     CART   nn-1   nn-20  mean
Diabetes   0.132  0.133  0.180  0.135  0.131
Pyrim      0.249  0.166  0.146  0.163  0.149
R wpbc     0.727  0.813  0.970  0.706  0.714
M-CPU      0.672  0.954  0.589  0.596  0.616
Housing    0.217  0.213  0.320  0.314  0.206
This experiment uses neighborhood sizes k ∈ {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 25, 30}. The same two methods for the local choice of the predictors (task 2) are also used in this experiment, as well as the apparent simple average for task 4 when more than one model is selected in task 2. As in the first experiment, the average without dynamic selection is also presented for baseline comparison, as well as the results using the base models. The best neighborhood size, k, as well as the sensitivity to its value, is problem dependent, as expected. For each data set, the best value of k is the same for both tested methods, even if without statistical validation.

3.3 Testing Different Methods for Tasks 2 and 4
The goal of this section is to test different methods for: the selection of the models (task 2); and the combination of the predictions, when more than one model is selected (task 4). Here we consider a more extensive set of methods than the ones tested in Sect. 3.1. The methods selected previously obtained good results for both regression [24] and classification [29,28]. For task 1 we use, for each data set, the two most promising combinations according to the previous experiments, because our goal is to identify the methods that obtain the best possible performance for each data set. From the first set of experiments we select a data selection method for each data set and for each one of the two methods of choosing the models (Best and FSwR). From
the second set of experiments we select the best value of k. When the CART-ln approach is used for data selection, the k value is determined automatically, as explained earlier. The methods we describe next accomplish both tasks 2 and 4:
– Best and FSwR, already described in Sect. 3.1;
– Dynamic weighting (DW), already mentioned in Sect. 2: it uses the vector w of weights (Appendix B presents how w is calculated) and the matrix sqe with the squared error of the models for each similar data point, in order to obtain the weight for each model [22,24] according to:

$$ wm_k = \frac{1 / \sqrt{\sum_{i=1}^{I} w_i \cdot sqe_{i,k}}}{\sum_{k1=1}^{K} 1 / \sqrt{\sum_{i=1}^{I} w_i \cdot sqe_{i,k1}}} \qquad (3) $$
where i is the example’s index, k and k1 are the model indexes, and K is the ensemble size;
– Dynamic weighting with selection (DWS) selects a subset of the models before executing DW. The selected models are the ones with a mean squared error not higher than a percentage threshold relative to the best model. A similar approach is already known [29,24], but for a fixed threshold. We test five different values for the threshold: 10%, 30%, 50%, 70% and 90%.
Results are presented in Tables 6 and 7. The best methods for each data set are in bold face. If more than one method is in bold face, the paired t-test does not reject the null hypothesis, i.e., the difference is not statistically significant at α = 0.05. The cells with specially formatted numbers are the ones where both the KS-Lilliefors normality test [17] and the skewness test for small samples [31] rejected normality at α = 0.05. Consequently, for these cells, the paired t-test was not done; more runs of the experimental setup would be needed. Several conclusions can be drawn:
– DW and DWS are clearly better than the two methods used in earlier experiments for tasks 2 and 4. The results obtained by dynamic selection with the former two methods are also competitive with the simple combination of the base models (mean).
– DW in combination with knn-RRF may be the best choice because it obtains similar or better results than DWS and has no parameters;
– These results show that it is possible to improve accuracy using more than one model when compared to the use of just the best model on the similar data set;
– It seems that there is a tendency for the knn-RRF approach to obtain better results in problems with more input variables. Although this is not surprising, it should be explored for problems of higher dimension.
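To make the dynamic weighting of Eq. (3) and the example weights of Appendix B concrete, the following is a minimal Python sketch; the function names and the toy data are ours, and the DWS variant reads the percentage threshold as "weighted error within (1 + threshold) of the best model", which is only one possible interpretation of the description above.

import numpy as np

def example_weights(dist):
    # Eq. (7): inverse-distance weights of the similar examples, normalized to sum to 1
    inv = 1.0 / np.maximum(np.asarray(dist, dtype=float), 1e-12)
    return inv / inv.sum()

def dw_weights(w, sqe):
    # Eq. (3): w is the vector of I example weights, sqe the I x K matrix of squared errors
    wse = (w[:, None] * sqe).sum(axis=0)   # weighted squared error of each of the K models
    score = 1.0 / np.sqrt(wse)             # more accurate models get larger scores
    return score / score.sum()             # normalize over the K models

def dws_weights(w, sqe, threshold=0.5):
    # DWS (assumed reading): keep models whose weighted error is within (1 + threshold)
    # of the best model, then apply DW to the kept subset (threshold=0.5 plays the role of "50%")
    wse = (w[:, None] * sqe).sum(axis=0)
    keep = wse <= wse.min() * (1.0 + threshold)
    weights = np.zeros_like(wse)
    weights[keep] = dw_weights(w, sqe[:, keep])
    return weights

# toy usage: 4 similar examples, 3 models
dist = [0.5, 1.0, 2.0, 4.0]
sqe = np.array([[0.1, 0.4, 0.2],
                [0.2, 0.5, 0.1],
                [0.1, 0.6, 0.3],
                [0.3, 0.4, 0.2]])
w = example_weights(dist)
print(dw_weights(w, sqe), dws_weights(w, sqe, threshold=0.3))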
Table 6. The variation index (Eq. 2) using different methods for tasks 2 and 4
Dataset   Data selection (task 1)  Best   FSwR   DWS 10%  DWS 30%  DWS 50%  DWS 70%  DWS 90%  DW
Diabetes  knn-kdt-14               0.135  0.136  0.132    0.131    0.131    0.131    0.131    0.131
Diabetes  knn-RRF-14               0.136  0.136  0.137    0.136    0.134    0.133    0.134    0.133
Pyrim     CART-ln                  0.166  0.166  0.164    0.155    0.151    0.149    0.149    0.149
Pyrim     knn-RRF-16               0.164  0.161  0.158    0.152    0.148    0.146    0.146    0.146
R wpbc    CART-ln                  0.734  0.732  0.736    0.708    0.706    0.708    0.709    0.711
R wpbc    knn-kdt-30               0.716  0.710  0.709    0.707    0.709    0.711    0.711    0.711
M-CPU     knn-kdt-4                0.706  0.694  0.606    0.567    0.559    0.556    0.553    0.539
M-CPU     knn-RRF-4                0.782  0.771  0.720    0.672    0.660    0.661    0.669    0.630
Housing   knn-kdt-4                0.277  0.271  0.258    0.232    0.220    0.215    0.212    0.207
Housing   knn-RRF-4                0.272  0.265  0.252    0.230    0.221    0.217    0.213    0.208
Table 7. The variation index (Eq. 2) using different methods for tasks 2 and 4 (continuation)
Dataset   LM     CART   nn-1   nn-20  mean
Diabetes  0.132  0.135  0.179  0.135  0.132
Pyrim     0.256  0.166  0.145  0.162  0.149
R wpbc    0.728  0.809  0.983  0.706  0.717
M-CPU     0.674  0.980  0.588  0.595  0.622
Housing   0.216  0.213  0.314  0.314  0.205
4 Conclusions and Future Research Directions
This paper describes a study on different variants of the dynamic selection approach to ensemble learning. In particular, we focus on the following tasks: example selection (task 1 in Section 2) and model selection and combination (tasks 2 and 4). For task 1, we have introduced three different methods (CART-ln, knn-kdt and knn-RRF) in order to test whether selection methods that take into account the value of the target in the training examples yield better results than methods that are only based on the independent variables (knn with the Euclidean distance). Knn-RRF [23] was the most promising one. The experiments also confirm, in all five data sets we have tested, that the dynamic combination of several models is better than selecting a single one. Although this result has been claimed previously in [14] for the classification problem, this is the first such study using regression data sets. It seems that the dichotomy “selection vs. fusion”, commonly discussed in the ensemble learning literature [16], does not make sense, because it is not proved that it is better to dynamically choose just one model instead of several models (Sect. 3.1). The
reduction in the variance achieved by averaging several results can explain why the selection of several models is better than the selection of just one [28]. Although this is not the focus of this paper, some of our results can be used to compare constant and non-constant weighting functions. For which type of problems are non-constant weighting functions better suited than constant weighting functions? This dichotomy was often used in the past [20], but has unfortunately fallen into disuse. Maybe this was due to the possibility of better defining the ensemble members when using constant weighting functions. It can be said that instead of choosing the best method to combine the predictions, research was driven towards the generation of the best ensemble (knowing, a priori, the integration function to use: typically the simple average). Some of the most successful approaches for ensemble generation, namely bagging [3], random forests [4] and negative correlation learning [18], use the simple average for integration. Our results provide further evidence that simple weighting gives competitive results. However, a more definitive answer to the question above needs further research. The use of non-constant weighting functions in concept-drifting problems [28], i.e., problems changing with time, is a possible direction of research. In the experiments described in this paper there was no particular concern with ensemble generation. The characteristics that the ensemble should have are particularly well known for regression problems when using a constant weighting function as the integration method [15]. Ensembles using non-constant weighting functions are much less studied. However, given the nature of these methods, an analytical decomposition of the generalization error may not be possible (because the integration function is non-constant and not known in advance) but, at least empirically, it is possible to obtain some insights. A possible approach to explore in the future is meta-learning [21].
Acknowledgments This work was partially supported by Programa de Financiamento Plurianual de Unidades de I&D.
References 1. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM 45(6), 891–923 (1998) 2. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517 (1975) 3. Breiman, L.: Bagging predictors. Machine Learning 26, 123–140 (1996) 4. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001) 5. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. Chapman and Hall/CRC, Boca Raton (1984) 6. Caruana, R., Niculescu-Mozil, A., Crew, G., Ksikes, A.: Ensemble selection from libraries of models. In: International Conference on Machine Learning (2004)
7. Didaci, L., Giacinto, G.: Dynamic classifier selection by adaptive k-nearest neighbourhood rule. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077, pp. 174–183. Springer, Heidelberg (2004) 8. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10, 1895–1923 (1998) 9. Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996) 10. García-Pedrajas, N., Hervás-Martínez, C., Ortiz-Boyer, D.: Cooperative coevolution of artificial neural network ensembles for pattern classification. IEEE Transactions on Evolutionary Computation 9(3), 271–302 (2005) 11. Giacinto, G., Roli, F.: Adaptive selection of image classifiers. In: Del Bimbo, A. (ed.) ICIAP 1997. LNCS, vol. 1310, pp. 38–45. Springer, Heidelberg (1997) 12. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6), 607–616 (1996) 13. Kemp, S.E.: knnfinder: Fast near neighbour search. R package version 1.0 14. Ko, A.H.-R., Sabourin, R., Britto Jr., A.d.S.: From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition 41, 1718–1731 (2008) 15. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: Advances in Neural Information Processing Systems, vol. 7, pp. 231–238 (1995) 16. Kuncheva, L.I.: Switching between selection and fusion in combining classifiers: an experiment. IEEE Transactions on Systems, Man, and Cybernetics-Part B 32(2), 146–156 (2002) 17. Lilliefors, H.W.: On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association 62(318), 399–402 (1967) 18. Liu, Y., Yao, X., Higuchi, T.: Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation 4(4), 380–387 (2000) 19. Merz, C.J.: Dynamical selection of learning algorithms. In: Fisher, D., Lenz, H.J. (eds.) International Workshop on Artificial Intelligence and Statistics. Learning from Data: Artificial Intelligence and Statistics, vol. V. Springer, Heidelberg (1996) 20. Merz, C.J.: Classification and regression by combining models. PhD thesis, University of California, USA (1998) 21. Prudêncio, R.B.C., Ludermir, T.B.: Meta-learning approaches to selecting time series models. Neurocomputing 61, 121–137 (2004) 22. Puuronen, S., Terziyan, V., Tsymbal, A.: A dynamic integration algorithm for an ensemble of classifiers. In: Raś, Z.W., Skowron, A. (eds.) ISMIS 1999. LNCS, vol. 1609, pp. 592–600. Springer, Heidelberg (1999) 23. Robnik-Šikonja, M., Kononenko, I.: Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53(1-2), 23–69 (2003) 24. Rooney, N., Patterson, D., Anand, S., Tsymbal, A.: Dynamic integration of regression models. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077, pp. 164–173. Springer, Heidelberg (2004) 25. R Development Core Team: R: A language and environment for statistical computing. Technical report, R Foundation for Statistical Computing (2006) ISBN 3-900051-07-0 26. Torgo, L.: Regression data repository 27. Tsymbal, A., Pechenizkiy, M., Cunningham, P.: Dynamic integration with random forests. Tech. Report TCD-CS-2006-23, The University of Dublin, Trinity College (2006)
28. Tsymbal, A., Pechenizkiy, M., Cunningham, P.: Dynamic integration with random forests. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 801–808. Springer, Heidelberg (2006) 29. Tsymbal, A., Puuronen, S.: Bagging and boosting with dynamic integration of classifiers. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 116–125. Springer, Heidelberg (2000) 30. Woods, K.: Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410 (1997) 31. Xycoon: Statistics - econometrics - forecasting 32. Yankov, D., DeCoste, D., Keogh, E.: Ensembles of nearest neighbor forecasts. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 545–556. Springer, Heidelberg (2006)
A The RReliefF Algorithm
Our implementation of RReliefF follows [23]:

Program RReliefF(R, τ, m, k)
{where each element of R (a vector of size M) is the input vector of each example
 from the training set; τ is a vector with the target values corresponding to R;
 m is the number of iterations; and k is the number of nearest examples}
begin
  set NdC, NdA[A], NdC&dA[A], W[A] to 0, where A = 1, 2, ..., a
  for (i := 1 to m)
    randomly select example Ri from R
    select k examples I nearest to Ri
    for (j := 1 to k)
      NdC := NdC + diff(τ(.), Ri, I[j]) × d(i, j)
      for (A := 1 to a)
        NdA[A] := NdA[A] + diff(A, Ri, I[j]) × d(i, j)
        NdC&dA[A] := NdC&dA[A] + diff(τ(.), Ri, I[j]) × diff(A, Ri, I[j]) × d(i, j)
      end for
    end for
  end for
  for (A := 1 to a)
    W[A] := NdC&dA[A]/NdC − (NdA[A] − NdC&dA[A])/(m − NdC)
  end for
  return(W)
end.

The variable a in the algorithm is the number of input variables of the data set. The parameter m is the number of iterations. Its choice is very important and the right value is problem dependent. The larger the value of m, the more stable
the weight estimates are. Additionally, the computational cost also increases. The authors report that stable results are typically obtained for values between 20 and 50 iterations. We use m = 50. The value used for k (the number of nearest examples) was 10, as suggested by the authors. From the several distance functions proposed by the authors, we use one that quadratically increases the cost of the distance:
$$ d(i, j) = \frac{d1(i, j)}{\sum_{l=1}^{k} d1(i, l)} \qquad (4) $$
$$ d1(i, j) = \frac{1}{\left(\sum_{l=1}^{a} diff(A_l, R[i], I[j])\right)^2} \qquad (5) $$
$$ diff(A, I_1, I_2) = \begin{cases} 0 & : d \le t_{eq} \\ 1 & : d > t_{diff} \\ \dfrac{d - t_{eq}}{t_{diff} - t_{eq}} & : t_{eq} < d \le t_{diff} \end{cases} \qquad (6) $$
where the values of t_eq and t_diff are, respectively, 5% and 10% of the length of the input variable's value interval, as suggested by the authors, and the value d represents the absolute difference of the input variable A for the two examples, I_1 and I_2.
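For illustration, a short Python sketch of the distance weighting of Eqs. (4)-(6); variable and function names are ours, and the thresholds follow the 5%/10% rule stated above.

import numpy as np

def diff(a_col, i, j, t_eq, t_diff):
    # Eq. (6): ramped difference of one input variable between examples i and j
    d = abs(a_col[i] - a_col[j])
    if d <= t_eq:
        return 0.0
    if d > t_diff:
        return 1.0
    return (d - t_eq) / (t_diff - t_eq)

def distance_weights(X, i, neighbours):
    # Eqs. (4)-(5): weight d(i, j) of each of the k nearest neighbours of example i
    value_range = X.max(axis=0) - X.min(axis=0)
    t_eq, t_diff = 0.05 * value_range, 0.10 * value_range
    d1 = []
    for j in neighbours:
        total = sum(diff(X[:, a], i, j, t_eq[a], t_diff[a]) for a in range(X.shape[1]))
        d1.append(1.0 / max(total, 1e-12) ** 2)   # quadratic emphasis of close neighbours, Eq. (5)
    d1 = np.asarray(d1)
    return d1 / d1.sum()                          # normalisation over the k neighbours, Eq. (4)

# toy usage: 4 examples with 2 input variables; neighbours of example 0
X = np.array([[0.0, 1.0], [0.1, 1.1], [0.4, 0.9], [1.0, 0.0]])
print(distance_weights(X, i=0, neighbours=[1, 2, 3]))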
B The Calculation of the Vector w of Weights for DW and DWS
The estimation of the error for each predictor weighs the error of each similar data point by the inverse of its distance to the input value, i.e., the weight used for each example is:

$$ w_i = \frac{1 / dist_i}{\sum_{i1=1}^{I} 1 / dist_{i1}} \qquad (7) $$
where dist is the vector of distances of the similar data examples to the input example and I is the number of examples, i.e., the size of the similar data set.
Relevance and Redundancy Analysis for Ensemble Classifiers
Rakkrit Duangsoithong and Terry Windeatt
Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, United Kingdom GU2 7XH
{r.duangsoithong,t.windeatt}@surrey.ac.uk
Abstract. In machine learning systems, especially in medical applications, clinical datasets usually contain high-dimensional feature spaces with relatively few samples, which leads to poor classifier performance. To overcome this problem, feature selection and ensemble classification are applied in order to improve accuracy and stability. This research presents an analysis of the effect of removing irrelevant and redundant features with ensemble classifiers on five datasets, and compares the results with a floating search method. Eliminating redundant features provides better accuracy and computational time than removing irrelevant features of the ensemble. Keywords: Feature selection, Ensemble classification, Redundant feature, Irrelevant feature.
1 Introduction
With improvements in information and healthcare technology, many medical information databases have been created. However, clinical data usually contain high-dimensional features with a small sample size, which leads to degradation in the accuracy and efficiency of the system through the curse of dimensionality. As Bellman stated in 1961 [1], high-dimensional datasets lead to the degradation of classifier performance because more features mean more complexity, harder classifier training and longer computational time. Over-fitting is another problem in high-dimensional feature spaces. It usually occurs when the number of features is high compared to the number of instances; the resulting classifier works very well on training data but very poorly on testing data. In order to deal with high-dimensional features, dimensionality should be reduced. Generally, there are two methods to reduce the dimension: feature extraction and feature selection. Feature extraction transforms or projects the original features to fewer dimensions without using prior knowledge. Nevertheless, it lacks comprehensibility and uses all original features, which may be impractical in large feature spaces. On the other hand, feature selection aims to select optimal feature subsets from the original features by removing irrelevant and redundant
features. It has the ability to reduce over-fitting, increase classification accuracy, reduce complexity and computational time, and improve comprehensibility by preserving the original semantics of the dataset. Basically, feature selection can be divided into four categories: Wrapper, Filter, Hybrid and Embedded methods [2],[3],[4]. The Wrapper method uses a predetermined learning algorithm to evaluate selected feature subsets that are optimal for the learning process. This method has high accuracy but is computationally expensive. The Filter method is faster than the Wrapper method and is independent of the learning method used in the classification process. It uses measurement techniques such as correlation, distance and consistency measurements to find a good subset from the entire feature set. Nevertheless, the selected subset may or may not be appropriate for the learning method. The Hybrid method combines the advantages of both the Filter and the Wrapper method. It evaluates features by using an independent measure to find the best subset and then uses a learning algorithm to find the final best subset. Finally, the Embedded method interacts with the learning algorithm but is more efficient than the Wrapper method because the filter algorithm is built into the classifier. An example of an Embedded method is Recursive Feature Elimination (RFE), which is embedded in Support Vector Machines. As illustrated by Liu and Yu (2005) [3], feature selection has four basic processes: subset generation, subset evaluation, stopping criterion and subset validation. Subset generation produces candidate subsets by complete (exhaustive), sequential (heuristic) or random search in three directions: forward (adding features to a selected subset that begins with the empty set), backward (eliminating features from a selected subset that begins with the full original set) and bidirectional (both adding and removing features). After that, the candidate subset is evaluated based on criteria such as distance, dependency, information gain and consistency measurements. The process stops when it reaches the stopping criterion. Finally, the selected subset is validated with validation data. There are many applications that apply feature selection as an important pre-processing step to improve system efficiency, such as web text mining and e-mail classification, intrusion detection, biomedical informatics, gene selection in microarray data, medical data mining, and clinical decision support systems. An ensemble classifier or multiple classifier system (MCS) is another well-known technique to improve system accuracy [5]. It is a group of base classifiers that learn a target function by combining their predictions. An ensemble has the ability to increase the accuracy of the system by combining the output of multiple experts, to improve efficiency by decomposing a complex problem into multiple sub-problems, and to improve reliability by reducing uncertainty. To increase accuracy, each classifier in the ensemble should be diverse or unique in order to reduce the total error, for example by starting with different inputs, initial weights, random features or random classes [6]. There are many approaches to ensemble classifiers, such as averaging, Bagging [7] and Boosting [8].
Fig. 1. Block diagram of Feature Selection with Ensemble classifier
Figure 1 presents the system block diagram of this research. The original dataset is sent to feature selection in order to remove redundant or irrelevant features. After that, the selected features are passed to the ensemble classifier, which uses an ensemble algorithm for training and for predicting the output.
1.1 Related Research
Feature selection and ensemble classification have interested researchers for many years. At the beginning of feature selection history, most researchers focused only on removing irrelevant features, as in ReliefF [9], FOCUS [10] and Correlation-based Feature Selection (CFS) [11]. In Yu and Liu (2004) [12], the Fast Correlation-Based Filter (FCBF) algorithm was proposed to remove both irrelevant and redundant features by using the Symmetrical Uncertainty (SU) measurement, and it was successful in reducing high-dimensional feature sets while maintaining high accuracy. Another technique for removing both irrelevant and redundant features, using discriminant and redundancy analysis, was proposed by Malarvili et al. (2007) [13] for neonatal seizure detection. The technique is based on the area under the Receiver Operating Characteristic (ROC) curve for discriminant analysis and a modified FCBF algorithm for redundancy analysis, by means of predominant features based on discriminant power. According to Deisy et al. (2007) [14], SU is not accurate enough to quantify the dependency among features and does not take into account the effect of two features on the class label during redundancy analysis. Decision Independent Correlation (DIC) and Decision Dependent Correlation (DDC) were proposed instead of SU to remove irrelevant and redundant features, respectively. DIC and DDC provide better performance than the FCBF algorithm in terms of the number of selected features, computational time and accuracy.
In Chou et al. (2007) [15], a modified FCBF algorithm was used to eliminate both redundant and irrelevant features for intrusion detection. In the redundancy analysis, they proposed to calculate SU between features and all original features. They found that the FCBF algorithm possibly keeps redundant features in the final optimal subset because it considers only the SU between selected features and the rest of the features one at a time. Biesiada and Duch (2008) [16] used SU to remove irrelevant features and proposed to use the Pearson χ² test to eliminate redundant features for biomedical data analysis. Moreover, a Kolmogorov-Smirnov based filter was also proposed [17] to reduce both redundant and irrelevant features. Although feature selection is widely used, there has been little work devoted to handling feature selection explicitly in the context of ensemble classifiers. Most previous approaches have focused on selecting optimal features, but rarely combine this with ensemble classification. Moreover, it is usual to remove both irrelevant and redundant features, in contrast to the approach proposed in this paper. Generally, the number of features in feature selection analysis can be divided into three categories: small scale (the number of features is less than 19), medium scale (the number of features is between 20 and 49) and large scale (the number of features is equal to or higher than 50) [18],[19]. The main purpose of this research is to find methods that can scale up to deal with hundreds or thousands of features.
2 Theoretical Approach
In our research, as shown in Figure 1, the Fast Correlation-Based Filter (FCBF) [12], an effective feature selection algorithm that removes both irrelevant and redundant features, is compared with Correlation-based Feature Selection with Sequential Forward Floating Search (CFS+SFFS) [11],[20]. The Bagging [7] algorithm for ensemble classifiers, described in Section 2.2, is experimentally compared with different learning algorithms.
2.1 Feature Selection
The FCBF algorithm has two stages: relevance analysis and redundancy analysis.
Relevance Analysis. Irrelevant features are removed from the original features, and correlation is widely used to analyze relevance. In linear systems, correlation can be measured by the linear correlation coefficient (r):
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \qquad (1) $$
However, in real-world applications, most systems are non-linear. We can measure correlation in non-linear systems by using the Symmetrical Uncertainty (SU):
$$ SU = 2\,\frac{IG(X|Y)}{H(X) + H(Y)} \qquad (2) $$
$$ IG(X, Y) = H(X) - H(X|Y) \qquad (3) $$
$$ H(X) = -\sum_i P(x_i) \log_2 P(x_i) \qquad (4) $$
where IG(X|Y) is the information gain of X after observing variable Y, H(X) and H(Y) are the entropies of variables X and Y, respectively, and P(x_i) is the probability of value x_i. SU is a normalized version of information gain with a range between 0 and 1. FCBF removes irrelevant features by ranking the correlation (SU) between each feature and the class. If the SU between a feature and the class equals 1, the feature is completely related to that class; if it equals 0, the feature is irrelevant to the class.
Redundancy Analysis. After ranking the relevant features, FCBF eliminates redundant features from the selected features based on the SU between feature and class and between feature and feature. Redundant features are defined by means of predominant features and the approximate Markov blanket. In Yu and Liu (2004) [12], a feature is predominant (both relevant and non-redundant) if it does not have any approximate Markov blanket in the current set.
Approximate Markov blanket: For two relevant features F_i and F_j (i ≠ j), F_j forms an approximate Markov blanket for F_i if
$$ SU_{j,c} \ge SU_{i,c} \quad \text{and} \quad SU_{i,j} \ge SU_{i,c} \qquad (5) $$
where SU_{i,c} is the correlation between any feature and the class, and SU_{i,j} is the correlation between any pair of features F_i and F_j (i ≠ j).
Correlation-Based Feature Selection (CFS). CFS [11] is one of the well-known techniques for ranking the relevance of features by measuring the correlation between features and classes and between features and other features. Given a number of features k and classes C, CFS defines the relevance of a feature subset by using Pearson's correlation equation
$$ Merit_S = \frac{k\, r_{kc}}{\sqrt{k + (k-1)\, r_{kk}}} \qquad (6) $$
where Merit_S is the relevance of the feature subset, r_{kc} is the average linear correlation coefficient between the features and the classes, and r_{kk} is the average linear correlation coefficient between different features. Normally, CFS adds (forward selection) or deletes (backward selection) one feature at a time; however, in this research, we used Sequential Forward Floating Search (SFFS) as the search direction.
Sequential Forward Floating Search (SFFS). SFFS [20] is a classic heuristic search method. It is a variation of bidirectional search and sequential forward search (SFS) with a dominant forward direction. SFFS
removes features (backward elimination) after adding features (forward selection). The number of forward and backward steps is not fixed but is dynamically controlled depending on the criterion of the selected subset; therefore, no parameter setting is required.
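To make the two stages of Sect. 2.1 concrete, the following rough Python sketch computes SU for non-negative, integer-coded (discretized) features and applies the relevance and redundancy steps in the FCBF style. The function names are ours, and the loop is a simplified rendering of the idea from [12] rather than a faithful reimplementation.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(x, y):
    # Eq. (2): SU = 2 * IG(X|Y) / (H(X) + H(Y)) for discrete variables
    ig = mutual_info_score(x, y)
    hx = entropy(np.bincount(x) / len(x))
    hy = entropy(np.bincount(y) / len(y))
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0

def fcbf_style_selection(X, y, delta=0.0):
    # Relevance analysis: rank features by SU with the class and drop those below delta
    su_c = np.array([symmetrical_uncertainty(X[:, f], y) for f in range(X.shape[1])])
    ranked = [f for f in np.argsort(-su_c) if su_c[f] > delta]
    # Redundancy analysis: drop a feature if a stronger retained feature forms an
    # approximate Markov blanket for it (Eq. 5)
    selected = []
    for f in ranked:
        redundant = any(su_c[g] >= su_c[f] and
                        symmetrical_uncertainty(X[:, g], X[:, f]) >= su_c[f]
                        for g in selected)
        if not redundant:
            selected.append(f)
    return selected   # keeping all of `ranked` instead gives the relevance-only variant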
2.2 Ensemble Classifier
Bagging. Bagging [7], or Bootstrap aggregating, is one of the earliest, simplest and most popular methods for ensemble-based classification. Bagging uses the bootstrap, which randomly samples with replacement, and combines the resulting classifiers with a majority vote. The bootstrap is the most well-known strategy for injecting randomness to improve generalization performance in multiple classifier systems, and it provides an out-of-bootstrap estimate for selecting classifier parameters [5]. Randomness is desirable since it increases diversity among the base classifiers, which is known to be a necessary condition for improved performance. However, there is an inevitable trade-off between accuracy and diversity, known as the accuracy/diversity dilemma [5].
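A minimal Python sketch of this scheme, assuming integer class labels, NumPy arrays and scikit-learn's decision tree as the base classifier (the helper name is ours):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, n_classifiers=50, seed=0):
    # Train n_classifiers base classifiers on bootstrap samples and combine by majority vote
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)                      # sampling with replacement
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    votes = np.array(votes)                                   # shape (n_classifiers, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])   # majority vote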
3 Experimental Setup
3.1 Dataset
The medical datasets used in this experiment were taken from the UCI machine learning repository [21]: heart disease, hepatitis, diabetes, lung cancer and Parkinson.
Heart Disease Dataset. Contributed by the Cleveland Clinic Foundation, it has 303 samples and 13 attributes, with 138 samples in the heart-disease-present class and 165 samples in the absent class.
Hepatitis Dataset. Donated by G. Gong from Carnegie-Mellon University, it contains 155 instances and 19 attributes, with live or die classes.
Diabetes Dataset. The Pima Indians Diabetes dataset, donated by Johns Hopkins University, has 768 samples and 8 numeric features, with tested-positive and tested-negative classes.
Lung Cancer Dataset. The pathological lung cancer dataset has 32 samples and 56 nominal features, with three classes.
Parkinson Dataset. The Parkinson's disease dataset consists of speech signals recorded by Max Little from the University of Oxford in collaboration with the National Centre for Voice and Speech, Denver, Colorado. It has 197 samples and 23 features, with two classes (healthy and Parkinson's patient).
Fig. 2. FCBF algorithm
Fig. 3. Proposed FCBF algorithm
3.2 Proposed Algorithm
Our research separates the FCBF results independently into relevance analysis (removing irrelevant features) and redundancy analysis (eliminating redundant features), as shown in Figures 2 and 3, and compares them with the original dataset and with the CFS+SFFS algorithm.
3.3 Evaluation
To evaluate the feature selection process, we use four widely used classifiers: Naive Bayes (NB), Multilayer Perceptron (MLP), Support Vector Machines (SVM) and Decision Trees (DT). The parameters of each classifier were chosen based on the highest accuracy of the base classifier on the experiment datasets. The MLP has one hidden layer with 16 hidden nodes, learning rate 0.2, momentum 0.3, 125 iterations, and uses the backpropagation algorithm with a sigmoid transfer function. The SVM uses a linear kernel with the regularization value set to 0.7, and the Decision Trees use the pruned C4.5 algorithm. The number of classifiers in Bagging is varied from 10, 20, 50, 100 to 250 classifiers. The threshold value of the FCBF algorithm in our research is set at zero for heart disease, diabetes and Parkinson, and at 0.14 and 0.28 for hepatitis and lung cancer, respectively. The classifier results were validated by 10-fold cross-validation with 10 repetitions for each experiment and evaluated by the percentage of test set accuracy and the training CPU computational time.
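The evaluation protocol can be approximated in scikit-learn as sketched below; the original experiments appear to use different implementations (e.g., C4.5 for the decision tree), so the parameter mapping is indicative only and the variable names are ours.

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_classifiers = {
    "NB": GaussianNB(),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), solver="sgd",
                         learning_rate_init=0.2, momentum=0.3, max_iter=125),
    "SVM": SVC(kernel="linear", C=0.7),
    "DT": DecisionTreeClassifier(),   # CART stand-in for the pruned C4.5 tree
}

def evaluate(X, y, n_bagged=50):
    # 10 x 10-fold cross-validated accuracy of each bagged base classifier
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    return {name: cross_val_score(BaggingClassifier(clf, n_estimators=n_bagged,
                                                    random_state=0),
                                  X, y, cv=cv, scoring="accuracy").mean()
            for name, clf in base_classifiers.items()}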
4 Experimental Result
Table 1 shows the number of selected features in each analysis. For example, in the heart disease dataset, redundancy analysis removed six redundant features, relevance analysis removed two irrelevant features, and CFS+SFFS eliminated six weakly relevant [22] features. The relationship between the number of selected features and the threshold setting of the FCBF algorithm with relevance and redundancy analysis of the five datasets is shown in Figure 4. Figure 5 presents the average accuracy of the five datasets using the four base classifiers against the threshold setting of the FCBF algorithm, and Figure 6 shows the average accuracy of the four base classifiers against the threshold setting. According to these figures, a zero threshold setting provides the best accuracy for the heart disease, diabetes and Parkinson datasets. For the hepatitis and lung cancer datasets, the FCBF algorithm provides the best accuracy when the threshold is set at 0.14 and 0.28, respectively.

Table 1. Number of selected features

Dataset        Original features  Redundancy analysis  Relevance analysis  CFS+SFFS
Heart Disease  13                 7                    11                  7
Hepatitis      19                 3                    17                  10
Diabetes       8                  4                    8                   4
Lung Cancer    56                 2                    56                  11
Parkinson      23                 5                    23                  10
Fig. 4. Selected Feature vs. Threshold (curves for the five datasets)
Fig. 5. Average Accuracy of four classifiers vs. Threshold (panels: Naive Bayes, Multilayer Perceptron, Support Vector Machines, Decision Tree)
Fig. 6. Average Accuracy vs. Threshold (for the five medical datasets)
Two examples of the selected features, for the heart disease and hepatitis datasets, obtained with FCBF and CFS+SFFS and compared to the research of Polat and Gunes [23], are shown in Figures 7 and 8, respectively.
Fig. 7. Selected Features for Heart disease dataset
In [23], the features were selected by using an expert knowledge system (decision tree) without ranking them. In FCBF, the features were ranked and both irrelevant and redundant features were eliminated, while in CFS+SFFS the features were selected based on their merit value. According to Figure 7, although the selected features from each analysis are not exactly the same, some features are commonly eliminated from the heart disease dataset by all three algorithms: fbs (fasting blood sugar), trestbps (blood pressure) and chol (cholesterol). In the hepatitis dataset, the features commonly removed by the three algorithms are antivirals, fatigue, liver firm and spleen palpable.
Fig. 8. Selected Features for Hepatitis dataset
Fig. 9. Average accuracy of five datasets for each classifier (panels: Naive Bayes, Multilayer Perceptron, Support Vector Machines, Decision Tree; curves: Original, Redundancy, Relevance, CFS+SFFS)
Figure 9 presents the average test accuracy of the five datasets using the four base classifiers compared with ensembles of size 10, 50, 100 and 250 (on a logarithmic scale). The solid line presents the original data set, the short-dashed line is the result of removing irrelevant features, eliminating redundant features is shown as a dashed line, and the CFS+SFFS result is presented as a dotted line. Figure 10 shows the average computational time of the five datasets. Using all features of the original set takes the longest time to compute, while using feature selection reduces the computational time. Ensemble classifiers take longer to train than the base classifier, depending on the number of classifiers. Figures 11-12 show the overall average accuracy and computational time over the five datasets and four classifiers, respectively. Moreover, the statistical significance (t-test) using the base classifier for one of the example datasets, the lung cancer dataset, is shown in Figure 13. From the number of significant wins of column compared to row, redundancy analysis is significantly better more often than CFS+SFFS and the original dataset, respectively. (There are no irrelevant features in this dataset.)
Fig. 10. Average computational time of five datasets for each classifier (panels: Naive Bayes, Multilayer Perceptron, Support Vector Machines, Decision Tree)
Fig. 11. Average accuracy of five datasets and four classifiers
Fig. 12. Average computational time of five datasets and four classifiers
Fig. 13. Example statistic significance of lung cancer dataset
5 Discussion
From our experimental results, redundancy analysis provides better accuracy than CFS+SFFS, relevance analysis and the original dataset, in that order. Both relevance and redundancy analysis clearly reduce complexity and computation time compared to the full original dataset. As the number of classifiers is increased, the MLP makes beneficial use of irrelevant features. It would appear that the diversity induced by irrelevant features assists the MLP optimization algorithm. Given
that this effect is classifier-dependent, it is proposed that the out-of-bootstrap error estimate could be used to determine when eliminating irrelevant features is desirable [6].
6 Conclusion
In this paper, we introduce a comparison between redundancy analysis and relevance analysis combined with ensemble classifiers for five medical datasets. From the experimental results, redundancy analysis is more important than relevance analysis for ensembles. Further work is aimed at increasing the number of datasets and reducing the training set size to investigate small-sample-size effects.
References 1. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (1961) 2. Liu, H., Dougherty, E., Dy, J., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L., Zhao, Z., Yu, L., Forman, G.: Evolving feature selection. IEEE Intelligent Systems 20(6), 64–76 (2005) 3. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005) 4. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007) 5. Windeatt, T.: Ensemble MLP Classifier Design. LNCS, vol. 137, pp. 133–147. Springer, Heidelberg (2008) 6. Windeatt, T.: Accuracy/diversity and ensemble MLP classifier design. IEEE Transactions on Neural Networks 17(5), 1194–1211 (2006) 7. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996) 8. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148– 156. Morgan Kaufmann, San Francisco (1996) 9. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005) 10. Almuallim, H., Dietterich, T.G.: Learning with many irrelevant features. In: Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 547–552. AAAI Press, Menlo Park (1991) 11. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proceeding of the 17th International Conference on Machine Learning, pp. 359–366. Morgan Kaufmann, San Francisco (2000) 12. Yu, L., Liu, H.: Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 5, 1205–1224 (2004) 13. Malarvili, M., Mesbah, M.: Hrv feature selection based on discriminant and redundancy analysis for neonatal seizure detection. In: 6th International Conference on Information, Communications and Signal Processing, p. 15 (2007) 14. Deisy, C., Subbulakshmi, B., Baskar, S., Ramaraj, N.: Efficient dimensionality reduction approaches for feature selection. In: International Conference on Computational Intelligence and Multimedia Applications, vol. 2, pp. 121–127 (2007)
15. Chou, T., Yen, K., Luo, J., Pissinou, N., Makki, K.: Correlation-based feature selection for intrusion detection design. In: IEEE on Military Communications Conference, MILCOM 2007, pp. 1–7 (2007) 16. Biesiada, J., Duch, W.: Feature Selection for High- Dimensional Data - A Pearson Redundancy Based Filter, vol. 45, pp. 242–249. Springer, Heidelberg (2008) 17. Biesiada, J., Duch, W.: A Kolmogorov-Smirnov Correlation-Based Filter for Microarray Data. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds.) ICONIP 2007, Part II. LNCS, vol. 4985, pp. 285–294. Springer, Heidelberg (2008) 18. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 33, 25–41 (2000) 19. Zhang, H., Sun, G.: Feature Selection using Tabu search. Pattern Recognition 35, 701–711 (2002) 20. Pudil, P., Novovicova, J., Kitler, J.: Floating Search Methods in Feature Selection. Pattern Recognition Letters 15, 1119–1125 (1994) 21. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/mlearn/MLRepository.html 22. John, G., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem, pp. 121–129. Morgan Kaufmann, San Francisco (1994) 23. Polat, K., Gunes, S.: A hybrid approach to medical decision support systems: Combining feature selection, fuzzy weighted pre-processing and airs. Computer Methods and Program in Biomedicine 88(2), 164–174 (2007)
Drift-Aware Ensemble Regression
Frank Rosenthal, Peter Benjamin Volk, Martin Hahmann, Dirk Habich, and Wolfgang Lehner
Technische Universität Dresden, Database Technology Group, 01062 Dresden, Germany
[email protected]
Abstract. Regression models are often required for controlling production processes by predicting parameter values. However, the implicit assumption of standard regression techniques that the data set used for parameter estimation comes from a stationary joint distribution may not hold in this context because manufacturing processes are subject to physical changes like wear and aging, denoted as process drift. This can cause the estimated model to deviate significantly from the current state of the modeled system. In this paper, we discuss the problem of estimating regression models from drifting processes and we present ensemble regression, an approach that maintains a set of regression models— estimated from different ranges of the data set—according to their predictive performance. We extensively evaluate our approach on synthetic and real-world data. Keywords: Ensemble Method, Regression, Process Drift.
1 Introduction
Regression models are important tools in scientific, industrial, and many other application areas. They are used to learn a functional dependency between a numeric target variable and other variables of a given data set. Knowledge of such dependencies is often required to control industrial production processes. A concrete example in semiconductor manufacturing is etching, a process where material is physically removed with means like acid or plasma to create the layout of integrated circuits. In this setting, the time that is required to etch a specific amount of material is an important process parameter. The time heavily influences the width of the structures that are etched, which defines the electrical properties of the final integrated circuit. Hence, the correct etch time has to be determined to attain a product that fulfills defined specifications on this most important quality measure. Naturally, etch times that deviate from the ideal time will result in lower product quality. Semiconductor manufacturing is technologically growing fast. Companies enhance their manufacturing processes continuously while creating products of smaller and smaller structure widths. Maintaining production processes that operate in such small
dimensions—e.g., 22nm—requires extensive use of regression models for control, since theoretical research alone cannot provide the required models [1,2,3]. Creating regression models for process control requires a sufficiently large data set. In single-lot semiconductor manufacturing, where only few expensive products of a certain design are manufactured in complex and long-running processes, this creates a new challenge. Since the acquisition of the data set expands over a long period of real time, elements of the production process change in the meantime and the examples basically represent different processes. In our example of etching, chemicals used in the process are subject to aging. Hence, an acid might react more slowly, thereby removing less material per unit of time. A control model that predicts etch time therefore ideally has to consider the age of the chemical. However, the knowledge to describe such aging processes is typically not readily available. In this paper, we consider drifting processes, i.e., processes that observably change over time. Aging is an example of a continuous drift, since it continuously changes the process. Estimating a regression model from examples of a drifting process violates the basic assumption that the data set has been drawn from one joint distribution of the involved variables. If regression is done anyway, the resulting model will not reflect the current state of the process but an average state over time. Note that such drifts occur in many industrial processes, since every machine wears and ages. In the context of single-lot semiconductor manufacturing, drift may result in direct monetary loss, since even small errors in the estimates of process parameters are critical. Besides continuous drift, there are also abrupt drifts. In our example, they may result from a change of the etching chemical. The effect on estimated regression models is similar to the effect of continuous drift, i.e., the estimated model does not reflect the current state of the modeled system. One basic approach to handle drift is to restrict the data set to the most recent examples, which represent the current state of the system best. However, this restricted data set may not contain enough information to estimate complex functional dependencies, like those that occur in semiconductor manufacturing. Therefore, a tradeoff between a current model and a stable model, in terms of average prediction error, must be found. In this paper, we describe drifting processes and how drift influences regression models that have been estimated from examples generated by such processes (Section 2). In Section 3, we then propose ensemble regression, an approach that learns a composite regression model that reduces the prediction error when used as a model of the current state of the system. Ensemble regression uses a set of regression models estimated from different ranges of the complete data set and maintains this set according to the predictive performance of the models. Predictions are drawn from the ensemble using a weighted average of the predictions of all ensemble members. Our approach can be used in connection with any regression method. We extensively evaluate the effects of certain types of drift as well as the predictive performance of models estimated by our approach on synthetic and real-world data in Section 4. Section 5 describes related work and Section 6 concludes the paper.
2 Drifting Processes and Regression
We consider the problem of regression on data from drifting processes. Regression is the task of estimating or learning a function yˆ = f (x) from a data set D consisting of n examples di = (yi , xj,i ), j = 1, . . . , k, i = 1, . . . , n where y is the numeric label (the dependent variable) and xj are the k features (the independent variables). In our etching example, the time of etching is the dependent variable, while the structural width—and thereby the amount of material—is the independent variable. Each time the etching process is executed, we can acquire an example of an etch time and the structural width that has been reached. The basic assumption in regression is that the data set has been drawn from exactly one joint distribution of dependent and independent variables [4] that captures the dependencies between them. If the joint distribution is stationary when the examples are acquired, the examples represent exactly one functional dependence, i.e., one function, disregarding the uncertainty introduced by noise that may be introduced by measurement equipment. In this paper, we consider the case when the data set is acquired from a drifting process. Such a data set is not distributed according to just one distribution. Instead, the distribution function is a function of time and drift is the change that occurs over time. Therefore, the examples acquired from the process represent a set of functional dependencies between dependent and independent variables. Each of these functions governed the process at a time when an example was drawn. Therefore, we denote a system that yields data according to a changing underlying functional dependency as a drifting process. The effect of drift is that the same input values may result in different output values when determined at different points in time. In terms of the governing function, drift may be viewed as a change in functional form or as a change in the parameters. We now illustrate this view with a simple example from physics. The electrical resistance of an object is a measure of the amount of opposition against an electric current, which is described by Ohm’s law: R = U I −1 . If we want to determine the electrical resistance of an object experimentally, one way is to put it into an electrical circuit, set a certain voltage U and measure the current I. Performing this repeatedly for different values of U enables the estimation of R using linear regression, where U = y and I = x. However, electrical resistance also depends on the temperature of the resistor; higher temperatures will result in higher resistance. Therefore, if the temperature during the experiment is not constant, the examples (U, I) will represent different underlying functions U = Ri I, where Ri is the resistance at the time a particular measurement was taken. In physics, this problem has been solved by analyzing the impact of temperature on the electrical resistance and extending the basic dependence in a form that treats the resistance as a function of temperature R(θ). However, deriving such laws requires significant insight into the observed system and is not feasible for complex real-world processes like single-lot semiconductor manufacturing. Even if the influence factors and their functional form were all known, some factors might not be observable at all or not observable with acceptable cost. Additionally, real-world drifts may behave non-deterministically with abrupt
224
F. Rosenthal et al.
Fig. 1. Effect of an abrupt drift from the first to the second true dependency on a regression model estimated from mixed samples (lines: 1st underlying function, 2nd underlying function, model from complete dataset, model from window)
changes and may be subject to random fluctuations. In our example of etching, these obstacles circumvent the modeling of the behavior of the drift. Therefore, an approach is required to estimate regression models from the data of drifting processes while minimizing the error introduced by the drift. We now present the characteristics of this error using another simple example that is depicted in Figure 1. Consider a linear functional dependence y = f (x) = ax + b from univariate input x to univariate output y. Suppose further that the functional form stays constant, but drift occurs in parameter a. Hence, a(t) is a function of time and y = ft (x) = a(t)x + b. Figure 1 depicts two such functions at different times t. An abrupt drift has occurred while gathering the examples. For the first five examples (the circles) a(t) = 1 and for the last five examples (the triangles) a(t) = 3. For presentation purposes, we excluded the presence of any measurement noise. In general, n examples are acquired that can be indexed by 1, . . . , n in the order of the time they have been drawn from the process. When using the complete data set, i.e., d1 , . . . , dn , to estimate the parameters a ˆ and ˆb of the linear model, an average over the underlying functions is the result, since the discrepancies between the examples from the different underlying functions are implicitly discarded as noise by the regression method. Applying this approach to our example results in the dotted line in Figure 1. The estimated model and the current underlying function (line with triangles) are different and there will be an error when using the estimated model for prediction. The dashed line in Figure 1 represents a model that has been estimated from a window of size 6, i.e., the more recent half of the data set and the last example from the first underlying function. This model resembles the recent state much closer. The sum of the squared prediction errors of this model is about a tenth
of the error of the model that had been estimated from the whole data set. This example gives an indication that using older data may result in a greater prediction error. We can minimize the prediction error by using only the last example. However, this is not feasible when the functional dependence is of a more complex nature (e.g., polynomial) or when non-parametric regression is used.
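The effect can be reproduced with a few lines of Python (a sketch using NumPy; the drift and the window size are chosen for illustration and differ slightly from the window of size 6 discussed above):

import numpy as np

x = np.arange(10, dtype=float)
a = np.where(np.arange(10) < 5, 1.0, 3.0)     # abrupt drift in the slope after five examples
y = a * x + 1.0                               # noise-free examples, as in Figure 1

full_model = np.polyfit(x, y, deg=1)               # averages over both underlying functions
window_model = np.polyfit(x[-5:], y[-5:], deg=1)   # only the most recent examples

x_new, y_true = 10.0, 3.0 * 10.0 + 1.0        # next point of the current (drifted) process
for name, (a_hat, b_hat) in [("full data set", full_model), ("window", window_model)]:
    print(name, abs(a_hat * x_new + b_hat - y_true))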
3 Ensemble Regression for Drifting Processes
We are given a data set D = {di }, i = 1, . . . , n, where di are examples from a drifting process and di = (yi , xj,i ), j = 1, . . . , k, y is the numeric label and xj are the k features to be used in the calculation of the prediction yˆ. The core idea to reduce the effect of drift on the estimated model is to restrict the data set used for parameter estimation to a window of examples, thereby excluding older examples that do not represent the current state of the modeled system. The basic approach consists of estimating a regression model from a window of examples d(n−w) , . . . , dn , where n is the index of the most recent example and w is the number of examples in the window. Whenever a new example is acquired, n is incremented to represent the expanded data set. The right border of the estimation window is fixed to the current value of i, since the most recent example represents the current state of the modeled system best. Since n increases with each example, the whole window is shifted and therefore older data is excluded. The difficulty with this approach lies in determining the window size w that results in a model with a small prediction error. An analytic solution requires knowledge about the characteristics of the drift of the modeled process, i.e., rate of drift or cycle length. In our application setting—single-lot semiconductor manufacturing—this knowledge is often not available. Additionally, drift may be discontinuous, e.g., abrupt and random, which causes the optimal window size to change as well within the lifetime of a prediction model. Therefore, we need an adaptive solution, where the window size is set as part of the training process. Changes in the characteristics of the drift can then be compensated. Since an exact determination of the window size is not possible in our setting, we avoid determining one best-effort window size. Instead, we use several windows of different size for training several regression models that are maintained according to predictive performance, i.e., when a model is not sufficient anymore, it is replaced. To maintain this set of models, the following statistics are needed for each model modelm . First of all, there is lm , the left border of the window of examples for estimating the associated model. Second, there is the number of positive posm and negative negm predictions that were made with the associated model. The counters are based on tests performed on new examples and a local threshold te for the acceptable error. Finally, there is the weight wm , which is used for calculating overall predictions from the ensemble. We describe the statistics more detailed in the following. The statistics and the model form a member Mm that can be uniquely identified by an index m. A set of members forms an ensemble E = {Mm }, m = 1, . . . , |E|.
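Before walking through Algorithm 1 below, the per-member bookkeeping can be summarized in a short Python sketch (class and method names are ours; the stability test and the retraining of the associated model are omitted):

from dataclasses import dataclass

@dataclass
class Member:
    # Statistics kept for each ensemble member, following the text above
    l: int                 # left border of the estimation window
    pos: int = 0           # test predictions with error below the local threshold t_e
    neg: int = 0           # test predictions with error of at least t_e
    weight: float = 0.0    # combination weight derived from pos
    model: object = None   # regression model estimated from examples d_l, ..., d_n

    def update_counts(self, err, t_e):
        if err < t_e:
            self.pos += 1
        else:
            self.neg += 1

    def should_evict(self, t_r):
        tested = self.pos + self.neg
        return tested > 0 and self.pos / tested < t_r

def update_weights(ensemble):
    # Weight of each member: its share of all positive test predictions
    total_pos = sum(m.pos for m in ensemble)
    for m in ensemble:
        m.weight = m.pos / total_pos if total_pos > 0 else 0.0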
Algorithm 1. Training algorithm for ensemble regression.
Require: emax: the maximum number of members in the ensemble
Require: te: threshold for the prediction error
Require: tr: threshold for the performance ratio
1:  E ← ∅; D ← ∅; n ← 0
2:  for each new d do
3:    n ← n + 1; dn ← d
4:    D ← D ∪ {dn}
5:    for each Mm ∈ E do
6:      if not is stable(modelm) then
7:        modelm ← train(lm, n); next
8:      end if
9:      err ← error(yn, predict(modelm, xn))
10:     if err < te then
11:       posm ← posm + 1
12:     else
13:       negm ← negm + 1
14:     end if
15:     if posm / (posm + negm) < tr then
16:       E ← E \ {Mm}
17:     else
18:       modelm ← train(lm, n)
19:     end if
20:   end for
21:   posT ← Σm=1..|E| posm
22:   for each stable Mm ∈ E do
23:     wm ← posm / posT
24:   end for
25:   if |E| < emax then
26:     E ← E ∪ {new member(dn)}
27:   end if
28: end for
We now present the algorithm (see Algorithm 1) that is used to maintain an ensemble by describing several iterations and the resulting state of the ensemble after each of them. The example is based on the data set depicted in Figure 1, which contained an abrupt drift after the fifth of ten examples. As initialization, we start with an empty data set D and an empty ensemble E. The main loop of the algorithm (line 2) is executed whenever a new example d is acquired. The index for the most recent example n is increased by one and the example is added to the data set D as dn . Hence, in the first iteration, D = {d1 }. Since the ensemble is empty, lines 5 to 24 have no effect and the first member is added to the ensemble in line 26. Members are added to the ensemble as long as the maximum number of members emax is not reached. This is a technical parameter to limit the required computational resources. The method new member(dn ) initializes the statistics of the new member and estimates the associated regression model from a minimal window containing only the most recent example dn .
The current value of n will be stored as left border lm of the estimation window. After this first iteration, E = {M1 }. When the next example d is acquired, it is added to the total data set as d2 and therefore D = {d1 , d2 }. Since the ensemble now contains one member, line 6 is executed for member M1 . The method is stable(modelm ) tests whether a member may be considered as stable, i.e whether the estimation window contains enough examples. The exact definition for stability depends on the used regression method. For linear regression, a stable fit requires at least as many examples as coefficients to be estimated. In our example, we use linear regression and two coefficients have to be estimated. Hence, M1 is still unstable in step two, because the current window for estimation contains only one example. In line 7, the model is now estimated from a larger window [lm , n] = [1, 2]. In general, the window associated with a member grows by one example for each example acquired. After this training, all other steps involving M1 are skipped. The second iteration of the algorithm is completed after adding the second member M2 , which is associated with a model that was estimated from d2 . Hence, E = {M1 , M2 }. In the third step, M1 has finally become stable and therefore the following steps from lines 9 to 19 are performed on it. First, M1 is tested by using the associated model modelm to predict the known label of the new example (line 9). Then, the prediction error is calculated using a given metric error, e.g., the root mean squared error. We use this testing scheme, since the most recent example represents the current underlying function best and since the most recent example has not been used in model estimation so far. In our example, no drift has occurred in this third iteration. Therefore, the error is err = 0. In lines 10 to 14, the error determined in the test is classified as positive or negative and the according counter (posm or negm ) is incremented. The parameter te that defines this local error threshold is intuitive to set. A larger threshold will result in more predictions classified as positive. In some application areas, like semiconductor manufacturing, this parameter is derivable from application knowledge, like quality specifications. In our example, the number of positive tests for M1 is pos1 = 1 in the third iteration, since err = 0 and because we set te = 1. In line 15, the test for eviction is performed. We propose a threshold-based criterion, where a member is evicted when posm (posm + negm )−1 < tr , i.e., when the ratio of positive test predictions to all test predictions falls below a given value 0 ≤ tr ≤ 1. This criterion has several important properties. It is resistant to outliers, since single negative predictions, e.g., those caused by noise, have limited influence when a member has reached a stable state. A member may even recover from a series of negative predictions, e.g., in cases of cyclic drift. Most importantly, members that have accumulated a large number of positives, and therefore have been in the ensemble for a long time, are harder to evict than members with fewer positives, since they can accumulate more negatives before fulfilling the eviction criterion. Therefore, mature members are favored but can still be evicted if a permanent drift occurs that makes them unusable.
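As an illustration of lines 9–19 of Algorithm 1, the following sketch (building on the Member class above; train, predict and error are placeholders for the chosen regression method and error metric, and data maps example indices to (x, y) pairs) performs the test-then-train step for one stable member and signals eviction by returning None.

```python
# Sketch of the per-member maintenance step of Algorithm 1 (lines 9-19).
def update_member(member, data, n, t_e, t_r, train, predict, error):
    """One test-then-train step for a stable member; data[n] = (x_n, y_n)."""
    x_n, y_n = data[n]
    err = error(y_n, predict(member.model, x_n))     # test on the newest example (line 9)
    if err < t_e:
        member.pos += 1                              # lines 10-14: classify the test result
    else:
        member.neg += 1
    if member.performance_ratio() < t_r:
        return None                                  # eviction criterion (lines 15-16)
    member.model = train(data, member.left, n)       # re-train on the grown window [l_m, n]
    return member
```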
(a) State at n=3    (b) State at n=5    (c) State at n=7    (d) State at n=9
Fig. 2. Window size, number of positive and negative predictions of the ensemble members. Data set from Figure 1.
Note that the prediction weight of such a member can decrease even while it is still in the ensemble, which accounts for its decreasing predictive performance. In our example, M1 stays in the ensemble, since the ratio of positive test predictions to all test predictions is one and we set tr = 0.5. Members with sufficient predictive performance remain in the ensemble and their associated models are trained on the window [lm, n], thereby including the new example. The threshold on the performance ratio influences how fast the ensemble reacts to drift: a larger ratio implies that few negative predictions are tolerated and eviction may happen early.

Figure 2 depicts the further evolution of the ensemble. White squares represent examples from the first underlying function, while gray squares represent examples from the second underlying function. The examples are ordered from left to right with ascending index. The rectangles below depict the window that each member of the ensemble used to estimate its associated model. The number in the rectangle is the member identification m; at the left, the column P shows posm, while N shows negm. Figure 2(a) shows the state of the ensemble after the completion of the third iteration. Three members have been added so far. M1 uses the largest window and is the only stable member. Figure 2(b) depicts the state after the fifth example has been incorporated. There are five members now, which means that the limit emax is reached; no further member will be added. Each member has been estimated from a different window, with the first member still using the largest window. It also has the largest number of positive predictions, since no drift has occurred yet. Member M5 is still considered unstable. Member M4 is stable but has not been tested on a new example yet, since the fourth and fifth examples were used in parameter estimation. No member has made any negative predictions.
Figure 2(c) depicts the state after the seventh example has been incorporated. There are again five members, but members M4 and M3 have been replaced by the new members M6 and M7. Member M4 was evicted in the previous step, while member M3 was evicted in the seventh step. Interestingly, member M5 is still in the ensemble, although its window covers an example from the previous underlying function. Figure 2(d) depicts the state after the ninth example has been incorporated. The members M1 and M2 have been evicted, since they were unable to make any more positive predictions and therefore dropped below the required performance ratio. Member M5 remained in the ensemble, since more and more current examples helped to improve its fit of the underlying function.

So far we have not discussed how overall predictions ŷ are determined from the ensemble. ŷ is obtained by calculating ŷm = predict(modelm, x) for each member in the ensemble and then computing a weighted average ŷ = Σm=1..|E| wm·ŷm, where the weights wm are determined as part of Algorithm 1 in two steps. First, in line 21, the total number of positive predictions posT = Σm=1..|E| posm of the members in the ensemble is determined. Then, this sum is used to determine the prediction weight of each member in lines 22 to 24. It is defined as the ratio of the member's positive predictions to the sum of positive predictions of all members: posm/posT. Hence, if a member has made only a small number of positive predictions while other members have made large numbers of positive predictions, its weight will be low, and vice versa. Mature members are therefore favored. Most importantly, the weight of a member can decrease if it does not make positive predictions while other members (e.g., newer members) do.
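A minimal sketch of this weighting scheme is given below. It assumes the Member class from the earlier sketches and treats a member with a fitted model as stable, which simplifies the window-size criterion used in the paper; the equal-weight fallback before any positive test is also our own choice.

```python
# Sketch of the overall prediction: weighted average with weights pos_m / pos_T
# (lines 21-24 of Algorithm 1); stability check simplified to "has a model".
def ensemble_predict(ensemble, x, predict):
    stable = [m for m in ensemble if m.model is not None]
    if not stable:
        return None                                         # no usable member yet
    pos_total = sum(m.pos for m in stable)
    if pos_total == 0:
        weights = [1.0 / len(stable)] * len(stable)         # fall back to equal weights
    else:
        weights = [m.pos / pos_total for m in stable]
    return sum(w * predict(m.model, x) for w, m in zip(weights, stable))
```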
4 Evaluation
In this section, we evaluate our approach on several types of synthetic data as well as a real-world data set. Our goal is to determine the prediction error when using different approaches to estimate regression models from drifting processes. We compare ensemble regression using different weighting schemes with the baseline (performing regression on the complete data set) as well as with the sliding window approach. A simulation environment with these algorithms was implemented in R. We used lm, included in the standard stats package, to estimate linear regression models. Since we restricted our experiments to the use of linear models in the ensembles, we also restricted our experiments to linear underlying functions. This ensures that the models are capable of representing the governing functions and no error can be induced by an improper functional form. Our systematic evaluation is therefore only valid for data sets from drifting processes that are governed by a linear dependency. However, we also present results based on a non-linear, real-world data set that indicate that our approach performs well in that setting, too.

Formally, our synthetic data sets were generated using a governing function y = f(x) = Σk=1..K wk(t)·xk + N(0, σ²), where K is the number of independent
variables, wk(t) is the time-dependent weight of attribute xk, and N(0, σ²) is a normal random variable that represents measurement noise. In this setting, drift manifests itself in changing values of wk(t), while the functional form of f(x) stays fixed. We define three different functional forms of wk(t); a short illustrative sketch of these three generators follows the discussion of the compared approaches below.

Linear Drift. wk(t) = at + b, where the slope a determines the speed of the drift. We fix b, since it has no influence on the prediction error, while a is varied. Results are presented in Figures 3(a) and 3(b).

Autoregressive Drift. wk(t) = wk(t−1) + N(μ, σ²); wk(0) = N(μ, σ²), where wk(t−1) is the weight determined for the previous example and wk(0) is the random starting weight. N(μ, σ²) is a normally distributed random variable with mean μ and variance σ². In our experiments, we vary μ while we fix σ. Results are presented in Figures 3(c) and 3(d).

Mixture Drift. wk(t) = wk(t−1) + a + I(p)·N(μ, σ²); wk(0) = N(μ, σ²), where wk(t−1) is the weight determined for the previous example, wk(0) is the random starting weight, and a is the slope. I(p) yields 1 with probability p and 0 with probability (1−p), i.e., I(p) is Bernoulli distributed. The term I(p)·N(μ, σ²) represents abrupt drift of random magnitude. This drift contains a fixed linear and a stochastic component and is therefore more realistic than the other two. In our experiments, we vary a and μ while we fix σ and p. Results are presented in Figures 3(e) and 3(f).

For each of these three types of drift, we fixed a set of values for the parameters that are varied, which resulted in 97 parameter sets. For each of them, 10 data sets were created, with each data set containing 1,000 examples. The values of the xn were drawn from a uniform distribution, the wk(t) were calculated, and the weighted sum yielded the associated yn as defined above. We then simulated the repeated estimation of a regression model from each of these data sets and recorded the root squared error of the overall predictions. The results are distributions of the error per regression approach over all parameter sets; they are depicted in Figure 3 with one row of box plots per drift type. Since the variance of the error differs significantly, the results are shown on two scales: the maximum displayed root squared error RSEmax is 1,000 in the left column and 100 in the right column. In each diagram, five box plots are shown. The labels on the x-axis correspond to the following five approaches:

bl. This is the baseline approach of estimating a regression model from the complete data set whenever a new example is added. This approach is clearly the worst because of the large median error and the large error variance for all three drift types; it is neither accurate nor stable. However, even for this approach small errors were observed, which can be explained by the dependence of the error on the values of the independent variables xn. This was illustrated in Figure 1, where the smallest error for the model estimated from the whole data set is zero at the intersection with the most recent underlying function.

win. The windowing approach achieved very good results on the linear data set, where the median and the first quartile are the lowest of all approaches.
This changes for the autoregressive data set, where the median and the first quartile are equal to those of our proposed approach (labeled ens-g). For the mixture drift, the windowing approach is second best. These results also confirm our assumption that a restriction of the data set used for estimation can result in a smaller prediction error.

ens-e. This approach is identical to our proposed ensemble regression scheme, except that it does not use the calculated weights. Instead, the overall predictions are determined as an equally weighted linear combination. The approach outperforms the baseline on the linear and the autoregressive data sets, but it is worse on the mixture drift data set. This can be explained by the relatively strong changes caused by the abrupt component in the mixture drift, which invalidates members very fast. Since it takes some time until the members are evicted, they contribute a large error that is propagated into the final prediction, since all weights are equal.

ens-l. This approach is identical to our proposed ensemble regression scheme, except that it employs the local performance ratio posm/(posm + negm) as weight, which seems to be an intuitive choice. However, the variance of the prediction error is even larger than when using equal weights.

ens-g. This is the approach proposed in this paper; it uses the global weight posm/posT. It yields the best overall performance, since it has the lowest error variance for all three drift types. For linear and autoregressive drift, the median error is slightly worse than or equal to the median error of the windowing approach. However, the median error is significantly lower for the mixture drift.

In a second set of experiments, we examined the influence of the maximum number of ensemble members emax and the eviction ratio tr on the distribution of the prediction error. We executed the simulation using different values for emax and tr on the data sets of all three drift types. The results were similar with respect to the observed influence, so we show only the error distribution based on the linear drift data sets. In Figure 4(a) the error distribution for different values of emax is depicted. It can be seen that emax has only a limited influence on the prediction error, although the median error and the error variance seem to increase slightly with increasing emax. However, we cannot conclude that using fewer members is a good strategy, since a smaller value of emax increases the chance of a completely unstable ensemble. This situation occurs when all stable members are evicted at once and new unstable members, which are introduced one by one, still have to gather examples before they are regarded as stable. In our experiments, we defined members with a window size smaller than 4 as unstable, since we used linear regression and 3 coefficients had to be estimated; otherwise, the underlying equation system would be under-determined and the resulting model insufficient. Using a larger emax can reduce the likelihood of this situation. We can only conclude that future work is required for application settings where this behavior of our approach is unacceptable. In the first set of experiments, depicted in Figure 3, emax was set to 10.
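The following sketch illustrates the three weight-drift processes defined above (linear, autoregressive and mixture). It assumes NumPy, and the default parameter values are ours for illustration only, not the settings of the paper's 97 parameter sets.

```python
# Illustrative generators for the three weight-drift processes (parameters are ours).
import numpy as np

def weight_path(kind, t_max, a=0.01, mu=0.0, sigma=0.05, p=0.05, rng=None):
    rng = rng or np.random.default_rng()
    w = np.empty(t_max)
    w[0] = rng.normal(mu, sigma)                             # random starting weight
    for t in range(1, t_max):
        if kind == "linear":
            w[t] = a * t + w[0]                              # w_k(t) = a*t + b
        elif kind == "autoregressive":
            w[t] = w[t - 1] + rng.normal(mu, sigma)          # random walk
        elif kind == "mixture":
            jump = rng.normal(mu, sigma) if rng.random() < p else 0.0
            w[t] = w[t - 1] + a + jump                       # linear + abrupt random component
        else:
            raise ValueError(kind)
    return w
```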
(a) Linear drift; RSEmax = 1,000    (b) Linear drift; RSEmax = 100    (c) Autoreg. drift; RSEmax = 1,000    (d) Autoreg. drift; RSEmax = 100    (e) Mixture drift; RSEmax = 1,000    (f) Mixture drift; RSEmax = 100
Fig. 3. Distribution of root squared error for different types of drift
(a) Influence of emax    (b) Influence of tr
Fig. 4. Parameter influence on the distribution of root squared error
In Figure 4(b), the influence of the eviction ratio is depicted. A higher value of tr results in a lower median error and a decreased error variance. In the first set of experiments, depicted in Figure 3, tr was set to 0.2. We did not study the impact of te further, since, as we argued in Section 3, it should be set using application knowledge where possible. In the first set of experiments, we fixed it to 0.1% of the total spread of the values of y, which translates to an absolute value of te = 5.

In Figure 5, we report results on a real-world data set from semiconductor manufacturing. To prevent any deductions about the underlying processes, we report only signed, relative errors. The data set is high-dimensional, non-linear, and contains a relatively small continuous drift. The baseline approach yields a median error of about 1.3%, which is reduced to −0.1% by ensemble regression. The model yielded by ensemble regression is therefore much better centered, while the error variance is identical.
Fig. 5. Distribution of relative error for a real-world data set
5 Related Work
In this paper, we consider the problem of estimating a regression model from data of a drifting process, whereby the most recent underlying function is to be approximated. To the best of our knowledge, this problem has not been considered so far. Current approaches in industrial applications can often use specially designed control models [5] to correct process drift and thereby eliminate the cause of the type of error investigated in this paper; standard regression techniques can then be applied there. However, these approaches are not feasible in our application setting because of the nondeterministic nature of the drift and the still ongoing theoretical research on the underlying physical phenomena [1,2,3].

Concept drift is a related problem in classification and was introduced in [6]. Concept drift occurs when a hidden context exists and changes while examples are gathered; changes in the hidden context can induce changes in the target concept. Analogously, in our problem setting, the underlying function changes and therefore influences the regression models estimated from the resulting data set. A number of approaches for handling concept drift exist. In [7] a system is presented that uses a sliding window of examples to construct concept descriptions using a description language. In [8] an algorithm for mining decision trees from a sliding window on a continuously changing data stream was presented; adaptation to concept drift is performed by replacing subtrees or by building alternative subtrees. All these approaches enhance particular classification methods so that they can adapt to concept drift. In [9,10] an approach is presented where ensembles of base classifiers are built from sequential chunks of data. Using such jumping windows is not appropriate in our application context, since the newest examples always have to be incorporated. In [11] dynamic weighted majority was proposed for tracking concept drift. It uses an ensemble of base classifiers that are built from growing windows. However, the different learning task made it necessary to use a different eviction criterion and weighting scheme. Regarding the latter, members start with a fixed weight, and each false prediction is penalized by reducing the weight by a constant fraction and normalizing the weights afterwards. In that approach, new members have a larger influence than in our approach, where the predictive weight is gathered slowly.
6 Conclusion and Outlook
Regression is an important, widely used tool, although the implicit assumption of a stationary joint distribution may not be met. Using regression on data from drifting processes results in a model that does not represent the current state of the process well. In application areas like semiconductor manufacturing, this induces an error that can be critical. In this paper, we presented ensemble regression, an approach that uses a set of regression models estimated from different ranges of the complete data set. The ensemble is maintained according to the predictive performance of the members, thereby yielding a low prediction error and a low error variance. In future work, we plan to widen our empirical study
to the examination of the effects of drift and the performance of our approach on non-linear dependencies as well. This will be accompanied by the use of nonlinear and nonparametric regression techniques as base learners. Additionally, we plan to investigate the use of sophisticated heuristics for adding new members to the ensemble.
References
1. Spitzlsperger, G., Schmidt, C., Ernst, G., Strasser, H., Speil, M.: Fault detection for a via etch process using adaptive multivariate methods. IEEE Transactions on Semiconductor Manufacturing 18(4), 528–533 (2005)
2. Bunday, B.D., Bishop, M., Donald, W., McCormack, J., Villarrubia, J.S., Vladar, A.E., Dixson, R., Vorburger, T.V., Orji, N.G., Allgair, J.A.: Determination of optimal parameters for CD-SEM measurement of line-edge roughness. Metrology, Inspection, and Process Control for Microlithography XVIII 5375(1), 515–533 (2004)
3. Yue, H.H., Qin, S.J., Wiseman, J., Toprac, A.: Plasma etching endpoint detection using multiple wavelengths for small open-area wafers. Journal of Vacuum Science & Technology A: Vacuum, Surfaces, and Films 19(1), 66–75 (2001)
4. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, Heidelberg (2003)
5. DiRaddo, R., Girard, P., Chang, S.: Process drift and model-based control of forming operations. In: Proceedings of the 2002 American Control Conference, vol. 5, pp. 3588–3593 (2002)
6. Schlimmer, J.C., Granger, R.H.: Incremental learning from noisy data. Machine Learning 1, 317 (1986)
7. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23(1), 69–101 (1996)
8. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: KDD 2001: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106 (2001)
9. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification. In: KDD 2001: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382. ACM, New York (2001)
10. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: KDD 2003: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235. ACM, New York (2003)
11. Kolter, J., Maloof, M.: Dynamic weighted majority: a new ensemble method for tracking concept drift. In: Third IEEE International Conference on Data Mining (ICDM), November 2003, pp. 123–130 (2003)
Concept Drifting Detection on Noisy Streaming Data in Random Ensemble Decision Trees

Peipei Li 1,2, Xuegang Hu 1, Qianhui Liang 2, and Yunjun Gao 2,3

1 School of Computer Science and Information Technology, Hefei University of Technology, China, 230009
2 School of Information Systems, Singapore Management University, Singapore, 178902
3 College of Computer Science, Zhejiang University, China, 310027
Abstract. Although a vast number of inductive learning algorithms have been developed for handling concept drifting data streams, especially those based on ensemble classification models, few of them can detect the different types of concept drift in noisy streaming data with low overheads of time and space. Motivated by this, a new classification algorithm for Concept drifting Detection based on an ensemble of Random Decision Trees (called CDRDT) is proposed in this paper. Extensive studies with synthetic and real streaming data demonstrate that, in comparison to several representative classification algorithms for concept drifting data streams, CDRDT not only effectively and efficiently detects potential concept changes in noisy data streams, but also requires much less runtime and space while improving predictive accuracy. Thus, our proposed algorithm provides a significant reference for the lightweight classification of concept drifting data streams with noise.

Keywords: Data Streams, Ensemble Decision Trees, Concept Drift, Noise.
1 Introduction
As defined in [23], a data stream is an ordered sequence of tuples arriving at certain time intervals. Compared with traditional data sources, it presents various new characteristics, such as being open-ended, continuous, and high-volume. It is hence a challenge for most traditional inductive models or classification algorithms to learn from such streaming data [18,19,9]. This is especially challenging when they are faced with concept drift and noise contamination in real applications such as web search, online shopping, or the stock market. To handle these problems, numerous classification models and algorithms have been proposed. The representative ones are based on ensemble learning, including an early ensemble algorithm, SEA [1], which addressed concept drift in data streams, a general framework
for mining concept-drifting data streams using weighted ensemble classifiers[2], a discriminative model based on the EM framework for fast mining of noisy data streams[4], decision tree algorithms for concept drifting data streams with noise[5,11] and a boosting-like method for adaptation to different kinds of concept drifts[6] etc.. However, for these algorithms referred above, the limitations mainly rely that on one hand, little attention is paid to handle various types of concept drifts in data streams impacted from noise. On the other hand, the overheads of space and runtime are probably demanded heavily while without a prominent improvement on predictive accuracy. Therefore, to address the aforementioned issues, we present a light-weighted ensemble classification algorithm of CDRDT for Concept Drifting data streams with noise. It is based on random decision trees evolved from semi-random decision trees in [14]. Namely, it adopts the strategy of random selection to solve split-test for the nodes with numerical attributes instead of the heuristic method. In comparison to other ensembling model of random decision trees for concept drifting data streams, there are four significant contributions in CDRDT: i) the basic classifiers are constructed incrementally with small various chunks of streaming data. ii) the inequality of Hoeffding Bounds[7] is adopted to specify two thresholds, which are used in the concept drifting detection from noise. It benefits distinguishing the different types of concept drifts from noise. iii) the sizes of data chunks are adjusted dynamically with the bound limit to adapt to the concept drifts. It is beneficial to avoid the disadvantage of too large or too small sizes of data chunks in the detection of data distribution, especially in the case with the classification method of majority class. iv) the effectiveness and efficiency of CDRDT in the detection on concept drifts from noisy data streams are estimated and contrasted with other algorithms, including the state-of-the-art algorithm of CVFDT[10] and the new ensemble algorithms of MSRT (Multiple Semi-Random decision Trees)[11] based on semi-random decision trees. And the experimental results show that CDRDT performs in a light demand on the overheads of time and space with higher predictive accuracy. The rest of the paper is organized as follows. Section 2 reviews related work based on ensemble classifiers of random decision trees learning from concept drifting data streams. Our algorithm of CDRDT for the concept drifting detection from noisy data streams is described in details at Section 3. Section 4 provides the experimental evaluations and Section 5 is the conclusion.
2 Related Work
Since the model of Random Decision Forests[12] was first proposed by Ho in 1995, the random selection strategy of split-features has been applied into the model of decision trees popularly. And many developed or new random decision trees have appeared, such as [24, 25, 17]. However, it is not suitable for them to handle data streams directly. Sub-sequentially, a random decision tree ensembling method[3] for streaming data was proposed by Fan in 2004. It adopts the crossvalidation estimation for higher classification accuracy. Hu et al. designed an incremental algorithm of Semi-Random Multiple Decision Trees for Data Streams
(SRMTDS )[14] in 2007. It uses the inequality of Hoeffding bounds with a heuristic method to implement split-test. In the following year, an extended algorithm of MSRT in [11] was further introduced by authors to reduce the impact from noise in the concept-drifting detection. At the same year, H. Abdulsalam et al. proposed a stream-classification algorithm of Dynamic Streaming Random Forests[13]. It is able to handle evolving data streams with the underlying class boundaries drift using an entropy-based drift-detection technique. In contrast with the algorithms based on decision trees ensembling mentioned above, our classification algorithm of CDRDT for concept drifting data streams proposed here behaves with four prominent characteristics. Firstly, the ensemble models of random decision trees developed from semi-random decision trees are generated incrementally in variable sizes of streaming data chunks. Secondly, to avoid the oversensitivity to the concept drifts and reduce the noise contamination, two thresholds are specified to partition their bounds in the inequality of Hoeffding Bound. Thirdly, the check period are adjusted dynamically for adaptation to concept drifts. Lastly, it presents better performances on the abilities of space, time and predictive accuracy.
3 Concept Drifting Detection Algorithm Based on Random Ensemble Decision Trees

3.1 Algorithm Description
The classification algorithm CDRDT proposed in this section is designed to detect concept drifts in data streams with noise. It first generates multiple random decision tree classifiers incrementally from variable-sized chunks of the data stream. After all streaming data in a chunk have been seen (i.e., the check period is reached), a concept drifting detection step is performed on this ensemble model. By means of pre-defined thresholds in the Hoeffding Bound inequality, the difference of the average error rates obtained with the Naïve Bayes or majority-class method at the leaves is taken to measure the distribution changes of the streaming data, and the different types of concept drift are further distinguished from noise. Once a concept drift is detected, we adjust the check period accordingly to adapt to the drift. Finally, majority-class voting or Naïve Bayes is used to classify the test instances. Generally, the process flow of CDRDT can be partitioned into three major components: i) the incremental generation of random decision trees in the function GenerateClassifier; ii) the concept drifting detection methods adopted in ComputeClassDistribution; iii) the adaptation strategies for concept drifts and noise in CheckConceptChange. The related details are given in the procedure below and in the following subsections.
Input: Training set: DSTR; Test set: DSTE; Attribute set: A; Initial height of tree: h0; The number of minimum split-examples: nmin; Split estimator function: H(·); The number of trees: N; The set of classifiers: CT; Memory constraint: MC; Check period: CP.
Output: The error rate of classification.
Procedure CDRDT(DSTR, DSTE, A, h0, nmin, H(·), N, CT, MC, CP)
1.  For each chunk of training data streams Sj ∈ DSTR (|CP| = |Sj|, j ≥ 1)
2.    For each classifier CTk (1 ≤ k ≤ N)
3.      GenerateClassifier(CTk, Sj, MC, CP);
4.    If all streaming data in Sj are observed
5.      averageError = ComputeClassDistribution();
6.      If the current chunk is the first one
7.        fError = averageError;
8.      Else
9.        sError = averageError;
10.     If (j ≥ 2)
11.       CheckConceptChange(fError, sError, CP, Sj);
12.       fError = sError;
13. For each test instance in DSTE
14.   For each classifier CTk
15.     Travel the tree CTk from its root to a leaf;
16.     Classify with the method of majority class or Naïve Bayes in CTk;
17. Return the error rate of voting classification.
Ensemble Classifiers of Random Decision Trees

In contrast to the previous algorithms in [11, 14], CDRDT on the one hand utilizes streaming data chunks of varying size to generate the ensemble classifiers of random decision trees. Here, random indicates that the split-test method adopted in our algorithm randomly selects an index of the discretization intervals formed by the ordered values of a numerical attribute and sets the mean value of this interval as the cut-point. On the other hand, a node with a discrete attribute is not split further until the count of collected instances meets the specified threshold (the default value is two). The remaining details of tree growth are similar to the descriptions in [11, 14].

Concept Drifting Detection

In this subsection, we first introduce several basic concepts relevant to concept drift.

Definition 1. A concept signifies either a stationary distribution of class labels in a set of instances of the current data streams or a similar distribution rule over the attributes in the given instances.

According to the divergence of concept drifting patterns, the change modes of a concept can be divided into three types: concept drift, concept shift, and sampling change, as described in [15].
Definition 2. Concept drift and concept shift are patterns with distinct change speeds in the attribute values or class labels of the database: the first refers to gradual change, the second to rapid change.

Definition 3. Sampling change is mostly attributed to a change in the distribution of the class labels (in this paper, all changes are simply called concept drifts).

In CDRDT, a concept drifting detection on the distribution changes of the streaming data is performed after a data chunk has traversed all random decision trees, and the various types of concept drift are distinguished from noise by means of the relation between the difference of the average classification error rates at the leaves and the specified thresholds. The thresholds are specified using the Hoeffding Bound inequality, described below. Consider a real-valued random variable r whose range is R. Suppose we have made n independent observations of this variable and computed their mean r̄. The bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε:

P(r ≥ r̄ − ε) = 1 − δ,   ε = √(R² ln(1/δ) / (2n))    (1)
Here R is defined as log(M(classes)), where M(classes) is the total number of class labels in the current database; n is the size of the current streaming data chunk; and the random variable r is the expected error rate of the Naïve Bayes or majority-class method at the leaves over all random decision tree classifiers in CDRDT. Suppose the reference value of r̄ is the historical classification result on the i-th chunk (denoted ēf) and the current observation is the classification result estimated on the (i+1)-th chunk (denoted ēs). The detailed definition of ēf (ēs) is formalized below:

ēf (ēs) = (1/N) · Σk=1..N [ Σi=1..Mkleaf (pki · nki) / Σi=1..Mkleaf nki ]    (2)
In this formula, N is the total number of trees, Mkleaf is the number of leaves of the k-th classifier, nki is the number of instances at the i-th leaf of classifier CTk, and pki is the error rate estimated under the 0–1 loss function at the i-th leaf of CTk. Based on Formula (2), we use the difference between ēs and ēf (i.e., Δe = ēs − ēf) to detect distribution changes of the class labels. More specifically, if the value of Δe is nonnegative, a potential concept drift is taken into account; otherwise, the case is regarded as free of concept drift. This is based on statistical theory, which guarantees that for a stationary distribution of the instances the online error of Naïve Bayes will decrease, whereas when the distribution function of the instances changes, the online error of Naïve Bayes at the node will increase [16]. For classification with the majority-class method, a similar rule can be concluded from the distribution changes
of class labels in small chunks of streaming data, provided they contain sufficiently many instances (in this paper, the minimum size of a data chunk, denoted nmin, is set to 0.2k, with 1k = 1000; this follows the conclusion in [22]). This is also verified in our experiments on the tracking of concept drifts in Section 4. Hence, Eq. (1) can be transformed into Eq. (3):

P(ēs − ēf ≥ ε0) = 1 − δ0,   ε0 = √(R² ln(1/δ0) / (2n))    (3)
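A small sketch of Eq. (2) is given below; the data layout (a list of (error rate, instance count) pairs per tree) is our own simplification of the leaf statistics, not the structures used in CDRDT itself.

```python
# Sketch of Eq. (2): error rate averaged over all N trees, where each tree
# contributes the instance-weighted error of its leaves.
def average_error(trees):
    """trees: list of lists of (error_rate, instance_count) pairs, one list per tree."""
    per_tree = []
    for leaves in trees:
        total = sum(n for _, n in leaves)
        per_tree.append(sum(p * n for p, n in leaves) / total)
    return sum(per_tree) / len(per_tree)

# delta_e = e_bar_s - e_bar_f, computed from two successive chunks, e.g.:
# delta_e = average_error(trees_on_current_chunk) - average_error(trees_on_previous_chunk)
```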
To distinguish the diverse types of concept drift from noise, it is necessary to specify different values of ε0 to partition their bounds, which correspond to the tolerated deviation between the current error rate and the reference error rate. Evidently, the larger the value of ε0, the higher the drifting likelihood; in other words, it is more probable that the previous model will no longer fit the current data stream because its classification accuracy is insufficient. Correspondingly, the value of δ0 decreases while the confidence 1 − δ0 increases. Therefore, inspired by [8], two thresholds are defined in the Hoeffding Bound inequality to control the deviation of the classification error rates, namely Tmax and Tmin. Considering the demands on the predictive ability of the current models, their values are specified as follows:

P(ēs − ēf ≥ Tmax) = 1 − δmin,   Tmax = 3ε0,   δmin = 1 / exp(Tmax² · 2n / R²)    (4)

P(ēs − ēf ≥ Tmin) = 1 − δmax,   Tmin = ε0,   δmax = 1 / exp(Tmin² · 2n / R²)    (5)
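The thresholds can be computed directly from Eqs. (3)–(5), as in the following sketch. We use the natural logarithm for R = log(M(classes)); this is an assumption, since the base of the logarithm is not stated in the text.

```python
# Sketch of the thresholds from Eqs. (3)-(5): epsilon_0 from the Hoeffding bound,
# T_min = epsilon_0 and T_max = 3*epsilon_0, with R = log(number of class labels).
import math

def hoeffding_thresholds(n_classes, n, delta_max=0.1):
    R = math.log(n_classes)                                      # natural log assumed
    eps0 = math.sqrt(R * R * math.log(1.0 / delta_max) / (2.0 * n))
    t_min, t_max = eps0, 3.0 * eps0
    delta_min = 1.0 / math.exp(t_max ** 2 * 2.0 * n / (R * R))   # Eq. (4)
    return t_min, t_max, delta_min

# Example: two-class stream, chunk of 200 examples (n_min = 0.2k), delta_max = 0.1
print(hoeffding_thresholds(n_classes=2, n=200, delta_max=0.1))
```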
Adaptation to Concept Drifts Contaminated by Noise

In accordance with the analysis above and the thresholds specified in Eqs. (4) and (5), four concept drifting states are distinguished: no concept drift, a potential concept drift, a plausible concept drift, and a true concept drift. Namely, if the value of Δe is negative, the case is treated as no concept drift; otherwise it is one of the other three possible drift states. More precisely, if the value of Δe is less than Tmin, a potential concept drift is considered (potential indicates that a slower or much slower concept drift is probably occurring). If it is greater than Tmax, a true concept drift is taken into account, resulting from a potential concept drift or from an abrupt concept drift. Otherwise, the case is attributed to the state of plausible concept drift, considering the possible effect of noise contamination: it spans the transition interval between a potential concept drift and a true concept drift. This fuzzy state helps to reduce the impact of noise in the data streams and to avoid over-sensitivity to concept drifts. Correspondingly, different strategies are adopted to handle the various types of concept drift. More specifically, in the case of no concept drift, the size of the current data chunk is kept at its default value (e.g., nmin). For a potential concept drift, the chunk size is increased by mmin instances (e.g., mmin = nmin = 0.2k). For a plausible concept drift, however, we shrink the size of
the streaming data chunk and the check period by one third each, because the change of the data stream needs to be observed further before a definite type of concept drift can be determined. Otherwise, for a true concept drift, both sizes are reduced to half of their original values. To avoid the disadvantages of streaming data chunks that are too large or too small, a maximum bound (e.g., mmax = 10·nmin) and a minimum bound (e.g., mmin) are specified to control the magnitude of change of a data chunk, for better adaptation to the concept changes. This means that once a bound is reached, the check period remains unchanged until a new concept drift occurs. Furthermore, to improve the utility of each tree, sub-branches whose classification accuracy is lower than the average level (e.g., 50%) are pruned.
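The following sketch summarizes the adaptation strategy described above. The bounds mmin = 0.2k and mmax = 10·nmin follow the examples given in the text; the behavior of the no-drift branch (leaving the check period unchanged) is our reading of that case.

```python
# Sketch of the drift handling: classify delta_e against the thresholds and
# adapt the check period; bounds follow the examples in the text.
def adapt_check_period(delta_e, t_min, t_max, period, m_min=200, m_max=2000):
    if delta_e < 0:                      # no concept drift: period left unchanged (our reading)
        new_period = period
    elif delta_e < t_min:                # potential (gradual) drift: grow by m_min instances
        new_period = period + m_min
    elif delta_e <= t_max:               # plausible drift: shrink by one third, keep observing
        new_period = period - period // 3
    else:                                # true drift: halve the window
        new_period = period // 2
    return max(m_min, min(m_max, new_period))
```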
3.2 Analysis
Generation Error Rate for the Concept Drifting Data Streams

According to the theorem on the generation error analyzed in [17], as the number of trees increases, for almost surely all sequences Θ1, . . . the generation error PE converges to

PX,Y ( PΘ(h(X, Θ) = Y) − maxj≠Y PΘ(h(X, Θ) = j) < 0 )    (6)
where X is the training set, Y is the class label, Θ is the random feature vector generated from the attribute set, P(X, Y) is the probability over the X, Y space, and h(X, Θ) is the classifier. Eq. (6) is derived under the assumption that the sequences Θ1, . . . are independent and identically distributed random vectors. However, for concept drifting data streams it is probable that the data distribution is no longer uniform over time. As a result, it is improper to judge the convergence of the generation error in this case. Therefore, in the analysis of our ensemble model, we give an infimum bound of the generation error. Because the concept drifting detection is performed after every certain period of instances, the training data are divided into small sequences θt (t ∈ {1, 2, . . . , |B|}, where |B| is the maximum sequence index). The generation error of our model on a chunk θt can then be expressed as

PEθt^Tt = Pθt(V(Tt, θt) = Y) − maxj≠Y Pθt(V(Tt, θt) = j)    (7)
where Tt denotes the current decision tree ensemble, in which each tree has been generated or updated with the data chunks {θk, 1 ≤ k ≤ t}, and V(·) is the voting function applied to the data chunk θt classified by the current ensemble of random decision trees. Considering the worst case, the generation error is defined as in Formula (8):

PE* = max( P(Tt,θt)(PEθt^Tt < 0) ) ≥ P(Tt,θt)(PEθt^Tt < 0)    (8)
Since maxj≠Y Pθt(V(Tt, θt) = j) ≤ 1 − Pθt(V(Tt, θt) = Y), Eq. (8) can be rewritten as Eq. (9):

PE* ≥ P(Tt,θt)( Pθt(V(Tt, θt) = Y) ≤ 0.5 )    (9)
Based on the analysis of the probability of an optimal ensemble model in [14], i.e., P(M(Attr), N, h0) = 1 − (1 − 1/M(Attr))^(N·2^h0 − 1), we take this probability as an estimate of the classification accuracy. Therefore, the generation error, referring to the probability that P(M(Attr), N, h0) ≤ 0.5, can be formalized as Eq. (10):

PE* ≥ P( P(M(Attr), N, h0) ≤ 0.5 )    (10)
This clearly shows that the higher the optimal probability of the ensemble model, the lower the generation error rate. As a consequence, we can adjust the number of trees or their heights to improve the predictive accuracy and thus adapt to concept drifts.
4 Experiments
To verify the efficiency and effectiveness of CDRDT in detecting different types of concept drift in noisy data streams, extensive experiments were conducted on diverse benchmark concept drifting databases and on real streaming data obtained from the Yahoo! Shopping Web Service. The experimental study shows that CDRDT not only detects concept changes in a timely and effective manner with a certain resilience to noise, but also outperforms CVFDT and MSRT in runtime, space, and predictive accuracy. This section is therefore divided into two parts: the first discusses the characteristics of all concept-drifting databases used in our experiments, and the second analyzes the drift tracking of CDRDT and its performance in terms of runtime, space, and predictive accuracy. (All experiments were performed on a P4 3.00 GHz PC with 1 GB of main memory running Windows XP Professional, and all algorithms used in our experiments are written in Visual C++.) Due to limited space, only partial experimental results are given below.

4.1 Data Source
Synthetic Data

HyperPlane. HyperPlane is a benchmark database of data streams with gradual concept drift, which has been used in numerous references, including [10, 3, 11, 2]. A hyperplane in a d-dimensional space (d = 50) is denoted by the equation Σi=1..d wi·xi = w0. Each vector of variables (x1, x2, · · ·, xd) in this database is a randomly generated instance, uniformly distributed in the multidimensional space [0, 1]^d. If Σi=1..d wi·xi ≥ w0, the class label is 1, otherwise it is 0. The bound
of each coefficient wi is limited to [−10, 10]. The initial value of each weight wi is generated at random; it then increases or decreases continuously by Δwi = 0.005 until it reaches the boundary, and changes direction with probability pw = 10%. Meanwhile, in order to simulate concept drift, we select 5 dimensions whose weights change, and the database contains a noise rate of r = 10%.

SEA. The artificial SEA data, first described in [1], is a well-known concept shift data set with numerical attributes only. It is composed of 60k random points in a three-dimensional feature space with two classes. All three features have values between 0 and 10, but only the first two features are relevant. The points are divided into four chunks with different concepts. In each chunk, a data point belongs to class 1 if f1 + f2 ≤ θ, where f1 and f2 are the first two features and θ is a threshold value between the two classes. There are four thresholds, 8, 9, 7, and 9.5, one per chunk. Each chunk reserves 2.5k records, containing 10% class noise, as the test set for the corresponding concept. The remaining 50k points are treated as training data, in which a concept shift appears every 12.5k instances.

KDDCup99. The KDDCup99 database [20] is a database for network intrusion detection, selected here because it has been used to simulate streaming data with sampling change in [15]. It has 41 attributes, 34 of which are numerical, and 24 class labels in total. Due to the skewed distribution of class labels, the data of classes with minor rates (i.e., whose total number of instances is lower than nmin) are treated as noise. Hence, the data set without noise contains 490k instances with 12 class labels.

Real Data

Yahoo! Shopping Data. The web shopping data used in our experiments were obtained via the interface of the Yahoo! web services. They are sampled from Yahoo! shopping databases related to catalog listing, product search, and merchant search. The basic feature set of this data contains 17 attributes, 10 of which are numerical. It is composed of product information with attributes such as (NnumeratingofProduct, AverageRating, etc.) and related merchant information with attributes such as (NumRatingsofMerchant, Price-SatisfactionRating, OverallRating, etc.). A product and a merchant are connected via the catalog listing (see [21] for more details). To mine the relation between the credibility of merchants and possible influencing factors, the attribute OverallRating with its different scores is used as the class label, divided into five class labels. Following the label distribution, we randomly extracted 84k instances from the obtained records as the training set and the remaining 28k instances as the test set.
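For illustration, the following sketch generates data in the spirit of the SEA and HyperPlane descriptions above. It assumes NumPy; details such as the train/test split and the full weight-drift schedule of HyperPlane are omitted or simplified, so it is not the data-generation code used in the experiments.

```python
# Illustrative generators in the spirit of the SEA and HyperPlane descriptions.
import numpy as np

def sea_stream(n_per_concept=12500, noise=0.10, thresholds=(8.0, 9.0, 7.0, 9.5), seed=0):
    rng = np.random.default_rng(seed)
    for theta in thresholds:                               # four concepts, one per chunk
        X = rng.uniform(0.0, 10.0, size=(n_per_concept, 3))
        y = (X[:, 0] + X[:, 1] <= theta).astype(int)       # only f1 and f2 are relevant
        flip = rng.random(n_per_concept) < noise           # 10% class noise
        y[flip] = 1 - y[flip]
        yield X, y

def hyperplane_example(weights, w0, d=50, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.uniform(0.0, 1.0, size=d)                      # uniform in [0, 1]^d
    label = int(np.dot(weights, x) >= w0)                  # class 1 iff sum(w_i * x_i) >= w_0
    return x, label
```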
4.2 Experimental Evaluation on Synthetic Databases
Before presenting the experimental results on the synthetic databases, we first define several symbols used in our experiments in the table below.
Symbol — Description
Max/Bayes — the classification method of majority class / Naïve Bayes.
***-Max/Bayes — "***" refers to the name of a database; the whole symbol indicates that the experimental results were obtained with the classification method Max/Bayes.
Error rate — the error rate estimated on a test set or on a specified data chunk, unit: (%). For a test set, the value is averaged over 20 runs classified by 20*N trees.
(T+C) time — the training + test time, unit: (s). Algorithms based on ensemble decision trees run in a simulated parallel environment of multiple PCs, so the time overhead is taken as the largest one among the N trees; for CDRDT, however, the "T" time is the total generation time of the N trees.
Memory — the total memory consumption of all trees, unit: (M). All results are averaged over 20 runs, as for the error rate.
Database name — structure: name + size of training data + size of test data + database type (C: numerical, D: discrete, CD: hybrid) + number of attribute dimensions, e.g., SEA-50K-2.5K-C-3.
Drift-Track — the process of detecting concept drifts within a certain time interval.
Drift-Level — for KDDCup99, the index of the different class labels, marked as Class label; for HyperPlane, the difference rate of class labels compared with the original database after introducing noise, marked as Drift-rate, unit: (%). The unified name is Drift Level.
Period-Change — short for the size of a data chunk in the concept drifting detection, i.e., the number of instances per detection, unit: (k).
Second, regarding the parameter settings, the parameters of CVFDT and MSRT follow their original definitions in [10, 11], respectively. For CDRDT, the parameter values are specified as follows: N = 10, h0 = M(attr)/2 as defined in [22], an initial streaming data chunk size of |Sj| = 0.2k (i.e., |CP|), MC = 500k, and δmax = 0.1 (ε0 and δmin are calculated from Eqs. (5) and (4), respectively). The experimental details are now described.

Tracking Concept Drifts

The tracking curves of concept drifts are plotted in Figures 1–4, which show the detection behavior on databases with different drift characteristics. In these figures, the Drift-Track curves are drawn as solid lines against the scale of the left y-axis and the Drift-Level curves as dotted lines against the scale of the right y-axis. The changing values of Period-Change are marked with a plus sign "+" in the corresponding figures; their scale is defined as follows: the lowest value starts from 1k and the basic unit (denoted BU) is 0.5k, based on the scale of the left y-axis.

In Figure 1-a, due to the different magnitudes of the Drift-Level, the corresponding tracking curve fluctuates accordingly. Especially at the beginning of the detection there are larger discrepancies, owing to the insufficiency of training data. Furthermore, when the current drifting level changes to another one, a jump appears in the tracking curve from a local minimum point to a local maximum one. However, due to the gradual concept drifts in HyperPlane, little deviation occurs between adjacent detection results; as a result, most check periods remain stable. Moreover, as the amount of streaming data increases, the fluctuation gradually converges.

Considering the concept shift detection on SEA, the tracking results fluctuate with three shift occurrences, as shown in Figure 2. Comparing the detection with the Max and Bayes methods, there are several common characteristics: i) the fluctuations are frequent at the beginning of the training
Fig. 1. (left:) Drift track on HyperPlane (right:) Classification results
[Figure 2 panels: SEA-database-Max and SEA-database-Bayes; each plots the error rate (%) and the check period (k) against the index of the data chunk, showing the Drift-Track and Period-Change curves.]
Fig. 2. Drift track over sequential data chunks for SEA
even when no concept change has occurred, which is similar to the case of HyperPlane; ii) upward and downward trends in the curves alternate with the shifting of the concepts. This mainly results from the distribution changes of the only two class labels in this database, whose four concepts vary from a ratio of 1.8:1 to 0.85:1 (in this figure, both the minimum period and the unit of the period refer to 0.2k instances, using the left y-axis as the scale axis).

With respect to the detection of sampling change, the tracking curves in Figures 3 and 4 describe the detection results on KDDCup99 with the Max and Bayes classification methods, respectively. From these observations we can see that the more frequent the changes in the distribution of class labels, the stronger the fluctuations. This is demonstrated noticeably in the curve segment spanning the interval from the 1st data chunk to the 121st one (for KDDCup99, the default value of the check period is set to 1k and the unit of BU is 0.2k; in Figure 3, the distance of two basic units on the left y-axis signifies one BU for a clearer presentation, while one unit refers to one BU in Figure 4). However, once a stable distribution of a class label is reached, the tracking curve falls, as during the curve segment between the 121st chunk and
[Figure 3 panel: KDDCup99-database-Max; the error rate (%) and the class label are plotted against the index of the data chunk, with the Drift-Track, Drift-Level, and Period-Change curves.]
Fig. 3. Drift track over sequential data chunks classified with Max for KDDCup99
[Figure 4 panel: KDDCup99-database-Bayes; the error rate (%) and the class label are plotted against the index of the data chunk, with the Drift-Track, Drift-Level, and Period-Change curves.]
0
Fig. 4. Drift track over sequential data chunks classified with Bayes for KDDCup99
the 321th one. Hence, the check period maintains invariably because the state of non-concept drift is considered in this case. Predictive Accuracy and Overheads of Time and Space Except of the drifting detection on the training data, we also evaluate the classification abilities of CDRDT on the different test sets in comparison to CVFDT and MSRT. To be concrete, firstly, the results classified on Hyperplane are plotted in Figure 1-b, including the overheads of runtime & space and the mean error rates of classification added variances. It is clear that CDRDT outperforms other two algorithms on all abilities involved above. Secondly, the classification results on SEA and KDDCup99 are also concluded in Tables 1 and 2 respectively. As shown in Table 1, though in the Max classification method, CDRDT does not perform as well as other algorithms on the predictive accuracy, the error rate could be reduced by 6.15% at least if adopting the method of Bayes. Furthermore, as similar to the case in HyperPlane, the overheads of space and runtime are lowest, which the largest deviation is up to dozens of times. However, in Table 2, the predictive accuracy in CDRDT is improved largely by 14.76% and 36.61% respectively in Max as compared with CVFDT and MSRT. Meanwhile, in the case with the Bayes method, the superiority of predictive accuracy in CDRDT is also prominent. Moreover, the performances on both of the runtime and space consumption are much better as well.
248
P. Li et al. Table 1. Classification results on SEA-50k-2.5k-C-3 Algorithm
CDRDT MSRT CVFDT
Error rate(%) Max Bayes mean variance mean variance 44.32 3.762 13.53 1.354 25.98 2.770 19.66 2.344 25.24 /
(T+C)time(s)
Memory(M)
Max 0+0 0+0 8+0
Max <1
Bayes 0+0 0+0 /
41
Bayes 5 7*5 /
Table 2. Classification results on KDDCup99-490k-310k-CD-41 Algorithm
CDRDT MSRT CVFDT
4.3
Error rate(%) Max Bayes mean variance mean variance 8.87 0.169 9.06 0.488 44.48 22.350 28.60 19.896 23.48 /
(T+C)time(s)
Memory(M)
Max 32+19 93+18 75+18
Max Bayes 6 7 65 ∗ 5 20 /
Bayes 41+123 92+504 /
Experimental Evaluation on Web-Shopping Data
For the real-world data streams, it is hard to judge whether the current data streams carry a potential concept drift or when a concept drift occurs. Meanwhile, it is inevitable to be affected from the noise. Therefore, to verify the feasibility and utility of our algorithm, we conduct several comparison experiments on real data streams of Yahoo! shopping data with other algorithms as well. The experimental results listed in Table 3 show that CDRDT is superior to CVFDT and MSRT in the predictive accuracy and the overheads of runtime and space. For instance, the predictive accuracy in CDRDT is highest, which is improved by 9.32% averagely even though in the worst case. And for the training time, it is reduced by the times of 3/2 at least with an approximate overhead on the test time in contrast to other two algorithms. In addition, on the space consumption, the maximum rate only takes a half of the space consumption in CVFDT while the minimum one is only 1/25 of MSRT. Table 3. Classification on Yahoo!-shopping-data-84k-28k-CD-16 Algorithm
CDRDT MSRT CVFDT
5
Error rate(%) Max Bayes mean variance mean variance 14.02 5.545 4.45 1.936 33.52 17.891 44.42 39.166 23.34 /
(T+C)time(s)
Memory(M)
Max 2+1 11+1 10+3
Max Bayes <1 6 30 ∗ 5 13 /
Bayes 6+5 15+20 /
Conclusion
In this paper, we have proposed an ensembling classification algorithm named CDRDT for Concept Drifting detection from noisy data streams, which is based on Random Decision Trees. In contrast to previous efforts on ensemble classifiers of decision trees or random decision trees, small data chunks with unfixed sizes are adopted in CDRDT to generate the classifiers of random decision tress incrementally. To effectively distinguish different types of concept drifts from noise,
Concept Drifting Detection on Noisy Streaming Data
249
two thresholds are defined in virtue of the inequality of Hoffeding Bounds. Furthermore, for better adaptation to the concept drifts, the check period is adjusted dynamically and timely. Moreover, extensive experiments are conducted on three types of synthetic concept drifting databases and a real-world database of Yahoo! shopping data. And the experimental results demonstrate that CDRDT could adapt to various concept drifts efficiently in the noisy data streams. In addition, as compared with the state-of-the-art algorithm of CVFDT and an ensemble algorithm of MSRT, it outperforms on the abilities of runtime & space and the predictive accuracy. Hence, a conclusion is drawn in the study that CDRDT is a light-weighted ensembling algorithm of classification. It would provide an efficient method for the detection on a variety of concept drifts in data stream. However, how to model the noise data to discern the concept drifts from noise accurately and how to deal with the cases with the skewed distribution of class labels in data streams are still challenging and interesting issues for our future work.
References 1. Street, W., Kim, Y.: A streaming ensemble algorithm (sea) for large-scale classification. In: 7th ACM SIGKDD international conference on Knowledge Discovery and Data mining, KDD 2001, pp. 377–382. ACM Press, New York (2001) 2. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining Concept-Drifting Data Streams Using Ensemble Classifiers. In: 9th ACM SIGKDD international conference on Knowledge Discovery and Data mining, KDD 2003, pp. 226–235. ACM Press, New York (2003) 3. Fan, W.: Streamminer: a classifier ensemble-based engine to mine concept-drifting data streams. In: 30th international conference on Very Large Data Bases, VLDB 2004, pp. 1257–1260. VLDB Endowment (2004) 4. Chu, F., Wang, Y., Zaniolo, C.: An adaptive learning approach for noisy data streams. In: 4th IEEE International Conference on Data Mining, pp. 351–354. IEEE Computer Science, Los Alamitos (2004) 5. Gama, J., Fernandes, R., Rocha, R.: Decision trees for mining data streams. Intelligent Data Analysis 10, 23–45 (2006) 6. Scholz, M., Klinkenberg, R.: Boosting Classifiers for Drifting Concepts. Intelligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data Streams 11(1), 3–28 (2007) 7. Hoeffding, W.: Probability inequalities for sums of bounded random variabless. Journal of the American Statistical Association 58(301), 13–30 (1963) 8. Castillo, G., Gama, J., Medas, P.: Adaptation to Drifting Concepts. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 279–293. Springer, Heidelberg (2003) 9. Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.Y.: Boat-optimistic decision tree construction. In: 1999 ACM SIGMOD International Conference on Management of Data, pp. 169–180. ACM Press, New York (1999) 10. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: 7th ACM SIGKDD international conference on Knowledge Discovery and Data mining, KDD 2001, pp. 97–106 (2001)
250
P. Li et al.
11. Li, P., Hu, X., Wu, X.: Mining concept-drifting data streams with multiple semirandom decision trees. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds.) ADMA 2008. LNCS, vol. 5139, pp. 733–740. Springer, Heidelberg (2008) 12. Ho, T.K.: Random decision forests. In: 3rd International Conference on Document Analysis and Recognition, pp. 278–282. IEEE Computer Society, Los Alamitos (1995) 13. Abdulsalam, H., Skillicorn, D.B., Martin, P.: Classifying Evolving Data Streams Using Dynamic Streaming Random Forests. In: Bhowmick, S.S., K¨ ung, J., Wagner, R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 643–651. Springer, Heidelberg (2008) 14. Hu, X., Li, P., Wu, X., Wu, G.: A semi-random multiple decision-tree algorithm for mining data streams. Journal of Computer Science and Technology 22(5), 711–724 (2007) 15. Yang, Y., Wu, X., Zhu, X.: Combining Proactive and Reactive Predictions for Data Streams. In: 11th ACM SIGKDD international conference on Knowledge Discovery in Data mining, KDD 2005, pp. 710–715. ACM Press, New York (2005) 16. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997) 17. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001) 18. Quinlan, R.J.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco (1993) 19. Shafer, J., Agrawal, R., Mehta, M.: Sprint: A scalable parallel classifier for data mining. In: 22th International Conference on Very Large Data Bases, VLDB 1996, pp. 544–555. Morgan Kaufmann, San Francisco (1996) 20. KDDCUP 1999 DataSet, http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html 21. Yahoo! Shopping Web Services, http://developer.yahoo.com/everything.html 22. Li, P., Liang, Q., Wu, X., Hu, X.: Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 376–388. Springer, Heidelberg (2009) 23. Wikipedia, http://en.wikipedia.org/wiki/Data_stream 24. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees. Neural Computation 9, 1545–1588 (1997) 25. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
Mining Multiple Level Non-redundant Association Rules through Two-Fold Pruning of Redundancies Corrado Loglisci and Donato Malerba Dipartimento di Informatica Universita’ degli Studi di Bari Via Orabona 4, 70125, Bari - Italy {loglisci,malerba}@di.uniba.it
Abstract. Association rules (AR) are a class of patterns which describe regularities in a set of transactions. When items of transactions are organized in a taxonomy, AR can be associated with a level of the taxonomy since they contain only items at that level. A drawback of multiple level AR mining is represented by the generation of redundant rules which do not add further information to that expressed by other rules. In this paper, a method for the discovery of non-redundant multiple level AR is proposed. It follows the usual two-stepped procedure for AR mining and it prunes redundancies in each step. In the first step, redundancies are removed by resorting to the notion of multiple level closed frequent itemsets, while in the second step, pruning is based on an extension of the notion of minimal rules. The proposed technique has been applied to a real case of analysis of textual data. An empirical comparison with the Apriori algorithm proves the advantages of the proposed method in terms of both time-performance and redundancy reduction. Keywords: Association Rules, Multiple Level Rules, Redundant Information.
1
Introduction
Association rules are a class of patterns which describe regularities or co-occurrence relationships in a set of data (e.g., transactions) [1]. Formally, an association rule is expressed in form of A⇒C, where both the antecedent A and the consequent C are sets of items (or itemsets) such that A∩C =. The meaning of an association rule is quite intuitive: a transaction which contains A is likely to contain C as well. To quantify this likelihood, two statistical parameters are usually used, namely support and confidence. The former, denoted as s(A ⇒ C), indicates the portion of the transactions where the conjunction A∪C occurs, and hence estimates the probability P(A∪C). The latter, denoted as c(A ⇒ C), indicates the portion of the transactions where C occurs out of the transactions in which A is present, thus it estimates the posterior probability P(C|A). The problem of association P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 251–265, 2009. c Springer-Verlag Berlin Heidelberg 2009
252
C. Loglisci and D. Malerba
rule mining consists of finding out all rules whose support and confidence values exceed two user-defined thresholds, called minsup and minconf respectively. The blueprint for most association rule mining algorithms proposed in the literature is a two-stepped procedure, according to which given a dataset D of transactions and the values of minsup and minconf, the method 1. finds out all frequent itemsets, i.e., the itemsets with support greater than or equal to minsup; 2. generates all association rules X ⇒ Y − X, where Y is a frequent itemset, X ⊂ Y and the confidence is greater than or equal to minconf. At the end of the second step, the rules generated are called valid (or strong) rules. Association rules are largely applied to many domains, but their usage remains problematic because of the huge number of association rules typically discovered and the resulting difficulty met in their analysis and interpretation. This problem is exacerbated in presence of redundant rules, i.e., AR which convey the same information conveyed by other rules of the same usefulness and the same relevance [5]. This important issue has attracted the attention of the research community (see [12] for a recent overview) and several methods have been proposed to obtain non redundant sets of rules which are in some way representative of the whole space of valid rules. Two main approaches are well identifiable: i) the usage of user subjective criteria and ii) the selection of relevant rules on the basis of statistical metrics. The former is followed, for instance, by Baralis et al. [4] who propose a framework to extract rules on the basis of user specifications described in form of templates. In [11] the user criteria are rather used to define a notion of interestingness which filters out all the uninteresting rules. The second approach resorts to post-processing techniques which return compact subsets with respect to some heuristic or statistical measures [9,3]. A different but not deeply investigated research line is that followed by the current work, that is, eliminating redundancies during the mining process, namely when frequent itemsets and/or valid rules are generated: discussion on the works proposed in the literature is postponed in Section 7. The problem of the redundancy in the AR is even more serious in the case of multiple level (or generalized) association rules [7],[2],[14], which represent an important extension of the AR and are discovered by exploiting a pre-defined taxonomic arrangement of the items. However, the organization of the items on several taxonomic levels leads to a combinatorial increase of discovered rules, many of which do not add further information to that expressed by a subset of them. More importantly, redundant rules makes the amount of work of the end user unbearable and, in ultimate analysis, it can cause the failure of a data mining project. In this paper, a novel method for the discovery of non-redundant AR at several taxonomic levels is proposed. It follows the usual two-stepped procedure for AR mining and prunes redundancies in each of two steps as sketched in the following. In the first step, a set of non-redundant frequent itemsets, named as multiple
Mining Multiple Level Non-redundant Association Rules
253
level closed frequent itemsets, is found out by extending the concept of closed itemsets. In the second step, the set of minimal association rules is produced on the basis of frequent itemsets previously found. The rest of the paper is organized as follows. In the next section, we illustrate the motivation of this work and explain our contribution. In Section 3 the problem of redundancy reduction is re-formulated two sub-problems: pruning the itemsets and pruning the set of generated rules. An algorithm which solves the former is proposed in Section 4, while a solution to the latter is reported in Section 5. In Section 6, we present the application to a task of textual data analysis where the multiple level non-redundant rules are exploited for discovering meaningful associations of biomedical concepts from biomedical literature. In Section 7 works related to ours are shortly presented and discussed. Finally, some conclusions close this paper.
2
Motivation and Contribution
Most approaches to keeping association rule redundancy under control operate either by pruning the set of rules which are deemed irrelevant according to some subjective notions of interestingness, or by preventing the generation of redundant rules during the mining process. Poor attention has been paid to the problem of removing redundant information when a taxonomic arrangement of the items is available. A taxonomy categorizes the items over several hierarchical levels by groups or classes. Its consideration in AR mining permits the generation of rules with hierarchically related items (multiple level ARs). Algorithms for multiple level AR mining do not face the problem of redundant information: they are basically focused on the strategies to scan the taxonomy. Our contribution aims to fill this gap through an approach that revises the usual two-stepped procedure. More precisely, we introduce two additional pruning criteria: – For the first step, by extending the concept of closed itemsets [13] to the case of hierarchically organized items. An itemset is closed if none of its supersets has the same support. For instance, if two itemsets A, B and A, B, C have the same support, then A, B is not closed, while if A, B, C has a strictly lower support than A, B, then A, B is closed. The interest toward closed itemsets is that they preserve information and help to keep computational complexity under control. – For the second step, by applying the notion of minimal rules [5]. For instance, if the following three rules R1 : A ⇒ B, C, R2 : A ⇒ B, C, D, and R3 : A ⇒ B, C, D, E have identical support and confidence, then R3 is a minimal rule in the set of three, since R1 and R2 can be derived from by R3 and convey no additional information. In the following sections we first provide a formal description of the problem, then we describe our two-fold contribution to redundancy removal during the mining process.
254
3
C. Loglisci and D. Malerba
Formal Statement
Let I: {x1 ,. . . , xn , y1 ,. . . ,ym , z} be a set of distinct literals called items and X be an itemset that contains k items (k-itemset), |X|=k. Given D a transaction set, each transaction T supports an itemset X if X⊆ T. The portion of transactions T⊆ D supporting an itemset X is the support of X, s(X). X is called frequent itemset if s(X)≥ minsup.
Fig. 1. Representation in form of hierarchical structure of a taxonomic arrangement of the items
Let G be a taxonomy represented in form of hierarchical structure over the items in I (see Figure 1) and organised by subtype-supertype (or is-a) relationships. The items at lowest levels are named as leaf items, those at intermediate levels as inner items, while the items at highest level as root item. An edge xi ← yj in G denotes a subtype relationship: the item xi is the direct subtype of yj , or conversely, the item yj is the direct supertype of xi . An item z is a supertype of xi if there is a path from xi to z (xi is a subtype of z). A transaction T supports an item yj ∈ I when yj ∈ T or ∃ xi ∈ T such that yj is a supertype of xi . An itemset Y is called multiple level itemset if it does not contain an item and its supertypes as well. An itemset Y is a supertype itemset of an itemset X, if Y can be obtained by replacing one or more items in X with one of their supertypes and |X| = |Y| (conversely X is a subtype itemset of Y). Given an itemset Y and Z one of its subsets, Z⇒ Y-Z is called multiple level association rule if no one of the items contained in the antecedent Z (conversely, consequent Y-Z) is supertype of any item contained in the consequent Y-Z (conversely, antecedent Z). Interestingly, rules containing items from the upper taxonomic levels represent more generalized information, conversely, those containing items from the lower levels represent more specialized information. Now we introduce some preliminary notions for explaining our contribution: – dXi be a function which maps I into a set of positive integer numbers, where dXi (xi )=1 iff xi is the root item in G, otherwise dXi (xi )= dXi (direct supertype (xi )) +1; – dX be a function which maps I into a set of positive integer numbers, where dX (X)= max(dXi (xi )), xi in X. – h be the composition f˚g where f and g are defined as follows:
Mining Multiple Level Non-redundant Association Rules
255
• f(D’): ℘(D’)→ ℘(I’), which associates with D’ the items common to all the transactions tk ∈ R and returns the set { xi ∈ I’ | ∀ tk ∈ D’, (tk , xi ) ∈ R }; • g(I’): ℘(I’) → ℘ (D’), which associates with I’ the transactions related to all items xi ∈ I’ and returns the set { tk ∈ D’ | ∀ xi ∈ I’, (tk , xi ) ∈ R }; where R⊆ D×I is a binary relation with I’⊆ I, D’⊆ D. The function h is commonly called Galois closure operator [6],[13]. Now it is worth of reminding that: i. an itemset X ⊆ I is closed iff h(X)=X. Moreover from [13] it follows that X is closed if none of its supersets has the same support as X (i.e., s(h(X))=s(X)); ii. given an itemset X and xi ∈ I: g(X)⊆ g( xi ) ⇔ xi ∈ h(X) [12]; The concept of closed itemset is fundamental in this work: we extend it to the case of taxonomically organized items for removing redundant multiple level itemsets. In formal terms we have: Definition 1 (Multiple Level Closed Itemset). Let Y be a multiple level itemset, Y= y1 , y2 ,. . . , yj ,. . . ,yh , it is called multiple level closed itemset(mCI) iff: 1. item w, w ∈ I, w ∈ / Y, w is not supertype of yj ∈Y; 2. dX (Y) ≥ dXi (w); 3. | g(Y) - g(w )| ≤ , positive user-defined threshold (when =0 g(Y) ⊆ g(w )). It can be argued in this way. Assume =0 (for the sake of simplicity) and suppose an item w which is neither contained in Y and nor supertype of yj (yj ∈Y), and Y y1 , y2 ,. . . , yh a closed itemset where g(Y)⊆g(w). This means that an itemset Y’ so composed y1 , y2 ,. . . , yh , w exists and meets the relationships g(Y’)=g( y1 , y2 ,. . . , yh , w)= g(Y) ∩ g(w). Since g(Y) ∩ g(w)=g(Y) it follows that s(Y’)= s(Y), and then, according to the relation s(h(Y))=s(Y), Y is not closed itemset: this is in contradiction with the initial hypothesis, and then validates the conditions in Definition 1. Moreover, by the third condition of Definition 1 we know that an itemset X is redundant w.r.t Y if Y is supported by at least the same number of transactions that supports X, and if X and Y are supported from common transactions (respect to the threshold). An explanatory toy example follows. Example 1. Consider the taxonomy in Figure 2, the transaction set in Table 1, as 0 and the item w=B. It can be derived that the itemset A is a mCI while the itemset A11 ,A12 does not. Indeed, since g(A) is {1,2,3,4,5,6} and g(B) is {1,2,3,5} the relationship g(A)⊆ g(B) does not hold, hence B ∈ / h(A), finally A is a mCI (point ii presented above). On the contrary, g(A11 ,A12 )={3,5}, hence the relationship g(A11 ,A12 )⊆ g(B) does hold, then B ∈ / h(A11 ,A12 ), finally A11 ,A12 is not a mCI. Trivially it holds for any subtype of B.
256
C. Loglisci and D. Malerba
Table 1. A toy set of boolean transaction on leaf items. The taxonomic arrangement is reported in Figure 2. Transaction 1 2 3 4 5 6
A11 1 0 1 1 1 0
A12 0 1 1 0 1 0
B11 1 0 1 0 1 0
B12 1 0 1 0 1 0
A1 1 1 1 1 1 0
A2 1 0 0 1 1 1
B1 1 0 1 0 1 0
B2 1 1 1 0 1 0
A 1 1 1 1 1 1
B 1 1 1 0 1 0
Fig. 2. Representation of the taxonomy over the items for the example in Table 1
Considering the formal framework thus far described, the problem of interest in this work can be formulated as follows: Given: D transaction set, I set of items, G a taxonomy over the set I, minsup, minconf user-defined minimal thresholds for support and confidence, and as user-defined value. Goal: 1. Finding out the set of multiple level closed frequent itemsets (mCFI) w.r.t. to such that the support of each mCFI exceeds minsup; 2. Generating the set of multiple level association rules (mAR) exceeding the value of minconf. In the following two sections we present the computational solutions to these sub-problems.
4
An Algorithm for Finding Out the Multiple Level Closed Frequent Itemsets
The first sub-problem is here solved with an algorithm, named DELIS (DEscending Levels and Increasing Size) which scans the taxonomy in top-down way while, at each level, generates mCFIs with increasing size (or length). In particular the scanning strategy is based on a suitable order relation ”≺” over the items, which meets the following:
Mining Multiple Level Non-redundant Association Rules
given a set of items
1
257
I, a taxonomy G over I, a function O: I→ {1,2,. . . ,|I|},
i. xi ≺ xj (∀ xi ,xj ∈ I) iff dXi (xi ) = dXi (xj ) and O(xi )< O(xj ) (e.g., in Figure 2 O(A1 )=2, O(A2 )=6: A1 ≺ A2 ); ii. xi ≺ xik ≺ xj (∀ xi ,xj ∈ I, ∀ xik direct subtype of xi ) iff dXi (xi )= dXi (xj ), dXi (xik )= dXi (xi )+1 and O(xi ) < O(xik )< O(xj ) (e.g., in Figure 2 O(A1 )=2, O(A2 )=6, O(A11 )=3: A1 ≺ A11 ≺ A2 ); m iii. given two itemsets X1 , X2 , |X1 | = |X2 | = n, dX (X1 )≤ dX (X2 ), ∃ xm i xj , m m m≤n, the m-th items of X1 , X2 respectively, xi ≺ xj : X1 ≺ X2 . As we will see in the description of the algorithm 1 such a order relation permits: – to return frequent itemsets only if their supertypes are frequent too; – to generate frequent k-itemsets by joining frequent (k-1)-itemsets; – to consider closed frequent itemsets only if their supertypes are closed frequent too. At this aim the algorithm exploits the theorem reported below. Theorem 1. Let X1 = x1 , x2 ,. . . , xj ,. . . ,xh be an itemset, X’1 = x1 , x2 ,. . . , x’j ,. . . ,xh its subtype itemset, xj supertype of x’j , X1 is not a closed itemset: X’1 is not a closed itemset. Proof : Assume as 0 and X1 is not a closed itemset. By resorting to the point ii in the previous section it is possible considering that it exists an item of I xk ∈ / X1 such that the relationship g(X1 ) ⊆ g(xk ) holds and xk ∈ h(X1 ). Hence it follows that g(X’1 ) ⊆ g(X1 ) ⊆ g(xk ), namely g(X’1 ) ⊆ g(xk ) holds, where xk ∈ h(X’1 ) and xk ∈ / X’1 : it results that X’1 is not a closed itemset. Finally we report an high-level description of DELIS algorithm (see Algorithm 1), where – h represents the size of the current multiple level frequent itemsets; – k indicates the current level of the taxonomy; – mFIh,k represents a set containing the multiple level frequent itemsets with size h generated at the level k ; – mCFI represents a set containing the multiple level closed frequent itemsets; – lowerLevel(G,k) returns the lower level of the taxonomy G w.r.t. the level k ; – frequentItemsGeneration(k) returns frequent items at level k. It also exploits the set of frequent items in mFI1,k−1 if k=lowerLevel(G,); – join(Xi ,Xj ) joins a la apriori Xi ,Xj [1], where Xi ≺ Xj ; – frequentItemsetsGeneration(mFIh,k−1 ) returns the multiple level frequent itemsets with size h generated at the level k: ∀ Xi ∈ mF I h,k−1 ’ Xi = join(X’i ,X”i ), |X’i |=|X”i |=h-1, dX (X’i )≤k-1 and dX (X”i )≤k-1, it joins Xdi with X”i , Xdi with X’i and Xdi with Xdi where dX (Xdi )=k or dX (Xdi )=k, Xdi subtype itemset of X’i , Xdi subtype itemset of X”i respectively; 1
Without generality loss, we assume that the set of transactions D contains only leaf-items of the taxonomy.
258
C. Loglisci and D. Malerba
– getClosed(mFIh,k ) returns the multiple level closed frequent itemsets from mFI by exploiting Definition 1 and Theorem 1; – getGenerator(mCFIh,k ) returns the generator frequent itemsets from mFIh,k . We provide the definition of generator in the next section, where it is also used for the rule generation; – CG indicates the total set of generators; – leafItemLevel represents the lowest level of the taxonomy. A trace of the algorithm 1 follows. Consider the toy example reported in Table 1, minsup=0.4 and for the sake of simplicity =0. First (step 4), given h=1 and k=taxonomy level 1, mFIh,k = {A,B} is determined. Then (steps 5-12), at the current level of taxonomy, the frequent itemsets with larger length are determined (i.e., mFIh,k ={A,B}), moreover, by exploiting the Definition 1, the step 10 finds the closed ones (i.e., {A,A,B}) that will be also used at k=taxonomy level 2. From the step 13 the algorithm iteratively goes down the lower levels of taxonomy by scanning only the direct subtypes of closed frequent itemsets: at the step 15, given h=1 and k=taxonomy level 2, mFIh,k ={A1 , A2 }, B2 } where A2 , B2 are not closed because of A1 . Then, at the step 21 with h=2 mFIh,k consists of {A1 ,A2 ,A1 ,B2 }, A2 , B2 }, and at the step 23 the algorithm updates mFIh,k with {A,B2 ,A1 ,B,A2 ,B}. By exploiting Definition 1 at the step 24 only the frequent itemsets which are subtype of closed itemsets have to be proven as closed. Finally, given h=3 mFIh,k consists of {A1 ,A2 ,B2 ,A1 ,A2 ,B}.
5
Operative Definitions for Generating Multiple Level Minimal Association Rules
The algorithm DELIS generates the mCFIs which indeed constitute a subset of the complete set of mFIs: in [5] has been proven that mFIs can be inferred from mCFIs. Moreover, the rules generated with mCFIs (mAR) are indeed a minimal set of those discovered with mFIs, or, in other words, they are a non-redundant subset. In formal terms the notion of minimal rule [5] states that Definition 2 (Minimal Rule). An association rule R1 :A1 ⇒C1 is called minimal iff there does not exist R2 :A2 ⇒C2 with s(R1 )=s(R2 ), c(R1 )=c(R2 ), A2 ⊆ A1 , C1 ⊆ C2 . Now, our interest is to generate the rules from mCFIs and at this aim we provide a set of operative definitions. In literature it has also been proven that the mAR can be seen as the union of two sets, namely ExactRuleSet and ApproximateRuleSet respectively, which are computed by exploiting the following: Definition 3 (ExactRuleSet and ApproximateRuleSet). Let mCFI be the set of multiple level closed frequent itemsets and CG be the set of generators: the sets of rules ExactRuleSet and ApproximateRuleSet consist of rules with confidence respectively equals to 1 and smaller than 1. They correspond to the sets
Mining Multiple Level Non-redundant Association Rules
259
{Y⇒X-Y| X ∈ mCFI ∧ Y∈ CG ∧ Y generator of X ∧ s(X) = s(Y)} {Z⇒X-Z| X ∈ mCFI ∧ Z ∈ CG ∧ h(Z)⊆h(X)} A generator is defined as: Definition 4 (Generator Itemset). Let X be a closed itemset, and h the Galois closure operator: an itemset Y, |Y| >1, Y⊆I, is called generator of X (closed itemset) iff h(Y)=X and Y’⊆I with Y’⊂ Y such that h(Y’)=X and s(Y’)=s(Y). Operatively ExactRuleSet is produced with rules of the form Y⇒X-Y where the relationships s(X)=s(Y) and h(X)=h(Y) hold, given c(Y⇒X-Y)= s(X∪Y)/ s(Y)=1, Y⊂ X and X as a mCI. Moreover, rules of the form Z ⇒X-Z can be derived from X and Y⇒X-Y considering that for each Z, where Y⊂ Z⊂ X, the relationship s(Z)=s(X)=s(Y) has to hold. Analogously, ApproximateRuleSet is composed of rules of the form Y⇒X-Y where the relationships s(X)<s(Y) and h(Y)⊆h(X)=X hold, given c(Y⇒X-Y)=s(X∪Y)/s(Y)<1, Y⊂ X. Additionally, rules of the form Z⇒X-Z can be considered as approximate if h(Z)⊆h(X) is true for each Z, Z⊂ X. By implementing this framework of definitions, our two-fold approach returns a subset of the total set of valid rules: it contains only those non-redundant, namely the rules that represent the greater set. More interestingly, the latter can be derived from the former.
6
Experiments
In this section we illustrate the application of the mining of non-redundant rules for discovering meaningful, and non-superfluous relationships among named entities reported in biomedical literature. The arrangement of the biomedical entities (also called MeSH terms2 , MEdical Subject Headings) in several is-a hierachies (see Figure 3) makes suitable the discovery of multiple level AR. In particular, association rules have been discovered on two datasets obtained by retrieving publication abstracts through PubMed-Medline3 with two queries: Neuropathy Ataxia Retinitis Pigmentosa and Alzheimer Drug Interactions: the first query returned 455 abstracts (NARP dataset) while the second 135 abstracts (ADI dataset). The returned abstracts have been initially annotated by the BioTeKSIBM UIM Architecture [?]. For each set a relational database was created and fed with MeSHs occurring in the corresponding set of retrieved abstracts. In particular, each transaction of a single table was associated to an individual abstract and it was described in terms of items that correspond to MeSHs. The simplest representation, namely the boolean representation, was adopted in order to represent the occurrence of a MeSH term in an abstract: NARP dataset was more sparse than ADI. 2 3
http://www.nlm.nih.gov/mesh/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
260
C. Loglisci and D. Malerba
Fig. 3. An excerpt of the pre-defined taxonomies on the MeSH items: the lower levels are more indented than the higher ones
In order to evaluate the ’goodness’ of our approach we performed a comparison with the well-known Apriori algorithm [1] along three different criteria: running times for generating both frequent itemset and association rules, the number of itemsets and association rules, and the information expressed by the rules. Frequent itemsets and valid multiple level AR were generated for both algorithms by varying minsup value (while only for DELIS algorithm, but for the sake of brevity here we report the simpliest case: =0).
Fig. 4. Running times on the varying of the minimum support
Execution time of both algorithms for the two dataset is reported in Figure 4 where we can observe that the Apriori algorithm requires more execution time on the sparse datasets while running times of DELIS seem to be not affected by the dataset. Data sparseness usually implies a low correlation in data and however, the correlation is more evident on the lowest level of the taxonomies. Apriori does not take into account this information since it basically combines frequent itemsets. On the contrary, DELIS generates itemsets at low levels of the taxonomies and it proceeds by generating subtype itemsets from their supertypes at more and more deep levels of the taxonomies. This allows DELIS to gain time when the mining process is performed on sparse datasets. Considering the number of frequent itemsets that have been generated (see Figure 5), we find that by decreasing minsup in both data sets, DELIS generates a lower number of frequent itemsets than Apriori. Moreover, in the case of ADI, DELIS generates less itemsets than in the case of NARP. This is due to the the fact that ADI is more dense than NARP, hence the positive effect of the closed itemsets generation is more evident. Actually, the mechanism subsumed by the closed itemset discovery allows to generate compact subsets of itemsets by preserving information on correlation among items. Consequently, the number of
Mining Multiple Level Non-redundant Association Rules
261
Fig. 5. Number of generated itemsets on the varying of the minimum support
association rules that have been generated is also reduced (see Figure 6). For instance, by scanning the rules reported below discovered for ADI (where the effect of redundancy elimination can be more evident) benefits coming from the use of the DELIS algorithm can be observed. Apriori rules: (minsup=0.3 and minconf=0.5 ) R82 : Alzheimer Disease ⇒ Chemical and drugs, s=0.53, c=0.79 R1343 : Alzheimer Disease ⇒ Biological sciences,Chemical and drugs, s= 0.35, c=0.52 R6004 : Alzheimer Disease, Mental disorders ⇒ Biological sciences, Chemical and drugs, s= 0.35, c=0.52 R6009 : Alzheimer Disease ⇒ Biological sciences,Chemical and drugs,Mental disorder, s=0.35, c=0.52 R6253 : Alzheimer Disease, Tauopathies ⇒ Biological sciences,Chemical and drugs, s=0.35, c=0.52 R6259 : Alzheimer Disease ⇒ Biological sciences,Chemical and drugs,Tauopathies, s=0.35, c=0.52 R10739 : Alzheimer Disease,Mental disorders,Tauopathies ⇒ Biological sciences,Chemical and drugs, s= 0.35, c=0.52 R10751 : Alzheimer Disease,Tauopathies ⇒ Biological sciences,Chemical and drugs,Mental disorders, s=0.35, c=0.52 R10752 : Alzheimer Disease,Mental disorders ⇒ Biological sciences,Chemical and drugs,Tauopathies, s=0.35, c=0.52 R10758 : Alzheimer Disease ⇒ Biological sciences,Chemical and drugs,Mental disorders,Tauopathies, s=0.35, c=0.52
DELIS rules: (minsup=0.3 and minconf=0.5 ) R5145 : Alzheimer Disease⇒Chemical and drugs,Mental disorders,Tauopathies, s=0.53, c=0.79 R5759 : Alzheimer Disease⇒Biological sciences, Chemical and drugs, Mental disorders,Tauopathies, s=0.35, c=0.52
As we can see, Apriori discovers a set of the rules which convey the same information: R1343 , R6009 , R6259 can be subsumed by R10758 which however expresses additional information. Differently, DELIS algorithm mines a reduced set of rules than the previous ({R5759 }), prunes the redundant rules and returns only the more informative one. Moreover, by reminding that association rules are monotone with respect to pruning on the consequent side (i.e., if X ⇒ Y ∪ {A} is valid, then the rule X ⇒ Y is valid too) some considerations can be drawn. Rules R1343 , R6004 , R6009 , R6253 , R6259 , R10739 , R10751 , R10752 extracted by Apriori can be obtained by the rule R10758 , since they have the same support and confidence and this rule results to have the shortest antecedent and the longest consequent. This means that moving an item from the consequent side to the antecedent side the meaning of the rule is not affected when the resulting rule has the same support
262
C. Loglisci and D. Malerba
Fig. 6. Number of the discovered rules on the varying of the minimum support
Algorithm 1. Top-level description of DELIS algorithm. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27:
input: D, G, minsup; output:mCF I, CG; mCF I ← ; CG ← ; mF I h,k ←; k ←lowerLevel(G,); h ←1; mF I h,k ← f requentItemsGeneration(k); while mF I h,k = do h ← h + 1; mF I h,k ←; for each Xi , Xj ∈ mF I h−1,k and Xi = Xj do mF I h,k ← mF I h,k ∪ join(Xi ,Xj ); end for mCF I ← mCF I ∪ getClosed(mF I h,k ); CG ← CG ∪ getGenerator(mF I h,k ); end while while k ≤ leaf ItemLevel do h ← 1; k ← lowerLevel(G, k); mF I h,k ← f requentItemsGeneration(k); mCF I ← mCF I ∪ getClosed(mF I h,k ); CG ← CG ∪ getGenerator(mF I h,k ); while mF I h,k = do h ← h + 1; mF I h,k ←; for each Xi , Xj ∈ mF I h−1,k and Xi = Xj do mF I h,k ← mF I h,k ∪ join(Xi ,Xj ); end for mF Ih,k ← mF Ih,k ∪ f requentItemsetsGeneration(mF Ih,k−1); mCF I ← mCF I ∪ getClosed(mF I h,k ); CG ← CG ∪ getGenerator(mF I h,k ); end while end while
and confidence. Therefore, by running DELIS, a single rule, that is the R5759 rule, is obtained and it is representative as well. On the contrary, in the case of the rule R82 , although the consequent of the rule is contained in the consequent of the rule R10758 , both of them should be preserved since the confidence and support values are not different, and hence contains different information. For this reason, DELIS discovers another rule, R5145 , that has the same support
Mining Multiple Level Non-redundant Association Rules
263
and confidence of the R82 extracted by Apriori, and as consequent a superset of the rule R82 . By compacting the consequent of the R5145 rule the rule R82 is obtained.
7
Related Work
Literature shows that much work has been dedicated for eliminating redundancies in the mining process only after that the all possible AR (included those redundant) have been discovered. Pruning redundancies during the generation of AR is a strategy which has attracted the attention of few researchers. First of all, Pasquier et al.[13] solve only the problem of reducing the number of frequent itemsets. They resort to the Galois closure mechanism for creating a smaller itemsets lattice and searching within those non-redundant. This allows to reduce the set of final frequent itemsets. The same authors exploit that idea for the step of rule generation [5]: here the set of valid rules is generated considering two base sets which contain the most informative rules, namely those with maximal antecedent and minimal consequent. Interesting alternatives for the second step of AR mining have been proposed by Zaki[16] and Kanda et al.[10]. In the former the non redundant rules are discovered as those most general: rules which have less items w.r.t. others with identical support and confidence are returned. Although more compact, however these rules often do not represent the most informative ones. Kanda et al. [10] focus on the rule generation and propose a concept of approximate redundancy: the non-redundant AR are those rules from which all valid ones can be approximately reconstructed. However, in these works the problem of redundancy is basically studied only for the second step of AR mining, and, moreover, the analysis is not extended to the case more general of items organized over taxonomies. Indeed, this aspect is considered in [15] only for finding out frequent itemsets. Here the concept of generalized closed itemsets is introduced to deal with items at several levels of generalization. Thus, the authors describe a technique for extracting generalized frequent closed itemsets under some constraints and conditional properties, while nothing is provided about the removing of redundancies on the generalized rules. Differently, in our work redundancy is pruned on both generalized rules and frequent itemsets.
8
Conclusions
The presence of redundant information in the association rules (or more generally, in the patterns) can be misleading for the task of the end-user and, thus, can cause the failure of a data mining project. In this paper we faced this issue operating on the process of discovery, namely directly in each of the two steps of AR mining. Moreover, the proposed computational solution takes into account of situations more general and often real, namely when background information on the items is available. The basic idea was of pruning redundancies on both frequent itemsets and association rules while these are found out. Experiments showed that the advantages of our approach are reflected in the generation of
264
C. Loglisci and D. Malerba
reduced sets of itemsets and rules, the determination of more significant itemsets and rules, and finally the lowering of computational costs. This could be beneficial for the understandability of the results of a data mining process. Acknowledgments. This work is supported in partial fulfillment of the research objectives of “Ateneo 2009” project “Modelli e metodi computazionali per la scoperta di conoscenza in dati biomedici“ and “FAR” project “MBLab Laboratorio di bioinformatica per la biodiversita’ molecolare. The authors are grateful to the anonymous reviewers for their helpful comments.
References 1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th Int’l Conference on Very Large Data Bases, pp. 478–499 (1994) 2. Agrawal, R., Srikant, R.: Mining Generalized Association Rules. In: Proc. of the 21st Int’l Conf. on Very Large Databases, pp. 407–419 (1995) 3. Ashrafi, M., Taniar, D., Smith, K.: Redundant Association Rules Reduction Techniques. In: Zhang, S., Jarvis, R.A. (eds.) AI 2005. LNCS, vol. 3809, pp. 254–263. Springer, Heidelberg (2005) 4. Baralis, E., Psaila, G.: Designing Templates for Mining Association Rules. J. Intell. Inf. Syst. 9(1), 7–32 (1997) 5. Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., Lakhal, L.: Mining minimal non-redundant association rules using frequent closed itemsets. In: Palamidessi, C., Moniz Pereira, L., Lloyd, J.W., Dahl, V., Furbach, U., Kerber, M., Lau, K.-K., Sagiv, Y., Stuckey, P.J. (eds.) CL 2000. LNCS, vol. 1861, pp. 972–986. Springer, Heidelberg (2000) 6. Davey, B.A., Priestley, H.A.: Lattices Theory. Introduction to Lattices and Order, 4th edn. Cambridge University Press, Cambridge (1994) 7. Han, J., Fu, Y.: Mining Multiple-Level Association Rules in Large Databases. IEEE Trans. Knowl. Data Eng. 11(5), 798–804 (1999) 8. Hernandez Palancar, J., Fraxedas Tormo, O., Feston Cardenas, J., HernandezLeon, R.: Distributed and Shared Memory Algorithm for Parallel Mining of Association Rules. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 349–363. Springer, Heidelberg (2007) 9. Hilderman, R.J., Hamilton, H.J.: Knowledge Discovery and Interest Measures. Kluwer Academic, Boston (2002) 10. Kanda, K., Haraguchi, M., Okubo, Y.: Constructing Approximate Informative Basis of Association Rules. Discovery Science, 141–154 (2001) 11. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A.I.: Finding Interesting Rules from Large Sets of Discovered Association Rules. In: Proc. of the 3rd Int. Conf. on Information and Knowledge Management, pp. 401–407 (1994) 12. Kotsiantis, S., Kanellopoulos, D.: Association Rules Mining: A Recent Overview. GESTS Int. Trans. on Computer Science and Eng. 32(1), 71–82 (2006) 13. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Proc. of 7th Int. Conf. on DB Theory, pp. 398–416 (1999)
Mining Multiple Level Non-redundant Association Rules
265
14. Sharma, L.K., Vyas, O.P., Tiwary, U.S., Vyas, R.: A Novel Approach of Multilevel Positive and Negative Association Rule Mining for Spatial Databases. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 620–629. Springer, Heidelberg (2005) 15. Sriphaew, K., Theeramunkong, T.: Mining Generalized Closed Frequent Itemsets of Generalized Association Rules. In: Proc. of 7th Int. Conf. on Knowledge-Based Intell. Inf. and Eng. Systems, pp. 476–484 (2003) 16. Zaki, M.J.: Generating non-redundant association rules. In: Proc. of The Ninth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 34–43 (2000)
Pattern Mining with Natural Language Processing: An Exploratory Approach Ana Cristina Mendes1 and Cl´ audia Antunes2 1
Spoken Language Systems Laboratory - L2 F/INESC-ID Instituto Superior T´ecnico, Technical University of Lisbon R. Alves Redol, 9 - 2o – 1000-029 Lisboa, Portugal [email protected] 2 Department of Computer Science and Engineering Instituto Superior T´ecnico, Technical University of Lisbon Av. Rovisco Pais 1 – 1049-001 Lisboa, Portugal [email protected] Abstract. Pattern mining derives from the need of discovering hidden knowledge in very large amounts of data, regardless of the form in which it is presented. When it comes to Natural Language Processing (NLP), it arose along the humans’ necessity of being understood by computers. In this paper we present an exploratory approach that aims at bringing together the best of both worlds. Our goal is to discover patterns in linguistically processed texts, through the usage of NLP state-of-the-art tools and traditional pattern mining algorithms. Articles from a Portuguese newspaper are the input of a series of tests described in this paper. First, they are processed by an NLP chain, which performs a deep linguistic analysis of text; afterwards, pattern mining algorithms Apriori and GenPrefixSpan are used. Results showed the applicability of sequential pattern mining techniques in textual structured data, and also provided several evidences about the structure of the language.
1
Introduction
Recent years have witnessed a great increase of the information available to the main public: information that was formely only accessible on books is now everywhere, and digitally available to everyone. This new paradigm, mainly due to the leverage of the Web, triggered the emergence of several research areas, namely: Information Extraction, Information Retrieval and Question-Answering, in the field of the Natural Language Processing (NLP), and Text Mining in the Knowledge Discovery and Data Mining (KDD) field. Bridging the gap between humans and computers has always been the goal behind NLP, by attempting to provide the laters with the ability of understanding and interacting using natural language. Even if this goal was not completely achieved, yet, a significant difference can be perceived between techiques that use NLP and other data processing techniques: the first ones have a much deeper understanding of language. P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 266–279, 2009. c Springer-Verlag Berlin Heidelberg 2009
Pattern Mining with Natural Language Processing
267
The need for discovering unknown knowledge in text documents set off the interest in Text Mining. Also known as Knowledge Discovery in Text (KDT), it aims at discovering and extracting knowledge from textual data, presented in a semi-strutured and unstructured way, and “automatically identifying interesting patterns and relationships” [1]. Text Mining techniques have a surface knowledge about the language in which data is presented (for instance, in terms of its stopwords and stemming procedures); usually, they do not possess or deal with any knowledge about the structure of the language. In this paper we study the application of pattern mining algorithms in textual stuctured data. We aim at going a step further from what it is nowadays accomplished with Text Mining. Our hypothesis is that pattern mining over written data will benefit if we take advantage of the deeper understanding of the language that NLP provides. Our approach relies on pattern discovery over the processed texts using state-of-the-art NLP tools; therefore, in this paper we present the results of a set of studies concerning the usage of typical pattern mining algorithms over text data to which was included linguistic knowledge. Since NLP is the starting point in our pipeline, traditional pre-processing techniques, based on bag-of-words approaches, can not be applied; raw text is used instead. Natural language is ambiguous and profits from an incomparable richness and variety; hence, the usage of NLP can imply the introduction of certain errors, namely misclassification of words in morphological classes and wrong division of sentences into its syntatic constituents. Performance, however, should not be considered an obstacle to NLP anymore due to the massification of applications implemented in distributed environments. The remainder of this paper is organized as follows: Section 2 makes a literature review on this topic; Section 3 formalizes the problem of mining patterns in textual structured data; Section 4 describes the studies we performed on a real-world dataset. The paper finishes in Section 5, where conclusions are drawn and future work directions are presented.
2
Literature Review
Data Mining is an interdisciplinary field that brings together techniques from statistics, database, artificial intelligence and machine learning. Data Mining is all about mining patterns in large databases in order to extract hidden knowledge. It gave its first steps in the business domain, when companies experienced an enormous increase of their structured data digitally stored. A different perspective to pattern mining arose with the outgrowth of large amounts of unstructured text data: Text Mining. Although common approaches to Text Mining are based on a flat representation of words (usually with little linguistic processing), the close interaction between Text Mining and NLP is an indesputable fact. They share a strong common ground: text presented in an unstrutured way, written in natural language.
268
A.C. Mendes and C. Antunes
Recent discussion suggests that the integration of linguistic concepts and NLP techniques to benefit Text Mining applications is still in its fresh start. Not long ago, in 2005, the usefulness of NLP in Text Mining was an open issue [2,3] and an hard question to be answered. The general conclusion arose that more experiments were still necessary. Researchers could not agree on a common opinion about whether NLP helps or not the results already achieved by Text Mining. There is a significant amount of approaches to the problem of Text Mining stated in the literature that claim the use of results from NLP. For instance, NLP tags are used in the InFact [4] system for text analysis and search. Linguistic knowledge, namely grammatical and syntactic roles, is used to build subjectaction-object triples, which are then utilized as indexes. The subsequent search is based on a linguistically enriched query language and allows users to search for actions, entities, relationships and events. However, and unlike the InFact system, common approaches do not make direct manipulation of NLP morpho-syntactic tags to benefit Text Mining and search. They rather employ techniques borrowed from Information Extraction (IE) or are related with Text Classification. In what concerns IE, it regards the extraction of specific data from natural language texts, often by applying NLP techniques, like named entities recognition. The usage of IE in Text Mining can be seen especially in the field of bioinformatics [5,6]. Concerning the Text Classification task, it aims at assigning a correct class to previously unseen records, given a collection of already labeled records. Text Classification is studied, for instance, in [7], in which pattern mining over syntactic trees is used to make an opinion analysis of documents. Text Mining techniques for document classification are applied mainly to spam-filtering. Although the previously cited works only regard the implications of NLP in pattern mining, it is worth to mention that the interaction between both is, in fact, bidirectional. The work by [8,9], for instance, devise methods for named entity recognition and co-reference resolution using association rules.
3
Problem Definition
Natural languages can be formally modelled by Context-Free Grammars (CFG). A CFG is a tuple G = (N, Σ, R, S), where: N is a set of non-terminal symbols; Σ is a set of terminal symbols; R is a set of rules, each of the form A → β, in which A is a non-terminal and β is a string of symbols from the set (Σ ∪ N ); and S ∈ N is the start symbol. A CFG can be thought of in two ways: a generator of sentences or a device that assigns a structure to a sentence [10]. Besides, notice that, in the NLP domain, the set of terminals is related with the lexicon, or vocabulary, constituted by the words and symbols of a given natural language. Also, non-terminals represent either lexical classes of words (Part-Of-Speech (PoS)1 ) or multi-word constituents of the language 1
Traditional grammar classifies words according to eight PoS, or lexical classes: verb, noun, pronoun, adjective, adverb, preposition, conjunction, and interjection.
Pattern Mining with Natural Language Processing
269
Grammar G0 S→ NP → VP → PP →
NP VP | NP VP PP P ronoun | N oun | Determiner N oun | Determiner Adjective N oun V erb | V erb NP P reposition NP
Lexicon for G0 N oun → P ronoun → V erb → Adjective → Determiner → P reposition →
car | man | plane | morning | panter | airport I | me | you | it saw | drove | want | like big | pink | green | tall the | a | an | this on | to | f rom
Fig. 1. Set of rules for the grammar G0 and its lexicon
(chunks). 2 Figure 1 shows a lexicon and a set rules that belong to a sample grammar G0 3 . A labeled rooted tree is defined as an acyclic connected graph, denoted as the quintuple T = (V, E, Σ, L, v0 ) where: V is the set of nodes; E is the set of edges that connect nodes; Σ is the set of labels; L : V → Σ is the labeling function, that assigns labels from Σ to nodes in V ; and v0 is the root node. The level of a node v is the length of the shortest path from v to the root node v0 . Each node can be either a leaf or an internal node: leafs have no children and internal nodes have one or more child nodes. An ordered labeled rooted tree is a labeled rooted tree where the children of each internal node are ordered. A natural language sentence can be generated by a sequence of rule expansions, commonly represented by a parse tree. Thus, a parse tree is defined as an ordered labeled rooted tree τ , in which v0 = S and tree nodes are labeled according to the morphological and syntactic classes of each corresponding unit. Figure 2 shows the tree structure of the sentence “I saw a pink plane on the airport” according to G0 . Mining patterns using algorithms which do not focus on this specific data structure (like sequential pattern mining algorithms) implies a transformation of parse trees into other structures with different characteristics. In this context, the l-sequence of an ordered tree is the preorder transversal label sequence of all nodes; the preorder transversal label-level sequence, l2 -sequence, is the cannonical representation of a tree, composed not only by the label of each node, but also by its level in the tree. Both concepts are introduced in [11], where authors also prove the uniqueness of l2 -sequences. 2 3
For the sake of simplicity, we will refer to the terminal and non-terminal symbols of a given sentence as lexical units and morpho-syntactic units, respectively. Symbols NP (noun phrase), VP (verb phrase) and PP (prepositional phrase) are language constituents (chunks) with a specific syntactic function within a sentence. Readers interested on more information about this topic are suggested to refer to [10].
[S [NP [Pronoun I]] [VP [Verb saw] [NP [Determiner a] [Adjective pink] [Noun plane]]] [PP [Preposition on] [NP [Determiner the] [Noun airport]]]]
Fig. 2. Parse tree of “I saw a pink plane on the airport” according to G0
(Diagram: Raw Text Data → Natural Language Processing → Textual Structured Data → Pattern Mining → Patterns → Knowledge.)
Fig. 3. Our exploratory approach to discovering patterns in text
In the l-sequence of a parse tree each item has a linguistic meaning and relates to a morpho-syntactic or semantic property of a given unit within a sentence. Likewise, the l2-sequence of a parse tree provides the knowledge of, and a unique representation of, its structure. Therefore, transforming sentences into l-sequences (or l2-sequences) represents the transformation of the unstructured dataset into its corresponding structured version. Given an input sentence S, the preorder traversal sequence of its associated parse tree is a new sentence S′ provided with deeper linguistic knowledge, which captures the morpho-syntactic structure of S. We define the sentence S′ as a textual structured sentence, since its base constituents refer to natural language units, either lexical (words) or morpho-syntactic (PoS and/or chunks). Given a database D of transactions and a user-specified minimum support σ, a pattern is said to be frequent if it is contained in at least σ transactions in D.
Problem statement. The application of pattern mining algorithms over textual structured data provides deeper knowledge about a natural language's inner organization than traditional Text Mining techniques. Given a text database and a user-specified minimum support, the problem of mining textual structured data is to discover the frequent structures of a natural language. Accordingly, this exploratory study attempts to bring the understanding of natural language structure that NLP provides to pattern mining. We propose to test typical algorithms on textual structured data in order to discover patterns on structured data. The high-level view of the proposed approach is shown in Figure 3.
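As an illustration of this transformation (a sketch under our own tree encoding, not the authors' implementation), the following Python fragment computes the l-sequence and the l2-sequence of the parse tree in Fig. 2 by a preorder traversal; trees are represented as (label, children) tuples.

tree = ("S",
        [("NP", [("Pronoun", [("I", [])])]),
         ("VP", [("Verb", [("saw", [])]),
                 ("NP", [("Determiner", [("a", [])]),
                         ("Adjective", [("pink", [])]),
                         ("Noun", [("plane", [])])])]),
         ("PP", [("Preposition", [("on", [])]),
                 ("NP", [("Determiner", [("the", [])]),
                         ("Noun", [("airport", [])])])])])

def l_sequence(node):
    """Preorder traversal label sequence of an ordered labeled rooted tree."""
    label, children = node
    seq = [label]
    for child in children:
        seq.extend(l_sequence(child))
    return seq

def l2_sequence(node, level=0):
    """Preorder label-level sequence (the canonical representation of the tree)."""
    label, children = node
    seq = [(label, level)]
    for child in children:
        seq.extend(l2_sequence(child, level + 1))
    return seq

print(" ".join(l_sequence(tree)))
# S NP Pronoun I VP Verb saw NP Determiner a Adjective pink Noun plane ...
print(" ".join(f"{label}-L{level}" for label, level in l2_sequence(tree)))
# S-L0 NP-L1 Pronoun-L2 I-L3 VP-L1 Verb-L2 saw-L3 ...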
4 Case Study
Experiments described in this section were conducted on a dataset composed of news articles from the Portuguese daily newspaper Público, dated from the years 1994 and 1995. Since the news domain is far too broad, only the news belonging to the category “National News” were used. Moreover, from these, paragraphs with fewer than 100 characters were discarded, since these often correspond to article titles. The characteristics4 of the dataset are shown in Table 1.

Table 1. Characteristics of the dataset “National News”

Size (MB)            ≈ 27
Words (#)            ≈ 3 000 000
Distinct Words (#)   85 770
Paragraphs (#)       58 744
Sentences (#)        176 447
Text data was processed using the Spoken Language Systems Laboratory (L2F) NLP chain for Portuguese, running on a computational grid [12]. The chain involves the following processing steps: morphological analysis (by Palavroso [13]), morphological disambiguation (by MARv [14]), multi-word contraction and term splitting (by RuDriCo [15]) and, finally, chunking, named entity recognition and dependency extraction (by XIP [16]). The chain is responsible for creating the morpho-syntactically disambiguated parse tree for each sentence received as input. An illustrative example of the output of the NLP chain for the input sentence “O primeiro-ministro José Sócrates inicia uma visita oficial à Madeira.”5 is depicted in Figure 4. Thus, the parse tree used in our experiments is defined as τ, and nodes in τ are labeled according to a function that marks a node v with a PoS or a chunk if v is an internal node, or with a language word or a language symbol if v is a leaf. In addition, PoS belong to the set of symbols {NOUN, VERB, ART, ADJ, ADV, CONJ, PRON, INTERJ, PREP, PUNCT, PASTPART, REL, FOREIGN, SYMBOL, NUM}; a chunk is one of {TOP, VP, NP, ADVP, AP, PP, SC}. Considering the reference sentence, its associated textual structured sentence is the sequence: TOP NP ART O NOUN NOUN primeiro-ministro NOUN José NOUN Sócrates VP VERB inicia NP ART uma NOUN visita AP ADJ oficial PP PREP a ART a NOUN Madeira PUNCT . Since the NLP chain introduces many features in each tree node, associated with the morphological, syntactic and semantic analysis (some of which are shown in small rounded boxes in Figure 4), those that are unwanted are discarded by
4 Notice that the Paragraphs column concerns a rough text splitting by carriage return, while the Sentences column relates to the correct natural language sentences identified by the morpho-syntactic analyser. Thus, one paragraph might contain several sentences.
5 “Prime-minister José Sócrates starts an official visit to Madeira.” For clarity, this sentence will be referred to and used as the reference sentence throughout the rest of this paper.
Fig. 4. Parse tree of “O primeiro-ministro José Sócrates inicia uma visita oficial à Madeira”
a post-processing stage. This post-processing aims at building the alphabet for the mining algorithms with only the symbols relevant to our experiments. If all features were used, a much larger alphabet would be created; this, however, would lead to the discovery of many frequent noisy patterns, which would not serve the purpose of our current research.
4.1 Pattern Mining Algorithms over Textual Structured Data
There are several algorithms to mine patterns in data. They can follow specific categorizations depending on the characteristics of the input data: items in a basket, sequences, tree structures, and so forth. In these studies, however, we slightly transformed our dataset to be used as input for two pattern mining algorithms, which belong to different strands of KDD. The first algorithm employed in the experiments was Apriori [17]. Apriori is based on a candidate generation and test philosophy, following the anti-monotone property, according to which if a pattern is not frequent, then none of its supersets is frequent. In this case, each textual structured sentence is seen as a different transaction in the database and each of its components is considered a different item. The second algorithm used was GenPrefixSpan [18]. This algorithm for sequential pattern mining generalizes PrefixSpan [19] to deal with gap constraints, while maintaining a pattern-growth philosophy: it avoids the candidate generation step altogether and focuses the search on a specific subset of the initial database. Here, as with Apriori, the textual structured sentences are transactions and their components are items. However, in this case, the position of each component corresponds to a discrete time instant, so that each component occurs immediately after its left-side neighbour. This algorithm was used to mine sequences of contiguous components in textual structured sentences.
A sample input for each pattern mining algorithm is presented in Table 2. Each line contains a textual structured sentence, configured for the corresponding algorithm, as previously defined in Section 4. Since we will present results from tests over different input data, we will refer to this input configuration as the default configuration.

Table 2. Input data for each pattern mining algorithm (default configuration)

Algorithm       Input
Apriori         TOP NP ART O NOUN NOUN primeiro-ministro NOUN José NOUN Sócrates VP VERB inicia ...
GenPrefixSpan   TOP(1) NP(2) ART(3) O(4) NOUN(5) NOUN(6) primeiro-ministro(7) NOUN(8) José(9) NOUN(10) Sócrates(11) VP(12) VERB(13) inicia(14) ...
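The following brute-force Python sketch (not GenPrefixSpan itself, and using hypothetical toy transactions) illustrates what mining contiguous sequences over inputs like those in Table 2 amounts to: every contiguous subsequence of each transaction is counted once per transaction, and those reaching the minimum support are reported.

from collections import Counter

transactions = [
    ["TOP", "NP", "ART", "o", "NOUN", "presidente", "VP", "VERB", "disse", "PUNCT", "."],
    ["TOP", "NP", "ART", "a", "NOUN", "visita", "VP", "VERB", "começou", "PUNCT", "."],
    ["TOP", "NP", "NOUN", "lisboa", "VP", "VERB", "recebe", "NP", "NOUN", "visita", "PUNCT", "."],
]

def contiguous_frequent(transactions, min_support):
    """Return contiguous subsequences whose relative support reaches min_support."""
    counts = Counter()
    for t in transactions:
        seen = set()                          # count each pattern once per transaction
        for i in range(len(t)):
            for j in range(i + 1, len(t) + 1):
                seen.add(tuple(t[i:j]))
        counts.update(seen)
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

for pattern, support in sorted(contiguous_frequent(transactions, 2 / 3).items(),
                               key=lambda kv: -kv[1]):
    print(f"{support:.2f}  {' '.join(pattern)}")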
4.2 Results
The results of each pattern mining algorithm, Apriori and GenPrefixSpan, are displayed in Table 3 and Table 4, respectively. A detailed analysis of the results allows us to draw further conclusions about the language. It is worth remembering, however, that these experiments were conducted on a very distinctive dataset, composed of news articles focused on a particular domain: the “National News”. These are commonly written in a well-formed language, with reliable and formal discourse structures and few spelling errors. Tests on other datasets with different characteristics would probably lead to different results. One may argue that, regardless of the text's characteristics, it is still Portuguese we are dealing with, and its structure remains unchanged; however, some conclusions we draw shall not be extended to the language as a whole, as it is spoken and written in other real-world situations.

Table 3. Results of the Apriori algorithm

Support (Transactions)      Discovered
Min (%)     Max (%)         Patterns (#)
75          100             1 227
50          100             41 079
25          100             1 454 135
25          50              1 413 056
10          25              28 721 028
Regarding the results from the Apriori algorithm, the large increase in the number of discovered patterns as the minimum support decreases should be pinpointed and deserves careful analysis. Actually, since the language's PoS tags and chunks are few in number (≈ 20), some of which populate every or almost every sentence in the textual structured data (consider, for instance, the constituents TOP, the top-level chunk that refers to a sentence, and PUNCT, the PoS that refers
to punctuation), it is not a surprise to verify that a significant part of the frequent patterns are composed only of morpho-syntactic units and stopwords. This, however, does not seem to bring any reliable knowledge about the language's structure. Also, most of these patterns have no linguistic meaning or sense at all, since they lack order.

Table 4. Results of the GenPrefixSpan algorithm (default configuration)

Min Support                  Discovered Sequences (#)
%        # (Transactions)    Total      Size > 1
50       88 224              42         20
25       44 112              95         66
10       17 645              235        197
5        8 822               401        346
2.5      4 411               790        698
1        1 764               2 139      1 885
0.5      882                 4 962      4 405
0.25     441                 11 875     10 715
0.1      176                 36 375     33 807
Concerning the results from the GenPrefixSpan algorithm, the number of discovered sequences is much smaller than the number of patterns discovered with Apriori, given the same support. Indeed, the philosophy behind GenPrefixSpan avoids the generation of patterns with no linguistic meaning: this algorithm mined only subsequences of our textual structured sentences, which therefore have linguistic meaning. Also, it is worth mentioning that in 50% of our dataset only 22 one-unit-size sequences were discovered. Of these 22, only 6 were language symbols or language words, namely: comma, dot, determiners “a” and “o” (English word “the”), preposition “em” (corresponding to English words “in”, “on” and “at”) and preposition “de” (English word “of”). All chunks are frequent given a minimum support of 50% and, bearing in mind that the NLP chain labels nodes according to 15 different PoS, the following lexical classes were not frequent: INTERJ, interjections; PASTPART, words in the past participle; REL, relative pronouns; FOREIGN, words written in a language other than Portuguese; SYMBOL, special symbols; and NUM, numerals. A sample output of both algorithms is shown in Table 5.
Besides the default configuration, other similar experiments were conducted on the same dataset with GenPrefixSpan. These, however, differ in the amount of encoded linguistic knowledge, in order to discover how mined patterns evolve given different domain conditions. The tested configurations are listed below:
(I) The textual structured sentences contain the level of each unit; the canonical representation of each tree was used: TOP-L0 NP-L1 ART-L2 O-L3 NOUN-L2 NOUN-L3 primeiro-ministro-L4 NOUN-L3 José-L4 NOUN-L3
Table 5. Patterns discovered by each pattern mining algorithm (default configuration)

Algorithm       Pattern                                    Support (%)
Apriori         SC ADVP a de PP . PUNCT                    25.8
                CONJ AP a PRON PP . NP TOP                 31.6
                AP ADJ a PREP VP VERB NOUN PUNCT TOP       50.1
                o . PREP VP VERB                           65.4
GenPrefixSpan   NP NOUN NOUN durão NOUN barroso            0.5
                VP VERB era NP                             1.0
                PREP em NOUN lisboa                        1.1
                que VP VERB                                22.8
Sócrates-L4 VP-L1 VERB-L2 inicia-L3 NP-L1 ART-L2 uma-L3 NOUN-L2 visita-L3 AP-L1 ADJ-L2 oficial-L3 PP-L1 PREP-L2 a-L3 ART-L2 a-L3 NOUN-L2 Madeira-L3 PUNCT-L1 .-L2
(II) The same as the default configuration, but with the PoS components removed. Therefore, each textual structured sentence is composed only of language words, language symbols and chunks: TOP NP O primeiro-ministro José Sócrates VP inicia NP uma visita AP oficial PP a a Madeira .
(III) The same as the default configuration, but using only language words and symbols. Thus, instead of textual structured data, only the linguistically processed text is used as input: O primeiro-ministro José Sócrates inicia uma visita oficial a a Madeira .
(IV) Only typical Text Mining pre-processing techniques are applied: stopword and punctuation removal. The dataset is not linguistically processed, so no linguistic knowledge is included at all. Moreover, although the dataset was the same, the input for the pattern mining algorithm is slightly different: instead of each transaction having a direct correspondence with a correct natural language sentence, an entire paragraph now corresponds to a transaction, since the NLP chain is not used: primeiro-ministro José Sócrates inicia visita oficial Madeira encontra-se amanhã presidente governo regional
The two graphs in Figure 5 compare the results of applying the GenPrefixSpan algorithm to the different input configurations, both in terms of the total number of frequent sequences and in terms of frequent n-unit-size sequences (n > 1). As can be perceived from the analysis of both graphs, no significant differences exist between the number of discovered patterns using configurations Default and I. However, the cardinality of the alphabets used in the two configurations is fairly distinct: 85 792 and 116 634 symbols, respectively. This leads us to the conclusion that the structure of the language in our dataset is somewhat static, and that frequent nodes do not often change their level in the parse tree. A careful analysis of the results showed that the discovered sequences are indeed almost the same in both configurations. Differences are mainly due to two reasons: for a given minimum support value x, one sequence in
(Figure: two log-scale panels, “Sequences” and “Sequences (Size > 1)”, showing the number of discovered sequences plotted against the minimum support, from 50% down to 0.1%, for the Default, I, II, III and IV input configurations.)
Fig. 5. Number of discovered frequent sequences given different input configurations to the GenPrefixSpan algorithm
the default configuration is associated with several sequences in configuration I, and their support is higher than x; and, for a given support value x, one sequence in the default configuration is associated with more than one sequence in configuration I, but their support is lower than x. For instance, [NP PRON] (support 47.0%) is a sequence in the default configuration associated with the sequences [NP-L1 PRON-L2] (39.7%) and [NP-L2 PRON-L3] (13.0%) in configuration I. For a minimum support of 10% the latter sequence is frequent; this does not occur, however, for a minimum support of 25%. The comparison of one-unit-size frequent sequences between the default configuration and configuration I provides some clues about the level of frequent words, PoS tags and chunks in the parse tree. Ultimately, and although this is not this work's main purpose, these results can be used to uncover potential errors in the textual structured data introduced by the morpho-syntactic analyser: for instance, when it comes to conjunctions, for a minimum support of 0.1%, they are frequent one-unit-size sequences in more than 60% of the transactions ([CONJ] (61.2%)); if we also consider the levels of nodes, conjunctions are frequent when placed in levels 1, 2, 3 and 4 of the parse tree ([CONJ-L1] (54.9%), [CONJ-L2] (15.0%), [CONJ-L3] (1.0%) and [CONJ-L4] (0.2%), respectively). Given these results, one might wonder whether the last pattern is a misclassification of words in their morphological class or an error in the syntactic division of sentences, introduced by the NLP chain when processing the raw text dataset. Regarding the other configurations, greater differences exist among the results. In every configuration in which the NLP chain was applied, the number of discovered sequences got smaller with the decrease of linguistic knowledge in the input data. For instance, with a minimum support of 2.5%, 790 sequences were discovered in the default configuration, 263 in II and 131 in III. Configuration II allows us to understand how sentences are composed with regard to their larger substructures:
– in terms of those substructures' inner organization, for instance:
  • a Noun Phrase usually starts with the determiners “o” and “a” ⇒ Frequent patterns: [NP o] (support 45.2%) and [NP a] (36.1%).
  • the sequence of words “o presidente de a república” (“the president of the republic”) is a frequent Noun Phrase ⇒ Frequent pattern: [NP o presidente de a república] (0.4%)
– in terms of those substructures' interaction with other sentence constituents, for instance:
  • in almost 40% of the cases, a sentence begins with a Noun Phrase ⇒ Frequent pattern: [TOP NP] (38.8%).
  • a Noun Phrase is more likely than a Verb Phrase to appear after “é”, the inflected form of the verb to be ⇒ Frequent patterns: [é NP] (6.0%) and [é VP] (2.0%).
Although a strict comparison cannot be made between configuration IV and the others (after all, the transactions are different), the results show similar numbers of discovered sequences for configurations III and IV; however, the number of sequences with size greater than one is much lower when the stopwords and punctuation were removed. Indeed, the usage of sentences without any pre-processing technique led to the appearance of many n-unit-size sequences composed solely of those small and frequent words with no content or semantic value (stopwords) and of punctuation.
4.3 Critical Analysis
The application of pattern mining algorithms over textual structured data gave us diverse clues about the inner organization of natural language: similar results could not be achieved if only a flat representation of words were used. Also, recall that the algorithms were applied to the output of an already mature and settled NLP chain. No other knowledge was manually or automatically inserted into our test data, besides the small modifications we made in order to test the various input configurations. Moreover, these studies show that some algorithms are better suited to this specific domain than others: Apriori led to a combinatorial explosion of results, since it does not keep track of the input's order; on the contrary, GenPrefixSpan seemed to discover reliable frequent language substructures. However, the discretization of parse trees into sentences and the subsequent application of a sequential pattern mining algorithm implies the discovery of incomplete substructures (like, for instance, the sequence PP PREP a ART). In fact, other mining algorithms should be further explored in this specific domain, given the output of state-of-the-art linguistic analysers.
5 Conclusions and Future Work
In this paper we presented a series of exploratory studies conducted in order to discover patterns in text data. The notion of textual structured data was introduced as a database of text sequences which contain specific knowledge about a sentence's morpho-syntactic structure.
We took advantage of the knowledge of the language that NLP provides and used it as input for two different pattern mining algorithms: Apriori and GenPrefixSpan. On the one hand, Apriori discovered a large amount of frequent patterns; however, these had little meaning, since the order between the units within the natural language sentences was not preserved. On the other hand, GenPrefixSpan output several sequences that led to a better understanding of the natural language, Portuguese in this case. As future work we intend to explore the syntactic organization of the language and employ pattern mining algorithms over tree structures, and to test how these algorithms apply to the NLP domain. In addition, we aim to investigate the impact of including other morphological features (for example, words' gender, number and lemma) in the mining process. Also, our results are confined to the output of an NLP chain that itself encodes an understanding of the language in the form of a grammar (for Portuguese); therefore, the results reflect the structure of that grammar. Since the application of pattern mining algorithms to textual structured data is a language independent approach, we would like to test it on other languages with distinct morpho-syntactic structures.
Acknowledgments
The authors would like to thank Prof. Luísa Coheur for the thorough reviews during the writing of this paper, and Tiago Luís for the technical support and shared knowledge on parallel computing. Ana Cristina Mendes is supported by a PhD fellowship from Fundação para a Ciência e a Tecnologia (SFRH/BD/43487/2008).
References
1. Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, Cambridge (2006)
2. Kao, A., Poteet, S.: Report on KDD Conference 2004 Panel Discussion Can Natural Language Processing Help Text Mining? SIGKDD Explor. Newsl. 6(2), 132–133 (2004)
3. Kao, A., Poteet, S.: Text mining and natural language processing: introduction for the special issue. SIGKDD Explor. Newsl. 7(1), 1–2 (2005)
4. Liang, J., Koperski, K., Nguyen, T., Marchisio, G.: Extracting Statistical Data Frames from Text. SIGKDD Explor. Newsl. 7(1), 67–75 (2005)
5. Leser, U., Hakenberg, J.: What Makes a Gene Name? Named Entity Recognition in the Biomedical Literature. Briefings in Bioinformatics 6(4), 357–369 (2005)
6. Otasek, D., Brown, K., Jurisica, I.: Confirming protein-protein interactions by text mining. In: SIAM Conference on Text Mining (2006)
7. Matsumoto, S., Takamura, H., Okumura, M.: Sentiment Classification Using Word Sub-sequences and Dependency Sub-trees. In: Ho, T.B., Cheung, D., Li, H. (eds.) PAKDD 2005. LNCS, vol. 3518, pp. 301–311. Springer, Heidelberg (2005)
8. Budi, I., Bressan, S.: Association rules mining for name entity recognition. In: WISE 2003: Proceedings of the Fourth International Conference on Web Information Systems Engineering, Washington, DC, USA, p. 325. IEEE Computer Society, Los Alamitos (2003)
9. Budi, I., Bressan, S., Nasrullah: Co-reference resolution for the Indonesian language using association rules. In: Kotsis, G., Taniar, D., Pardede, E., Ibrahim, I.K. (eds.) iiWAS, vol. 214, pp. 117–126. Austrian Computer Society (2006)
10. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, ch. 12. Prentice Hall, Englewood Cliffs (2008)
11. Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., Shi, B.: Efficient pattern-growth methods for frequent tree pattern mining. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS, vol. 3056, pp. 441–451. Springer, Heidelberg (2004)
12. Luís, T.: Paralelização de Algoritmos de Processamento de Língua Natural em Ambientes Distribuídos. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal (2008)
13. Medeiros, J.C.: Análise morfológica e correcção ortográfica do português. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal (1995) (in Portuguese)
14. Rodrigues, D.J.: Uma evolução no sistema ShRep: optimização, interface gráfica e integração de mais duas ferramentas. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal (2007) (in Portuguese)
15. Paulo, J.: Extracção Semi-Automática de Termos. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal (2001) (in Portuguese)
16. Aït-Mokhtar, S., Chanod, J.P., Roux, C.: A multi-input dependency parser. In: IWPT. Tsinghua University Press (2001)
17. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB 1994: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994)
18. Antunes, C.M.: Pattern Mining over Nominal Event Sequences using Constraint Relaxations. PhD thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal (2005)
19. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In: ICDE 2001: Proceedings of the 17th International Conference on Data Engineering, Washington, DC, USA, pp. 215–226. IEEE Computer Society, Los Alamitos (2001)
Is the Distance Compression Effect Overstated? Some Theory and Experimentation
Stephen France1 and Douglas Carroll2
1 Lubar School of Business, UW – Milwaukee, 3202 N. Maryland Avenue, Milwaukee, Wisconsin, 53201-0742
2 Rutgers University, Graduate School of Management, Newark, New Jersey, 07102-3027
[email protected], [email protected]
Abstract. Previous work in the document clustering literature has shown that the Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. This unsuitability is put down to the effect of “compression” of the distances created using the Minkowski-p metrics on high dimensional data. Previous experimental work on distance compression has generally used the performance of clustering algorithms on distances created by the different distance metrics as a proxy for the quality of the distance representations created by those metrics. In order to separate out the effects of distances from the performance of the clustering algorithms we tested the homogeneity of the latent classes with respect to item neighborhoods rather than testing the homogeneity of clustering solutions with respect to latent classes. We show the theoretical relationships between the cosine, correlation, and Euclidean metrics. We posit that some of the performance differential between the cosine and correlation metrics and the Minkowski-p metrics is due to the inbuilt normalization of the cosine and correlation metrics. The normalization effect decreases with increasing dimensionality and the distance compression effect increases with increasing dimensionality. For document datasets with dimensionality up to 20,000, the normalization effect dominates the distance compression effect. We propose a methodology for measuring the relative normalization and distance compression effects. Keywords: Dimensionality, Minkowski metrics, normalization, clustering, latent classes.
1 Introduction
There has been much recent work in the data mining field on the effects of high dimensional data on various data mining algorithms and on the problems of using the Euclidean metric to calculate distances in high dimensions. This is due to the fact that nearest neighbors may be ill defined when using the Euclidean metric to calculate distances in high dimensional space. Much, but not all, of this research has focused on the performance of clustering algorithms on distances calculated using different distance metrics [3][12]. Non-clustering work includes [15], which describes how high dimensional data affect neural-network based algorithms.
The document data sets described in [3] and [12] have a very specific format. Documents are items to be clustered and words correspond to data dimensions. Each ‘word’ contained within at least one document constitutes a dimension, and the number of occurrences of a word in a document is the value for that document on the corresponding word dimension. Each document tested is a member of some underlying latent class. The latent class is often the document category; e.g., for documents that are news stories, the classes may include sports, crime, business, home, international, etc. Work on document clustering, such as [3], often uses the homogeneity of clusters with respect to underlying classes as a proxy for the quality of the clustering solutions. Solutions where clusters correspond to latent classes are considered to be of high quality. We perform a similar analysis to determine the quality of neighborhood recovery for pair-wise distances between items. High quality distance representations should have more homogenous class labels for pairs of items with short distances between them, and less homogenous class labels for pairs of items with large distances between them.
For many clustering and dimensionality reduction applications where the input data are in the form of object × stimuli data, the Euclidean metric is used to calculate input distances. The Euclidean metric is a special case of the more general Minkowski metric given in (1), with p = 2. If p = 1 the metric is known as the city-block or Manhattan metric.

d(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{d} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p}    (1)
Despite the fact that the Euclidean metric is the most common metric used in the clustering literature, it is not always the most appropriate metric to use. Recent data mining research suggests that the concept of nearest neighbors of points calculated using Euclidean distances becomes meaningless as the dimensionality of the data increases. For these data, higher dimensional distances should not be calculated using the Euclidean metric. [1] extends [2], and shows that for an arbitrary distribution of n points embedded in a space of dimensionality d,

C_p \le \lim_{d \to \infty} E\left[ \frac{D\max_d^{p} - D\min_d^{p}}{d^{1/p - 1/2}} \right] \le (n-1)\, C_p,    (2)

where p is the Minkowski metric, C_p is an arbitrary constant dependent on the metric, and Dmax and Dmin are the maximum and minimum distances between any pair of the n points. Thus, Dmax_d^p − Dmin_d^p increases at a rate of d^{1/p−1/2}, independent of the distribution. As the dimensionality d increases, for p = 1 (city-block) d^{1/p−1/2} increases at a rate of d^{1/2}, for p = 2 (Euclidean) d^{1/p−1/2} = 1, and for p > 2 d^{1/p−1/2} converges to 0. Thus, the higher the value of p, the less relative distance there is between the nearest and furthest neighbor. [1] states that the city-block p = 1 metric is preferable to the Euclidean distance metric (for high dimensional data mining applications). [12] tests several clustering procedures on high dimensional document data, using a variety of different distance metrics. The paper suggests that cosine, extended
Jaccard, and correlation distances are the most appropriate metrics for dealing with high dimensional data. The cosine distance between two items is based upon the angle at which the item vectors meet at the origin. The cosine distance is given in (3).

d(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\mathbf{x}_i \cdot \mathbf{x}_j'}{\left( \mathbf{x}_i \cdot \mathbf{x}_i' \right)^{1/2} \left( \mathbf{x}_j \cdot \mathbf{x}_j' \right)^{1/2}}    (3)
The correlation distance is the correlation subtracted from 1, giving a maximum correlation distance of 2 (when the correlation is −1) and a minimum correlation distance of 0 (when the correlation is 1). The formula for the correlation distance is given in (4).

d(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{(\mathbf{x}_i - \mathbf{1}\bar{x}_i)(\mathbf{x}_j - \mathbf{1}\bar{x}_j)'}{\left[ (\mathbf{x}_i - \mathbf{1}\bar{x}_i)(\mathbf{x}_i - \mathbf{1}\bar{x}_i)' \right]^{1/2} \left[ (\mathbf{x}_j - \mathbf{1}\bar{x}_j)(\mathbf{x}_j - \mathbf{1}\bar{x}_j)' \right]^{1/2}}    (4)
The extended Jaccard distance is a continuous extension of the discrete Jaccard distance, which measures the ratio of the number of attributes shared by both item vectors to the number of attributes possessed by either item. The formula for the extended Jaccard distance is given in (5).
d(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\mathbf{x}_i \cdot \mathbf{x}_j'}{\mathbf{x}_i \cdot \mathbf{x}_i' + \mathbf{x}_j \cdot \mathbf{x}_j' - \mathbf{x}_i \cdot \mathbf{x}_j'}    (5)
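For concreteness, the following NumPy sketch (an illustration, not code from the paper) implements the metrics in (1) and (3)–(5) for a pair of vectors x and y.

import numpy as np

def minkowski(x, y, p=2.0):
    # (1): p = 1 gives city-block, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    # (3)
    return 1.0 - np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def correlation_distance(x, y):
    # (4): mean-center each vector before taking the cosine
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - np.dot(xc, yc) / (np.sqrt(np.dot(xc, xc)) * np.sqrt(np.dot(yc, yc)))

def extended_jaccard_distance(x, y):
    # (5)
    return 1.0 - np.dot(x, y) / (np.dot(x, x) + np.dot(y, y) - np.dot(x, y))

x = np.array([2.0, 0.0, 1.0, 3.0])
y = np.array([1.0, 1.0, 0.0, 2.0])
print(minkowski(x, y, 1), minkowski(x, y, 2),
      cosine_distance(x, y), correlation_distance(x, y), extended_jaccard_distance(x, y))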
2 Experimentation
In this section we begin by testing the usage of a wide range of distance metrics on high dimensional data sets. We find somewhat unexpected results and posit that the performance differential between the Euclidean metric and the cosine and correlation metrics, previously ascribed to “distance compression”, is in fact due to the inbuilt normalization of the latter two metrics. High quality distance representations should have more homogenous class labels for pairs of items with short distances between them, and less homogenous class labels for pairs of items with large distances between them. We test this homogeneity directly using neighborhood recovery measures rather than testing the performance of clustering algorithms on the data. We do this in order to decouple the effects of the different distance metrics tested from the performance effects of the clustering algorithms used. Consider a situation where distances are calculated for very high dimensional data and the homogeneity of the resulting distance representations is tested with respect to the underlying latent classes. We consider the following research questions:
Q1: How does the use of different distance metrics affect the discriminatory power of distances for very high dimensional data? Based on the previously described
literature, one would expect that the use of Euclidean distances would cause problems with discrimination between nearest neighbors in the higher dimensional space. Testing how different distance metrics recover homogeneity between items with respect to the underlying latent classes should provide a measure of the usefulness of the distance metrics for use with high dimensional data.
Q2: Which Minkowski metric is most appropriate for calculating distances for very high dimensional data? Based upon the theoretical results from the data mining literature, the city-block metric should perform better on high dimensional data sets than the Euclidean metric or Minkowski metrics with p > 2.
In order to test Q1 and Q2 we calculated distances between items on high dimensional data sets and then calculated neighborhood recovery with respect to the underlying latent classes. We considered six different data sets. Four of the data sets were high dimensional document data sets where each word corresponds to a single dimension in the data set. Given the large number of possible words relative to the size of the documents, the document data are rather sparse, with a small number of non-zero entries relative to the total number of entries. Two non-document data sets were included in order to compare between types of data sets. All six data sets had items grouped into underlying latent classes.
The four high dimensional document data sets were taken from the CLUTO clustering data repository [9]. The k1a data set (see [3] for a description) originated from the WebACE project and describes webpage documents. The documents are split into classes using Yahoo subject categories. The fbis and la data sets are taken from [13]. The fbis data set is from the Foreign Broadcast Information Service and each latent class is a category of broadcast. For the la data set, each document is a news story in the Los Angeles Times and the documents are categorized into classes using the ‘desk’ of the paper from which the article was sourced. The OHSUM data set is taken from the OHSUMED database of medical journal articles. The articles are categorized using medical document categories. The data are described in [8]. The data sets used are summarized in Table 1.
The isolet and SEG data sets are intended to provide a comparison with the high-dimensional document data sets. These data sets contain dimensions that are not sparse (i.e., have non-zero values for most items) and that measure continuous attributes. The isolet data set has a relatively high dimensionality (617) for this type of data set and the SEG data set has a lower dimensionality (18). The isolet data set is taken from a speech recognition classification task. Each item in the data is the recording of a letter being spoken. The latent classes for the data are the 26 letters of the alphabet. Each dimension is a sensor reading for the recording. A description of the data set is given in [6]. The SEG data set is taken from a data set stored at the UCI Machine Learning Repository and donated by the Vision Group at the University of Massachusetts [4]. Each item in the SEG data set contains measurements taken on a subset of a certain image. The different images from which the measurements are taken correspond to the latent classes for the data set and each dimension corresponds to a measurement on the image subset.
Table 1. Data set summary

Data    Dimensions   Type                Data set description
fbis    2463         Sparse Count        [13]
isolet  617          real (non-sparse)   [6]
k1a     21839        Sparse Count        [3]
la      31472        Sparse Count        [13]
OHSUM   11465        Sparse Count        [8]
SEG     18           real (non-sparse)   [4]
For each data set, we took four sets of 500 items and derived distance matrices from the data. We calculated separate distance matrices for the Euclidean, city-block, Minkowski with p = 3, 4, 10, cosine, correlation, and extended Jaccard metrics. In total, 6 × 4 × 7 = 168 distance matrices were calculated. From the distances we tested the quality of the distance representations with respect to the underlying latent classes. In order to determine the quality of the distance representations we tested the homogeneity of item neighborhoods with respect to the underlying latent classes. As mentioned previously, one would expect high quality distance representations to have more homogenous class labels for pairs of items with short distances between them, and less homogenous class labels for pairs of items with large distances between them. We test this homogeneity using three measures: the nearest neighbor, the decile lift, and the GINI coefficient. When describing these measures, it is assumed that the k nearest neighbors of a point i are the k points with the smallest values of the distance d(i,k) using the distance metric to be tested.
Many data analysis methods, including many partitioning and hierarchical clustering algorithms, use the concept of the nearest or furthest neighbor from either an item to be clustered or from a cluster centroid. Thus, an accurate representation of item neighborhoods is a necessary condition for the strong performance of these methods. The nearest neighbor (NN) is the simplest of the approaches described and measures the latent class homogeneity with respect to the single nearest neighbor of each point. Consider each point i in the configuration and set a_i to be 1 if the nearest neighbor of i has the same latent class as i, and otherwise set a_i to be 0. The value of the nearest neighbor metric is given in (6). The value of the metric varies between 0 and 1, with higher values indicating strong neighborhood recovery with respect to the latent classes.
NN = \frac{1}{n} \sum_{i=1}^{n} a_i    (6)
Both the decile lift (DL) metric and the GINI coefficient are often used as measures of classification performance, e.g., [10]. A modified version of the decile lift for calculating medium-range neighborhood recovery is given in (7). The decile lift measures the increase in the proportion of the 10% nearest neighbors of an item having the same latent class as the item over the proportion of that latent class across all items (except the tested point) in the configuration. Let a_{i,j} be 1 if the jth nearest neighbor of item i has the same latent class as point i and 0 otherwise. The decile lift over all n points is given in (7).
DL = \frac{1}{n} \sum_{i=1}^{n} \frac{ \frac{1}{\lfloor (n-1)/10 \rfloor} \sum_{j=1}^{\lfloor (n-1)/10 \rfloor} a_{i,j} }{ \frac{1}{n-1} \sum_{j=1}^{n-1} a_{i,j} }    (7)
The GINI coefficient [5] was originally developed to measure a nation's level of income equality/inequality. Consider a graph with the cumulative share of the population plotted on the horizontal axis and the cumulative share of income plotted on the vertical axis. A straight line indicates an equal distribution of income and a line forming a convex function below the straight line indicates some income inequality (assuming that people are ordered in terms of increasing wealth). The GINI coefficient for a given convex income line is the proportion of the area below the linearly increasing income equality line that is above the convex income line. We use the GINI coefficient as a measure of neighborhood recovery throughout item neighborhoods of all sizes. For an item i, the other n − 1 items are ordered in decreasing order of distance from item i. Taking each item i, c_{i,j} is the number of items with the same class label as i among items 1 to j, with items ordered in decreasing distance from i:

c_{i,j} = \sum_{k=1}^{j} a_{i,k}    (8)
Given some homogeneity of items with respect to latent classes, it would be expected that the value of c_{i,j} will increase more slowly for the first items, which are farther away from item i, than for the later items, which are closer. Greater “inequality” of the cumulative distribution leads to a more strongly convex graph and thus a larger GINI coefficient. It is possible that most items with the same latent class as item i have the largest distances from the item; this would lead to a concave graph and a negative value of the GINI coefficient. The equation for the GINI coefficient, aggregated over all items, is given in (9).

G(S) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{n-1} \left( n - 2\, \frac{\sum_{j=1}^{n-1} (n-j)\, c_{i,j}}{\sum_{j=1}^{n-1} c_{i,j}} \right) \right] = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{2 \sum_{j=1}^{n-1} j \cdot c_{i,j}}{(n-1) \sum_{j=1}^{n-1} c_{i,j}} - \frac{n}{n-1} \right]    (9)
In order to test Q1 and Q2, the values of the nearest neighbor, decile lift, and GINI coefficient measures were calculated for each combination of data set and distance metric. Based upon the previously described theoretical work of [1] and the experimental work of [12], one would expect distances calculated using the cosine, correlation, or extended Jaccard distances to be better at recovering neighborhood structure than those calculated with the Euclidean metric or Minkowski-p metrics with higher values of p. One would expect distances calculated with the city-block metric to be intermediate to the results of the other distance metrics.
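A minimal NumPy sketch of the three measures in (6), (7) and (9) is given below; it is an illustration only, not the code used in the experiments, and the data-generating lines at the end are hypothetical.

import numpy as np

def neighborhood_measures(D, labels):
    """D: n x n distance matrix; labels: length-n array of latent class labels."""
    n = len(labels)
    nn_hits, lifts, ginis = [], [], []
    for i in range(n):
        d_others = np.delete(D[i], i)
        lab_others = np.delete(labels, i)
        order = np.argsort(d_others)                          # nearest first
        a = (lab_others[order] == labels[i]).astype(float)    # a_{i,j}, j = 1..n-1
        nn_hits.append(a[0])                                  # nearest neighbor, (6)
        k = (n - 1) // 10
        if a.mean() > 0:
            lifts.append(a[:k].mean() / a.mean())             # decile lift, (7)
            c = np.cumsum(a[::-1])            # c_{i,j}, items ordered farthest first
            j = np.arange(1, n)
            ginis.append(2 * np.sum(j * c) / ((n - 1) * c.sum()) - n / (n - 1))  # (9)
    return np.mean(nn_hits), np.mean(lifts), np.mean(ginis)

# Hypothetical demonstration data: four noisy latent classes in 50 dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
X = rng.normal(size=(200, 50)) + labels[:, None]
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))   # Euclidean distances
print(neighborhood_measures(D, labels))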
2.1 Results
The results for the 6 files, 4 item sets per file, and 7 distance metrics per item set are summarized in Tables 2–4.

Table 2. Summary of results for nearest neighbor

Avg of NN   fbis     isolet   k1       la12     OH       segmt
CB          0.4715   0.7245   0.191    0.441    0.176    0.945
Corr        0.709    0.7625   0.439    0.4765   0.4815   0.9245
Cos         0.7095   0.754    0.4395   0.4735   0.481    0.931
Euclid      0.6145   0.7585   0.251    0.3935   0.301    0.925
Mi10        0.5825   0.646    0.372    0.238    0.331    0.9
Mi3         0.6185   0.779    0.334    0.376    0.3485   0.914
Mi4         0.6005   0.767    0.3555   0.34     0.3405   0.9105
ExJac       0.696    0.7545   0.4465   0.7635   0.4805   0.9245
Table 3. Summary of results for decile lift
Avg of DL   fbis     isolet   k1       la12     OH       segmt
CB          2.2883   6.8522   1.1573   1.9740   1.2121   4.7414
Corr        4.7988   7.1040   2.1508   1.9641   2.7693   4.4367
Cos         4.7623   7.0248   2.1611   1.9633   2.7669   4.5034
Euclid      2.8080   7.0870   1.1873   1.6225   1.3514   4.5801
Mi10        2.9050   5.9859   1.4411   1.0461   1.6257   4.1044
Mi3         2.9458   7.2547   1.2628   1.2474   1.4811   4.4317
Mi4         2.9488   7.2652   1.3139   1.1683   1.5324   4.3405
ExJac       4.4288   7.0240   2.4117   2.4538   2.7628   4.5785
Table 4. Summary of results for GINI coefficient
Avg of GINI   fbis     isolet   k1       la12     OH       segmt
CB            0.3700   0.7949   0.3688   0.4189   0.3853   0.7004
Corr          0.6910   0.7975   0.4799   0.4260   0.5106   0.7032
Cos           0.6851   0.7938   0.4800   0.4254   0.5085   0.7128
Euclid        0.3646   0.8005   0.3764   0.3637   0.3766   0.6370
Mi10          0.3618   0.7413   0.3614   0.3376   0.3596   0.5058
Mi3           0.3621   0.8027   0.3720   0.3397   0.3688   0.5799
Mi4           0.3619   0.8026   0.3686   0.3388   0.3660   0.5570
ExJac         0.6025   0.7937   0.4786   0.4107   0.5078   0.7137
The data were analyzed using the MANOVA technique. Both the distance metric and the file used were considered to be factors, with each of the four random item sets considered as a single response for a combination of metric and file. The values of the nearest neighbor, the decile lift, and the GINI coefficient were taken as dependent variables. Using Wilks's Lambda, both factors and the interaction between the factors were significant, with p = 0.000 to 3 s.f. for all values. Tests of between-subject effects were significant for all combinations of dependent variables and factors, again with p = 0.000 for all values. Post hoc tests were carried out on the marginal means of the different distance metrics for all three dependent variables. The Scheffé test [11] was used, as it is a fairly conservative and robust post-hoc test. These post-hoc tests show strong performance of the cosine and correlation metrics relative to the Minkowski-p distances, but we note that the data sets are biased towards high dimensional data sets.
We aim to gain more insight into the performance of the metrics by plotting, for each dependent variable, the marginal means of each file and metric. The metrics are given on the horizontal axis, the marginal means are given on the vertical axis, and separate series are plotted for each file. The plots for the nearest neighbor, decile lift, and GINI coefficient are given in Figs. 1, 2, and 3, respectively. We wish only to compare plots for individual files: to compare performance between series would not be appropriate, as the different data sets have different numbers and distributions of latent classes. Multiple series are only displayed on a single graph in order to aid parsimony.
The cosine, correlation, and extended Jaccard metrics outperform the Minkowski metrics. All three graphs show several clear trends. For three of the four high dimensional document data sets, the cosine and correlation metrics show superior performance to the Minkowski metrics. The one data set that does not conform to this trend is the la data set, for which the city-block metric produces virtually identical performance to the correlation and cosine metrics. In fact, the results are not as expected, as the outperformance of the cosine and correlation metrics over the city-block and Euclidean metrics on the high-dimensional document data sets is not related to the dimensionality of the data. For example, the fbis data set, which at 2463 has the lowest dimensionality of all of the high dimensional document data sets, shows the greatest performance increase for the correlation and cosine metrics over the Euclidean metric. There is also little difference between the performances of the Minkowski-p metrics for p ≥ 2. Particularly for the high-dimensional document data sets, there is little difference in performance between p = 2, 3, and 4. The exceptions to this are the very high dimensional la and k1a data sets, where this effect appears to be present. There is a significant decrease in performance for some of the data sets with the p = 10 metric, particularly using the larger neighborhood measures, the decile lift and the GINI coefficient. It would be expected, given past experimental work and equation (2), that there would be a decreasing level of performance as p increases from 2 to 4, but that is not the case for all data sets. We explore possible reasons for these results in the theoretical discussion in the next section.
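The analysis could be reproduced along the following lines (a sketch only; it assumes statsmodels and pandas are available, and the file name and the column names nn, dl, gini, metric and file are hypothetical):

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Each row: one of the 168 metric-file-sample combinations with its three measures.
df = pd.read_csv("neighborhood_measures.csv")   # hypothetical results file

# Two-factor MANOVA with the nearest neighbor, decile lift and GINI coefficient
# as dependent variables; Wilks' lambda appears in the mv_test() output.
fit = MANOVA.from_formula("nn + dl + gini ~ C(metric) * C(file)", data=df)
print(fit.mv_test())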
(Figure: “Estimated Marginal Means of Nearest Neighbor”; means plotted by metric (ExJac, Corr, Cos, CB, Euclid, Mi3, Mi4, Mi10), with one series per file: fbis, isolet, k1, la12, OH, segmt.)
Fig. 1. Estimated marginal means of nearest neighborhood for high-dimensional distances
(Figure: “Estimated Marginal Means of Decile Lift”; same layout, with means plotted by metric and one series per file.)
Fig. 2. Estimated marginal means of decile lift for high-dimensional distances
(Figure: “Estimated Marginal Means of GINI Coefficient”; same layout, with means plotted by metric and one series per file.)
Fig. 3. Estimated marginal means of GINI coefficient for high-dimensional distances
3 Theoretical Discussion
Past theoretical work has shown that, for the family of Minkowski-p metrics, distances between points become increasingly compressed as the dimensionality of the data increases, which leads to an inability to distinguish between neighborhoods of points. Past empirical work has tested the performance of both Minkowski-p and non-Minkowski-p distance metrics. In these experiments the homogeneity of clustering solutions with respect to latent classes is tested. The results from these experiments show that the cosine, correlation, and extended Jaccard metrics outperform the Minkowski-p distance metrics on these data. It was concluded from these experiments that the cause of this performance disparity is the previously described phenomenon of distance compression on high dimensional data. Here the homogeneity of the clustering solutions relative to a set of latent classes is used as a proxy for the neighborhood recovery of points relative to the latent classes.
There are several aspects of our results that do not mesh with the previously described work. These are listed below:
The difference in performance between the cosine and correlation metrics and the Minkowski-p metrics is highest for the data set with the lowest dimensionality (fbis) and lowest for the data set with the highest dimensionality (la). Given the inequality (2), one would expect the opposite.
Except for the very high dimensional data set, there is little difference in the performance of the Minkowski-p metrics for p = 2, 3, and 4. As there is a large difference in performance between the Minkowski-p metrics and the non-Minkowski metrics ascribed to (2), one would expect a strong difference in performance between
the Minkowski metrics. With increasing dimensionality, and given (2), there is a constant bound on the differences between distances for p = 2 and a converging bound for p > 2, so one would expect differences in performance that, in the experimental results, do not occur or occur only slightly.
We account for these inconsistencies as follows. We believe that there are two effects at work: a normalization effect and a dimensionality effect. The normalization effect is due to the nature of the document data tested and was misinterpreted as a dimensionality effect in previous studies because document data were the only type of high dimensional data tested in those studies. The normalization effect is strongest in the document data sets with lower dimensionality and less strong in document data sets with very high dimensionality. We found evidence of the dimensionality effect in the high dimensional la data set, but believe that the dimensionality effect is overstated for data sets with dimensionality between 2,000 and 20,000 and that most of the performance difference between the cosine and correlation metrics and the Minkowski-p metrics is due to the normalization effect.
So what is meant by normalization, and how does this affect the Minkowski-p distance metrics relative to the cosine, correlation, and extended Jaccard metrics? Consider a document data set consisting of newspaper stories, with latent classes corresponding to the story categories (e.g., news, finance, sport, house & home, business, etc.). Consider two stories, one a 2000 word house & home feature detailing the revitalization of a city neighborhood and the other a 200 word sports bulletin. Both stories may mention the word tennis two times, the first in the context of the local sports facilities and the second in the context of the sports results. However, this similarity is misleading, as only a very small proportion of the first article is concerned with tennis and the article belongs to a different latent class than the sports bulletin. If the data were proportioned or normalized by dividing each word count by either the total number of words in the document or by the Euclidean norm of the story vector, then the similarities/distances between the story vectors might provide stronger predictors of the story latent class.
The cosine and correlation metrics are strongly related to the Euclidean metric. In fact, one can describe these metrics as special cases of the squared Euclidean metric on appropriately normalized data. The squared Euclidean metric is monotone increasing with respect to the Euclidean metric, so it can be shown that the Euclidean metric with appropriately normalized data produces the same item neighborhoods as the cosine metric.
Theorem 1. Let X be an m × n data matrix with m items and n dimensions. Let X be normalized to X_norm so that each row x_i is normalized to have \|x_i\| = 1 for all i = 1, ..., m. Let D1 be the m × m distance matrix where D1_{ij} is the cosine distance between row vectors x_i and x_j of X. Let D2 be the m × m distance matrix where D2_{ij} is the Euclidean distance between row vectors of X_norm, so that x*_i = x_i / \|x_i\| and x*_j = x_j / \|x_j\|. Then D1 and D2 share the same neighborhoods, i.e., for any point i and some permutation P of the remaining n − 1 points ordered so that D1_{iP1} ≤ D1_{iP2} ≤ ··· ≤ D1_{iP(n−2)} ≤ D1_{iP(n−1)} and some permutation Q of the remaining n − 1 points ordered so that D2_{iQ1} ≤ D2_{iQ2} ≤ ··· ≤ D2_{iQ(n−2)} ≤ D2_{iQ(n−1)}, then P = Q.
Proof. Take any arbitrary point i in the configuration X_norm. Arrange the remaining n − 1 points in ascending order of Euclidean distance from point i. The permutation of points in order of distance is given as D2_{iQ1} ≤ D2_{iQ2} ≤ ··· ≤ D2_{iQ(n−1)}. By the distance axioms, all distances are nonnegative, so squared Euclidean distances are monotone increasing with Euclidean distances, and (10) holds.

D2_{iQ1} \le D2_{iQ2} \le \cdots \le D2_{iQ(n-2)} \le D2_{iQ(n-1)} \;\Leftrightarrow\; D2^{2}_{iQ1} \le D2^{2}_{iQ2} \le \cdots \le D2^{2}_{iQ(n-2)} \le D2^{2}_{iQ(n-1)}    (10)

The squared Euclidean distances for X_norm can be expressed as

D_{ij}^{2} = (\mathbf{x}^{*}_i - \mathbf{x}^{*}_j)(\mathbf{x}^{*}_i - \mathbf{x}^{*}_j)' = \|\mathbf{x}^{*}_i\|^{2} + \|\mathbf{x}^{*}_j\|^{2} - 2\,\langle \mathbf{x}^{*}_i, \mathbf{x}^{*}_j \rangle,    (11)
where x*_i and x*_j are row vectors of X_norm. From the definition of x*_i and x*_j, \|x*_i\| = \|x*_j\| = 1, giving

D_{ij}^{2} = 2 - 2\,\langle \mathbf{x}^{*}_i, \mathbf{x}^{*}_j \rangle = 2\left(1 - \langle \mathbf{x}^{*}_i, \mathbf{x}^{*}_j \rangle\right).    (12)

Again, from the definition of x*_i and x*_j,

\langle \mathbf{x}^{*}_i, \mathbf{x}^{*}_j \rangle = \left\langle \frac{\mathbf{x}_i}{\|\mathbf{x}_i\|}, \frac{\mathbf{x}_j}{\|\mathbf{x}_j\|} \right\rangle = \frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}.    (13)

Substituting (13) into (12) and applying the cosine rule to the angle between the two vectors x_i and x_j gives (14), which is the definition of the cosine distance.
\frac{D_{ij}^{2}}{2} = 1 - \frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|} = 1 - \cos\theta.    (14)
By (14), half of the squared Euclidean distances on X_norm are equal to the cosine distances on X, and so have the same neighborhoods; and since D_{ij}^{2}/2 is monotone with respect to D_{ij}^{2}, the squared Euclidean distances on X_norm have the same neighborhoods as the cosine distances on X. By (10), the cosine distances on X have the same neighborhoods as the Euclidean distances on X_norm.
Theorem 2. Given the mean centered matrix X^{\bullet} = X - \mathbf{1}_m \bar{x}, Euclidean distances on the normed mean centered matrix X^{\bullet}_{norm} produce the same neighborhoods as the correlation distances applied to X.
Proof: The proof follows as for Theorem 1, except that (13) becomes

\langle \mathbf{x}^{\bullet}_i, \mathbf{x}^{\bullet}_j \rangle = \left\langle \frac{\mathbf{x}_i - \mathbf{1}\bar{x}_i}{\|\mathbf{x}_i - \mathbf{1}\bar{x}_i\|}, \frac{\mathbf{x}_j - \mathbf{1}\bar{x}_j}{\|\mathbf{x}_j - \mathbf{1}\bar{x}_j\|} \right\rangle = \frac{\langle \mathbf{x}_i - \mathbf{1}\bar{x}_i, \mathbf{x}_j - \mathbf{1}\bar{x}_j \rangle}{\|\mathbf{x}_i - \mathbf{1}\bar{x}_i\|\,\|\mathbf{x}_j - \mathbf{1}\bar{x}_j\|} = r(\mathbf{x}_i, \mathbf{x}_j),    (15)
where r(\mathbf{x}_i, \mathbf{x}_j) is the correlation between x_i and x_j. It follows that

\frac{D_{ij}^{2}}{2} = 1 - \langle \mathbf{x}^{\bullet}_i, \mathbf{x}^{\bullet}_j \rangle = 1 - r(\mathbf{x}_i, \mathbf{x}_j).    (16)
The remainder of the proof is the same as that for Theorem 1. Thus, the Euclidean metric applied to X_norm and X^{\bullet}_{norm} produces the same item neighborhoods as, respectively, the cosine and correlation metrics applied to X.
Given both the experimental and theoretical results, we believe that most of the difference between the performance of the Minkowski-p metrics and the correlation and cosine metrics is due to the inbuilt normalization of the latter two metrics. We cannot completely discount the possibility that the compression of high dimensional distances shown in (2) affects normalized data and non-normalized data differently, but the fact that the outperformance of the cosine and correlation metrics relative to the Minkowski-p metrics is inversely proportional to dimensionality suggests that distance compression is not a major factor in the difference in performance between the metrics.
The conclusion, in previous experimental work, that the difference in performance between the Minkowski-p metrics and the cosine and correlation metrics is due to distance compression is probably due to a couple of reasons. The first is that real world data sets of very high dimensionality are rare. Most of the high dimensional data sets are document data sets, in which words are considered as dimensions. In previous research these data sets are often taken as a proxy for all high dimensional data; yet these data are sparse count data, which cannot be taken as typical of all high dimensional data. Also, previous research tested both the clustering/classification algorithm and the distance metrics in a single experiment, which makes it much harder to detect patterns in the results than if neighborhood recovery is tested directly.
We do believe that the compression of distances shown in (2) is a factor in the relative performance of distance metrics in high dimensional space, and we make no attempt to question the validity of the work described in [1] and [2]. We note that some differences between the performances of the Minkowski-p distance metrics become apparent in the 21,839-dimensional k1a and 31,472-dimensional la data sets. We also note some degradation in the performance of the cosine and correlation distance metrics relative to the extended Jaccard metric on the la data set. This may be because distance compression affects neighborhood recovery for the Euclidean metric, and thus the neighborhood recovery for the cosine and correlation metrics, owing to the neighborhood agreement shown in Theorems 1 and 2. We conclude that the use of Minkowski-p metrics on non-normalized document data is not appropriate for applications reliant on the preservation of neighborhood structure. It is appropriate to use the Euclidean metric on normalized data or on mean centered and normalized data; the resulting neighborhood structures are identical to those produced by the cosine and correlation distances, respectively. For very high dimensional data it may be more appropriate to use the extended Jaccard metric rather than the previously mentioned metrics.
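As a quick numerical illustration of Theorems 1 and 2 (ours, not part of the original paper; it assumes SciPy is available), the relations in (14) and (16) can be checked directly on random data: the squared Euclidean distances on the normalized data, divided by two, equal the cosine and correlation distances on the raw data, which implies identical item neighborhoods.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.random((50, 30))

# Row-normalize, and mean-center then row-normalize.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
Xc = X - X.mean(axis=1, keepdims=True)
Xc_norm = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

D_cos, D_eu1 = cdist(X, X, "cosine"), cdist(X_norm, X_norm, "euclidean")
D_cor, D_eu2 = cdist(X, X, "correlation"), cdist(Xc_norm, Xc_norm, "euclidean")

print(np.allclose(D_eu1 ** 2 / 2.0, D_cos))   # True: relation (14), Theorem 1
print(np.allclose(D_eu2 ** 2 / 2.0, D_cor))   # True: relation (16), Theorem 2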
4 Future Work

In this paper we have proposed that the difference in performance of distance metrics for the preservation of neighborhood structure is due to two main factors, that of normalization and that of distance compression. Future work could help clarify several aspects of this performance relationship.

Firstly, the only high dimensional data sets tested in either this paper or in previous high dimensional work are document data sets. The SEG and isolet data sets were not of this type but were of lower dimensionality. Both of these data sets had continuous valued dimensions, with each dimension measured on a different scale; thus, the data were standardized before applying the Euclidean metric. If high dimensional continuous data, with dimensions measured on different scales, are not available, then it may be possible to artificially generate such data. Further testing on additional high dimensional document data sets would aid the discovery of the relationship between the normalization and distance compression effects. By testing the differences between the performances of the Minkowski-p metrics, a measure of the effects of distance compression on neighborhood preservation could be gained. The shape of the curve formed by plotting the value of p against the three neighborhood preservation metrics, and the change in gradient of the curve, could provide measures of distance compression. The differences in performance between the cosine or correlation metrics and the Euclidean metric on the original data or mean centered data, respectively, could provide a measure of the level of the normalization effect for data of differing dimensionalities. The degradation in performance of the cosine/correlation metrics with respect to the extended Jaccard metric could provide a measure of the level of distance compression. There may be some asymptotic effect as the dimensionality increases towards infinity that could be investigated theoretically.

The analogy of the 2000 word house & home feature and the 200 word sports bulletin gives some justification for the normalization effect. However, one could make the same argument using proportions (dividing by the total number of words in the document) rather than using the Euclidean norm. It may be useful to test both proportions data and normalized data using a range of Minkowski-p metrics and compare these results with those from the cosine and correlation metrics.

In this paper we explore the relationship between the Euclidean, squared Euclidean, cosine, and correlation metrics. There is little theoretical discussion of the properties of the extended Jaccard metric, either in this paper or in other papers dealing experimentally with the effects of high dimensional data. It may be useful to theoretically explore the properties of the extended Jaccard metric, both with respect to the work in [1] and [2] and to the relationships described in Theorems 1 and 2.

Only Minkowski-p metrics with p ≥ 1 were considered. [7] extends [1] and explores compression in fractional (Minkowski-p with p < 1) distance metrics, and concludes that fractional distance metrics have fewer problems with distance compression than do Minkowski-p metrics with p ≥ 1. It may, however, be difficult to implement fractional distance metrics in practical data-analysis applications, as the fractional distance metrics violate the triangle inequality. The modified Minkowski distances (omitting the 1/p term) do not violate the triangle inequality for fractional values of p, but do violate the law of segmental additivity. For a metric to satisfy segmental additivity, for any pair of points i and k there must exist a third point j, distinct from i and k, such that $d_{ik} = d_{ij} + d_{jk}$. This property is defined in [14] as a necessary property for a "psychological" metric.
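A small sketch of ours makes the point about fractional metrics concrete (the three points and the value p = 0.5 are our own choice of counterexample, not taken from [7]):

    import numpy as np

    def minkowski(x, y, p):
        # Standard Minkowski-p distance, including the 1/p exponent.
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def modified_minkowski(x, y, p):
        # Modified form discussed above: the 1/p exponent is omitted.
        return np.sum(np.abs(x - y) ** p)

    x, y, z = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])
    p = 0.5

    # The fractional Minkowski-p distance violates the triangle inequality ...
    print(minkowski(x, z, p), ">", minkowski(x, y, p) + minkowski(y, z, p))    # 4.0 > 2.0
    # ... while the modified form does not (here it meets it with equality).
    print(modified_minkowski(x, z, p), "<=",
          modified_minkowski(x, y, p) + modified_minkowski(y, z, p))           # 2.0 <= 2.0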
References

1. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2001)
2. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "nearest neighbor" meaningful? In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1999)
3. Boley, D., Gini, M., Goss, R., et al.: Partitioning-Based Clustering for Web Document Categorization. Decision Support Systems 27, 329–341 (1999)
4. Statlog (Image Segmentation) Data Set, http://archive.ics.uci.edu/ml/datasets/Statlog+%28Image+Segmentation%29
5. Corrado, G.: Measurement of Inequality of Incomes. The Economic Journal 31, 124–126 (1921)
6. Fanty, M., Cole, R.: Spoken Letter Recognition. In: Lippman, R.P., Moody, J., Touretzky, D.S. (eds.) Advances in Neural Information Processing Systems, vol. 3, pp. 220–226. Morgan Kaufmann, San Mateo (1990)
7. Francois, D., Wertz, V., Verleysen, M.: The Concentration of Fractional Distances. IEEE Transactions on Knowledge and Data Engineering 19, 873–886 (2007)
8. Hersh, W., Buckley, C., Leone, T.J., Hickam, D.: OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In: Croft, W.B., Van Rijsbergen, C.J. (eds.) Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 192–201. Springer, New York (1994)
9. CLUTO: Software for Clustering High-Dimensional DataSets, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
10. Neslin, S.A., Gupta, S., Kamakura, W.A., Lu, J., Mason, C.H.: Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models. Journal of Marketing Research 43, 204–211 (2006)
11. Scheffé, H.: The Analysis of Variance. John Wiley & Sons, New York (1959)
12. Strehl, A., Ghosh, J., Mooney, R.: Impact of Similarity Measures on Web-Page Clustering. In: Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), pp. 58–64. AAAI, Cambridge (2000)
13. TREC Text REtrieval Conference, http://trec.nist.gov
14. Tversky, A., Krantz, D.H.: The Dimensional Representation and the Metric Structure of Similarity Data. Journal of Mathematical Psychology 7, 572–596 (1970)
15. Verleysen, M., Francois, D., Simon, G., Wertz, V.: On the Effects of Dimensionality on Data Analysis with Neural Networks. In: Mira, J., Álvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 105–112. Springer, Heidelberg (2003)
Fast Local Support Vector Machines for Large Datasets

Nicola Segata and Enrico Blanzieri
DISI, University of Trento, Italy
[email protected], [email protected]
Abstract. Local SVM is a classification approach that combines instance-based learning and statistical machine learning. It builds an SVM on the feature space neighborhood of the query point in the training set and uses it to predict its class. There is both empirical and theoretical evidence that Local SVM can improve over SVM and kNN in terms of classification accuracy, but the computational cost of the method permits its application only to small datasets. Here we propose FastLSVM, a classifier based on Local SVM that decreases the number of SVMs that must be built in order to be suitable for large datasets. FastLSVM precomputes a set of local SVMs in the training set and assigns to each model all the points lying in the central neighborhood of the k points on which it is trained. The prediction is performed by applying to the query point the model corresponding to its nearest neighbor in the training set. The empirical evaluation we provide points out that FastLSVM is a good approximation of Local SVM and that, on big datasets (a large artificial problem with 100000 samples and a very large real problem with more than 500000 samples), it dramatically improves on the computational performance of SVM and of its existing fast approximations while also improving the generalization accuracy.
1 Introduction
The direct integration of k-nearest neighbors (kNN) with support vector machines (SVM) has been proposed in [1]. The algorithm, which belongs to the class of local learning algorithms [2], is called kNNSVM, and it builds a maximal margin classifier on the neighborhood of a test sample in the feature space induced by a kernel function. Theoretically, it permits better generalization power than SVM because, like all local learning algorithms, the locality parameter permits finding a lower minimum of the guaranteed risk [3,4] and because it can have, for some values of k, a lower radius/margin bound [5]. It has been successfully applied to remote sensing tasks [1] and on 13 small benchmark datasets [6], confirming the potentialities of this approach. kNNSVM can be seen as a method for integrating locality in kernel methods, compatible with the traditional strategy of using local non-stationary kernel functions [7], and it is particularly indicated for
non high-dimensional problems, i.e. for data requiring some non-linear mapping (kernel) to be successfully tackled.

The main drawback of the original idea of Local SVM concerns the computational performance. The prediction phase is in fact very slow, since for each query point it is necessary to train a specific SVM before performing the classification, in addition to selecting its k-nearest neighbors on which the local SVM is trained. A similar method has been independently proposed in [8]; however, in that work the distance function for the kNN operations is computed in the input space and is approximated with a "crude" distance metric in order to improve the computational performance.

In this work we developed a fast local support vector machine classifier, called FastLSVM, introducing various modifications to the Local SVM approach in order to make it scalable and thus suitable for large datasets. Differently from [8], we maintain the feature space metric for the nearest neighbor operations and we do not adopt any approximation of the distance function, and thus of the neighborhood selection. We aim, in fact, to be as close as possible to the original formulation of kNNSVM in order to maintain its theoretical and empirical advantages over SVM. Moreover, our intuition is that, in general, as the number of training samples increases, the positive effect of locality on classification accuracy also increases. Roughly speaking, the idea is to precompute a set of local SVMs covering (with redundancy) all the training set and to apply to a query point the model to which its nearest neighbor in the training set has been assigned. The training time complexity analysis reveals that the approach is asymptotically faster than the state-of-the-art accurate SVM solvers, and the training of the local models can be very easily parallelized. Notice that the issue of scalability for the local SVM approach is particularly appealing also because our intuition is that locality can play a more crucial role as the problem becomes larger and larger and the ideal decision function is complex and highly non-linear. The source code of FastLSVM is part of the Fast Local Kernel Machine Library (FaLKM-lib) [9], freely available for research and education purposes; the FastLSVM implementation we use in this work is a preliminary version of the FaLK-SVM classifier available in FaLKM-lib.

In the rest of the introduction we briefly review the related work and the main topics necessary to understand the FastLSVM approach discussed in Section 2. Section 3 details the experimental evaluation we conducted, before drawing some conclusions and discussing further extensions in Section 4.

1.1 Related Work
An attempt to computationally unburden the Local SVM approach of [8] has been proposed in [10] where the idea is to train multiple SVMs on clusters retrieved with a k-means based algorithm; however, differently from this work the method does not follow directly the idea of kNNSVM, it can build only local linear models, the clustering method considers together training and testing sets, the neighborhood is retrieved only in input space and the testing point can lie in
very peripheral regions of the local models. Moreover, the clusters have problems of class balancing and their dimensions cannot be controlled, thus not assuring that the SVM optimization remains small enough. The computational performances (only empirically tested on a small dataset) are in fact much worse than SVM (although better than their local approach) and seem to degrade asymptotically much faster than SVM.

Multiple approaches have been proposed in order to overcome the SVM computational limitation for large datasets by approximating the traditional approach. Two of the most popular and effective techniques are Core Vector Machines [11] (CVM), based on minimum enclosing ball algorithms, and LaSVM [12], which introduces an online support vector removal step in the optimization. Other proposed approaches were based on parallel mixtures of SVMs trained on subsets of the training set [13,14], on using editing or clustering techniques to select the more informative samples [15], on training SVMs between clusters of different classes nearest to the query point [16] and on parallel algorithms for the training phase [17,18]. Recently, very fast algorithms have been proposed for linear SVM, like SVMPerf [19] and LibLinear [20]. However, we are focusing here on large datasets with non high-dimensionality, and thus the use of a non-linear kernel is crucial.

It is important to underline, however, that what we are proposing here is not a method to approximate SVM in order to enhance performance. Our main purpose is to make kNNSVM, which has been shown to be more accurate than SVM for small datasets, suitable for large scale problems. Indirectly, since the method is asymptotically faster than SVM, it can be seen as an alternative to SVM for large datasets on which traditional SVM algorithms cannot be directly applied.

1.2 The K-Nearest Neighbors Classifier
Let us assume a classification problem with samples $(x_i, y_i)$, $i = 1, \dots, n$, $x_i \in \mathbb{R}^p$ and $y_i \in \{+1, -1\}$. Given a point $x'$, it is possible to order the entire set of training samples $X$ with respect to $x'$. This corresponds to defining a function $r_{x'}: \{1, \dots, n\} \to \{1, \dots, n\}$ that reorders the indexes of the $n$ training points as follows:

$$
\begin{cases}
r_{x'}(1) = \operatorname{argmin}_{i=1,\dots,n} \|x_i - x'\| \\
r_{x'}(j) = \operatorname{argmin}_{i=1,\dots,n;\; i \neq r_{x'}(1),\dots,r_{x'}(j-1)} \|x_i - x'\| \quad \text{for } j = 2, \dots, n
\end{cases}
$$

In this way, $x_{r_{x'}(j)}$ is the point of the set $X$ in the $j$-th position in terms of distance from $x'$, namely the $j$-th nearest neighbor, $\|x_{r_{x'}(j)} - x'\|$ is its distance from $x'$ and $y_{r_{x'}(j)}$ is its class with $y_{r_{x'}(j)} \in \{-1, 1\}$. In other terms: $j < k \Rightarrow \|x_{r_{x'}(j)} - x'\| \le \|x_{r_{x'}(k)} - x'\|$. With this definition, the majority decision rule of kNN for binary classification is defined by

$$kNN(x) = \operatorname{sign}\Big(\sum_{i=1}^{k} y_{r_x(i)}\Big).$$
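As a minimal illustration (ours; NumPy assumed, and k taken to be odd so that the sign rule cannot be zero), the majority rule can be written directly from this ordering:

    import numpy as np

    def knn_predict(X, y, x_query, k):
        # Majority (sign) rule over the k nearest training points, with y in {-1, +1}.
        order = np.argsort(np.linalg.norm(X - x_query, axis=1))   # the ordering r_x'
        return int(np.sign(np.sum(y[order[:k]])))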
1.3 Support Vector Machines
SVMs [21] are classifiers with sound foundations in statistical learning theory [4]. The decision rule is $SVM(x) = \operatorname{sign}(\langle w, \Phi(x)\rangle_{\mathcal{H}} + b)$, where $\Phi(x): \mathbb{R}^p \to \mathcal{H}$ is a mapping into a transformed Hilbert feature space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. The parameters $w \in \mathcal{H}$ and $b \in \mathbb{R}$ are such that they minimize an upper bound on the expected risk while minimizing the empirical risk. The empirical risk is controlled through the set of constraints $y_i(\langle w, \Phi(x_i)\rangle_{\mathcal{H}} + b) \ge 1 - \xi_i$ with $\xi_i \ge 0$, $i = 1, \dots, n$, where $y_i \in \{-1, +1\}$ is the class label of the $i$-th training sample. The presence of the slack variables $\xi_i$ allows some misclassification on the training set. Reformulating such an optimization problem with Lagrange multipliers $\alpha_i$ ($i = 1, \dots, n$), and introducing a positive definite (PD) kernel function¹ $K(\cdot, \cdot)$ that substitutes the scalar product in the feature space $\langle \Phi(x_i), \Phi(x)\rangle_{\mathcal{H}}$, the decision rule can be expressed as

$$SVM(x) = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\Big).$$
PD kernels avoid the explicit definition of $\mathcal{H}$ and $\Phi$ [22]; the most popular are the linear (LIN) kernel $k^{lin}(x, x') = \langle x, x' \rangle$, the radial basis function (RBF) kernel $k^{rbf}(x, x') = \exp(-\|x - x'\|^2 / \sigma)$, where $\sigma$ is a positive constant, and the inhomogeneous polynomial (IPOL) kernel $k^{ipol}(x, x') = (\langle x, x' \rangle + 1)^d$, where $d$ is the degree of the kernel. SVM has been shown to have important generalization properties and nice bounds on the VC dimension [4]. Computationally, an accurate solver for SVM takes $O(n^2)$ time for computing the kernel values, $O(n^3)$ time for solving the problem and $O(n^2)$ space for storing the kernel values, as discussed in [11,23]; empirical evidence highlights that modern accurate SVM solvers like LibSVM [24] scale effectively between $n^2$ and $n^3$ depending mainly on C (the higher the value of C, the closer the scaling to $n^3$). Approximate solutions (see Section 1.1) can of course lower the computational complexity.
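As a concrete check of the dual decision rule above (our sketch; scikit-learn's LibSVM wrapper is assumed, and note that its RBF kernel is parameterized as exp(-gamma ||x - x'||^2), i.e. gamma plays the role of 1/sigma in the text):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] * X[:, 1] > 0.0, 1, -1)          # a non-linear toy problem

    clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

    def rbf(A, b, gamma):
        return np.exp(-gamma * np.sum((A - b) ** 2, axis=1))

    x_new = rng.normal(size=5)
    # sum_i alpha_i y_i K(x_i, x) + b, where the sum runs over the support vectors only
    f = np.sum(clf.dual_coef_[0] * rbf(clf.support_vectors_, x_new, 0.5)) + clf.intercept_[0]
    assert np.isclose(f, clf.decision_function(x_new.reshape(1, -1))[0])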
1.4 The kNNSVM Classifier
The method [1] combines locality and the search for a large margin separating surface by partitioning the entire Hilbert feature space through a set of local maximal margin hyperplanes. In order to classify a given point $x'$, we first need to find its $k$ nearest neighbors in the feature space $\mathcal{H}$ and, then, to search for an optimal separating hyperplane only over these $k$ neighbors. In practice, this means that an SVM is built over the neighborhood of each test point $x'$. Accordingly, the constraints become:

$$y_{r_{x'}(i)}\big(\langle w, \Phi(x_{r_{x'}(i)})\rangle + b\big) \ge 1 - \xi_{r_{x'}(i)}, \quad \text{with } i = 1, \dots, k,$$

where $r_{x'}: \{1, \dots, n\} \to \{1, \dots, n\}$ is a function that reorders the indexes of the training points as follows:

$$
\begin{cases}
r_{x'}(1) = \operatorname{argmin}_{i=1,\dots,n} \|\Phi(x_i) - \Phi(x')\|^2 \\
r_{x'}(j) = \operatorname{argmin}_{i=1,\dots,n;\; i \neq r_{x'}(1),\dots,r_{x'}(j-1)} \|\Phi(x_i) - \Phi(x')\|^2 \quad \text{for } j = 2, \dots, n
\end{cases}
$$

In this way, $x_{r_{x'}(j)}$ is the point of the set $X$ in the $j$-th position in terms of distance from $x'$ in the feature space, and thus $j < k \Rightarrow \|\Phi(x_{r_{x'}(j)}) - \Phi(x')\| \le \|\Phi(x_{r_{x'}(k)}) - \Phi(x')\|$. The computation is expressed as

$$\|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x)\rangle_{\mathcal{H}} + \langle \Phi(x'), \Phi(x')\rangle_{\mathcal{H}} - 2 \cdot \langle \Phi(x), \Phi(x')\rangle_{\mathcal{H}} = K(x, x) + K(x', x') - 2 \cdot K(x, x').$$

If the kernel is the RBF kernel or any polynomial kernel with degree 1, the ordering function can be built using the Euclidean metric. For non-linear kernels (other than the RBF kernel) the ordering function can be quite different from that produced using the Euclidean metric. The decision rule of this method is:

$$kNNSVM(x) = \operatorname{sign}\Big(\sum_{i=1}^{k} \alpha_{r_x(i)} y_{r_x(i)} K(x_{r_x(i)}, x) + b\Big) \qquad (1)$$

¹ We refer to kernel functions with K and to the number of nearest neighbors with k.
For $k = n$, kNNSVM becomes the usual SVM, whereas for $k = 2$ with LIN or RBF kernels it corresponds to the NN classifier. The method is computationally expensive because, for each test point, it computes the kNN in $\mathcal{H}$, trains an SVM and finally performs the SVM prediction. Implementing kNN by simply sorting the distances, kNNSVM takes $O(n \log n \cdot k^3 \cdot m)$ time for $m$ testing samples.

Like the whole class of local learning algorithms, kNNSVM states the learning problem in a different setting, as detailed in [3]. Basically, instead of estimating a global decision function with the aim of minimizing the probability of error on all possible unseen samples, kNNSVM tries to estimate a decision function that maximizes the probability of correctly labeling a given test point. Notice that for kNN (the simplest local learning algorithm) this learning statement is crucial, because the majority rule is effective only locally (globally it reduces to the class with the highest cardinality). With respect to global SVM, the possibility of estimating a different maximal margin hyperplane for each test point can thus achieve a lower probability of misclassification on the whole test set. These considerations are formalized in the theory of local structural risk minimization for local learning algorithms [3], which is a generalization of structural risk minimization [4]. The main idea is that, in addition to the complexity of the class of possible functions and of the function itself, the choice of the locality parameter (k for kNNSVM) can help to lower the guaranteed risk. An implementation of kNNSVM, called FkNNSVM, is available in the freely available Fast Local Kernel Machine Library (FaLKM-lib) [9].
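The feature space neighbor ordering used above can be computed without ever materializing Phi, using only kernel evaluations. The sketch below is ours (NumPy assumed, brute-force search for clarity):

    import numpy as np

    def rbf_kernel(A, b, gamma=0.5):
        # K(x, x') = exp(-gamma ||x - x'||^2) between each row of A and the point b.
        return np.exp(-gamma * np.sum((A - b) ** 2, axis=1))

    def feature_space_knn(X, x_query, kernel, k):
        # Indices of the k nearest training points in the kernel-induced feature space,
        # via ||Phi(x_i) - Phi(x')||^2 = K(x_i, x_i) + K(x', x') - 2 K(x_i, x').
        k_ii = np.array([kernel(x[None, :], x)[0] for x in X])   # K(x_i, x_i)
        k_qq = kernel(x_query[None, :], x_query)[0]              # K(x', x')
        k_iq = kernel(X, x_query)                                # K(x_i, x')
        return np.argsort(k_ii + k_qq - 2.0 * k_iq)[:k]

For the RBF kernel this ordering coincides with the Euclidean one in input space, as noted in the text; for other non-linear kernels it can differ.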
2 FastLSVM: A Local SVM Approach for Large Datasets
In this section we present FastLSVM, a modified version of the kNNSVM classifier that allows its use on large datasets. As a first step, we can generalize
the decision rule of kNNSVM considering the case in which the local model is trained on a set of points that are the k-nearest neighbors of a point that, in general, is different from the query point. A modified decision function for a query point $x$ and another (possibly different) point $t$ is:

$$kNNSVM_t(x) = \operatorname{sign}\Big(\sum_{i=1}^{k} \alpha_{r_t(i)} y_{r_t(i)} K(x_{r_t(i)}, x) + b\Big)$$
where $r_t(i)$ is the kNNSVM ordering function (see above) and $\alpha_{r_t(i)}$ and $b$ come from the training of an SVM on the k-nearest neighbors of $t$ in the feature space. In the following we will refer to $kNNSVM_t(x)$ as being centered in $t$ and to $t$ as the center of the model. The original decision function of kNNSVM corresponds to the case in which $t = x$, and thus $kNNSVM_x(x) = kNNSVM(x)$.

2.1 A First Approximation of Local SVM
In the original formulation of kNNSVM, the training of an SVM on the k-nearest neighbors of the query point must be performed in the prediction step. Although this approach is convenient when we have a rather large training set and very few points to classify, it introduces a considerable overhead in the prediction step, which is not acceptable in the great majority of classification problems. As a first approximation of kNNSVM, we propose to compute and maintain in memory a set of local SVMs centered on each point of the training set. This unburdens the prediction step, in which it is sufficient to select a model for the query point and use it to perform the classification. In particular, we choose to classify a point $x$ with the precomputed model centered on its nearest point in the training set. Formally, the classification of a point $x$ with this method is $kNNSVM_t(x)$ with $t = x_{r_x(1)}$. The set of precomputed local SVMs in the training set with corresponding central points is $S = \{(t, kNNSVM_t) \mid t \in X\}$. Notice that in situations where the neighbourhood contains only one class, the local model does not find any separation and so considers all the neighbourhood to belong to the predominant class, thus simulating the behaviour of the majority rule.

This approximation slightly modifies the approach of kNNSVM and of local learning algorithms. This is because, instead of estimating the decision function for a given test point, we locally approximate the decision function for a number of subregions of the training set space. The test point is then evaluated using the model built for the subregion on which it lies.
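The sketch below (ours; NumPy and scikit-learn assumed, brute-force neighbor search, with the Euclidean ordering standing in for the feature space one, which is equivalent for the RBF kernel) illustrates this first approximation, including the single-class fallback just described:

    import numpy as np
    from sklearn.svm import SVC

    def train_local_models(X, y, k, gamma=0.5, C=10.0):
        # One local SVM per training point, each trained on that point's k nearest neighbors.
        models = []
        for t in X:
            idx = np.argsort(np.linalg.norm(X - t, axis=1))[:k]
            if len(np.unique(y[idx])) == 1:
                models.append(int(y[idx][0]))     # single-class neighborhood: constant prediction
            else:
                models.append(SVC(kernel="rbf", gamma=gamma, C=C).fit(X[idx], y[idx]))
        return models

    def predict(X, models, x_query):
        m = models[np.argmin(np.linalg.norm(X - x_query, axis=1))]   # model of the nearest point
        return m if isinstance(m, int) else int(m.predict(x_query[None, :])[0])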
2.2 Introducing the Assignment Neighborhood
With the previous modification of kNNSVM we made the prediction step much more computationally efficient, but a considerable overhead is added to the training phase. In fact, the training of an SVM for every point of the training set can be slower than the training of a unique global SVM (especially for non small k values), so we introduce another modification of the method which aims to
drastically reduce the number of SVMs that need to be precomputed. Theoretically, this can cause a loss in classification accuracy, so we must take care not to reduce the number of SVMs too much and to maintain the more representative ones. The modification is based on assigning to the local model centered in a point $c$ not only $c$ itself but also the first $k'$ (with $k' < k$) nearest neighbors of $c$. In this way we aim to make a compromise (controlled by $k'$) between the kNNSVM approach, in which the test point is surrounded by the samples used to build the model, and the need to decrease the total number of SVMs trained. The set of points used to select the k-nearest neighbors for the models is defined as follows.

Definition 1. Given $k' \in \mathbb{N}$, a $k'$-neighborhood covering set of centers $C_{k'} \subseteq X$ is a subset of the training set such that the following holds:

$$\bigcup_{c \in C_{k'}} \{x_{r_c(i)} \mid i = 1, \dots, k'\} = X.$$

Definition 1 means that the union of the sets of the $k'$-nearest neighbors of $C_{k'}$ corresponds to the whole training set. Theoretically, for a fixed $k'$, the minimization of the number of local SVMs that we need to train can be obtained by computing the SVMs centered on the points contained in the minimal $k'$-neighborhood covering set of centers² $C$. However, since computing the minimal $C$ is not a simple and computationally easy task, we choose to select each $c_i \in C$ as follows:

$$c_i = x_j \in X \quad \text{with } j = \min\Big\{ z \in \{1, \dots, n\} \;\Big|\; x_z \in X \setminus \bigcup_{l < i} X_{c_l} \Big\}, \qquad (2)$$

where $X_{c_l} = \{x_{r_{c_l}(h)} \mid h = 1, \dots, k'\}$.

The idea of this definition is to recursively take as centers those points which are not $k'$-neighbors of any point that has already been taken as a center. So $c_1 = x_1$ corresponds to the first point of $X$ since, $c_1$ being the first center, the union of the neighbors of the other centers is empty; $c_2$, instead, is the point with the minimum index taken from the set obtained by eliminating from $X$ all the $k'$-neighbors of $c_1$. The procedure is repeated until all the training points are removed from $X$. $X$ must be thought of here as a random reordering of the training set. This is done in order to avoid the possibility that a training set in which the points are inserted with a particular spatial strategy affects the spatial distribution of the $k'$-neighborhood covering centers.

The reason why we adopt this non-standard clustering method is twofold: on one side we want each cluster to contain exactly $k$ samples in order to be able to derive rigorous complexity bounds; on the other side, in this way we are able to select a variable number of samples that are in the central region (at least from a neighborhood viewpoint) of each cluster. Moreover, the proposed clustering strategy follows quite naturally from the kNNSVM approach.

Differently from the first approximation, in which a local SVM is trained for each training sample, in this case we need to train only $|C|$ SVMs, centered on each $c \in C$, obtaining the following models:

$$kNNSVM_c(x), \quad \forall c \in C.$$

² From now on we simply denote $C_{k'}$ with $C$ because we do not discuss here particular values for $k'$.
Now we have to link the points of the training set with the precomputed SVM models. This is necessary because a point can lie in the $k'$-neighborhood of more than one center. In particular, we want to assign each training point to a unique model such that the point is in the $k'$-neighborhood of the center on which the model is built. Formally, this is done with the function $cnt: X \to C$ that assigns each point in the training set to a center:

$$cnt(x_i) = x_j \in C \quad \text{with } j = \min\big\{ z \in \{1, \dots, n\} \;\big|\; x_z \in C \text{ and } x_i \in X_{x_z} \big\}, \quad \text{where } X_{x_z} = \{x_{r_{x_z}(h)} \mid h = 1, \dots, k'\}. \qquad (3)$$

With the cnt function, each training point is assigned to the first center whose $k'$-nearest neighbors set includes the training point itself. The order of the $c_i$ points derives from the randomization of $X$ used for defining $C$. In this way each training point is uniquely assigned to a center, and so the decision function of this approximation of Local SVM, called FastLSVM, is simply:

$$\mathrm{FastLSVM}(x) = kNNSVM_c(x) \quad \text{with } c = cnt(x_{r_x(1)}). \qquad (4)$$

Algorithm 1 presents the pseudo-code of FastLSVM, implementing the formal definition of Equation 2 for selecting the centers and Equation 3 for assigning each training point to a unique corresponding center and thus to the SVM model trained on the center neighborhood. Algorithm 2 illustrates the prediction step of FastLSVM following Equation 4. Although not deeply discussed and empirically evaluated here, notice that it is not required that all local models share the same hyperparameters. In fact, it is possible to set different parameters for different local models, making it possible to better capture local properties of the data. This can be done with local model selection, e.g. performing cross validation (CV) on the local models or estimating the parameters using local data statistics (as proposed in [11] for the RBF kernel, based on the distance distribution). In particular, setting the σ parameter of the RBF kernel locally leads to a goal very similar to that of traditional RBF-SVM with variable width, which has demonstrated good potential for classification, as shown for example in [25].
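A compact sketch of the training and prediction procedures formalized in Equations 2–4 (and detailed in Algorithms 1 and 2 below) is the following; it is ours, with scikit-learn's SVC standing in for the LibSVM-based local solver and a brute-force Euclidean neighbor search:

    import numpy as np
    from sklearn.svm import SVC

    def fastlsvm_train(X, y, k, k_prime, gamma=0.5, C=10.0, seed=0):
        n = len(X)
        order = np.random.default_rng(seed).permutation(n)   # X as a random reordering
        model_of = np.full(n, -1)      # index of the local model assigned to each training point
        models = []
        for idx in order:
            if model_of[idx] != -1:    # already covered by an earlier center: not a new center
                continue
            nn = np.argsort(np.linalg.norm(X - X[idx], axis=1))[:k]   # k-NN of the new center
            if len(np.unique(y[nn])) == 1:
                models.append(int(y[nn][0]))                  # single-class neighborhood
            else:
                models.append(SVC(kernel="rbf", gamma=gamma, C=C).fit(X[nn], y[nn]))
            for j in nn[:k_prime]:     # assign the model to its k' most central, unassigned points
                if model_of[j] == -1:
                    model_of[j] = len(models) - 1
        return models, model_of

    def fastlsvm_predict(X, models, model_of, x_query):
        m = models[model_of[np.argmin(np.linalg.norm(X - x_query, axis=1))]]
        return m if isinstance(m, int) else int(m.predict(x_query[None, :])[0])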
2.3 Complexity Bounds
Hypothesizing the worst scaling behaviour of $k^3$ for the training of each local SVM model, FastLSVM requires $O(|C| \cdot n \log n + |C| \cdot k^3)$ for training, thus overcoming SVM performance (see Section 1.3). Notice that for $k = k' = n$ we have
Algorithm 1. FastLSVM TRAIN (training set x[], training size n, neighborhood size k, assignment neighborhood size k')
 1: models[] ⇐ null                      //the set of models
 2: modelPtrs[] ⇐ null                   //the set of pointers to the models
 3: c ⇐ 0                                //the counter for the centers of the models
 4: indexes[] ⇐ {1, . . . , n}           //the indexes for centers selection
 5: Randomize indexes                    //randomize the indexes
 6: for i ⇐ 1 to n do
 7:   index ⇐ indexes[i]                 //get the i-th index
 8:   if modelPtrs[index] = null then    //if the point has not been assigned to a model. . .
 9:     localPoints[] ⇐ get ordered kNN of x[index]
10:     models[c] ⇐ SVMtrain on localPoints[]
11:     modelPtrs[index] ⇐ models[c]
12:     for j ⇐ 1 to k' do               //assign the model to the k' most central points
13:       nnIndex ⇐ index of localPoints[j]
14:       if modelPtrs[nnIndex] = null then
15:         modelPtrs[nnIndex] ⇐ models[c]
16:       end if
17:     end for
18:     c ⇐ c + 1
19:   end if
20: end for
21: return models[], modelPtrs[]
Algorithm 2. FastLSVM PREDICT (training set x[], points-to-model pointers modelPtrs, Local SVM models models, query point q)
 1: Set p = get NN of q in x
 2: Set nnIndex = get index of p
 3: return label = SVMpredict q with modelPtrs[nnIndex]
a global SVM computable, as expected, in $O(n \log n + n^3) = O(n^3)$, since $|C| = 1$. FastLSVM testing is instead slightly slower than SVM: $O(n \cdot k \cdot m)$ against $O(n \cdot m)$.

Although not considered in the implemented version, FastLSVM can take great advantage of data structures supporting nearest neighbor searches [26]. For example, using the recently developed cover tree data structure [27], which allows kNN searches in $k \log(n)$ with $n \log n$ construction time, FastLSVM can further decrease its training computational complexity to $O(n \log n + |C| \cdot \log n \cdot k + |C| \cdot k^3)$, which is much lower than the SVM complexity for fixed and non-high values of $k$. Similarly, for testing, the required time becomes $O(\log n \cdot k \cdot m)$. Another modification, not implemented here, that can reduce the computational complexity consists in avoiding the training of local SVMs on samples of one class only.

Moreover, FastLSVM can be very easily parallelized, differently from SVM for which parallelization, although possible [17,18], is a rather critical aspect; for FastLSVM it is sufficient that, every time the points for a model are retrieved, the training of the local SVM is performed on a different processor. In this way the
time complexity of FastLSVM can be further lowered to $O(|C| \cdot n \log n + |C| \cdot k^3 / n_{procs})$. It can be argued that some modern accurate SVM solvers, mainly based on decomposition strategies, can scale better than $O(n^3)$, approaching $O(n^2)$ for favourable C values. However, asymptotically, FastLSVM is faster than every SVM solver taking more than $|C| \cdot n \log n$ for training ($n \log n$ using cover trees), which also includes some approximate solvers.

Another advantage of FastLSVM over SVM is the space complexity. Since FastLSVM performs SVM training on small subregions (assuming a reasonably low $k$), there are no problems of fitting the kernel matrix into main memory. The overall required space is, in fact, $O(n + k^2)$, i.e. linear in $n$, which is much lower than the SVM space complexity of $O(n^2)$; the latter forces, for large datasets, the discarding of some kernel values, thus increasing SVM time complexity due to the need of recomputing them.
3 Empirical Evaluation
In this work we used LibSVM (version 2.85) [24] for SVM, enabling shrinking and caching, and our implementation of FastLSVM and kNNSVM, which uses LibSVM for training and prediction of the local SVMs and a simple brute-force implementation of kNN³. The experiments are carried out on an AMD Athlon 64 X2 Dual Core Processor 5000+, 2600 MHz, with 3.56 GB of RAM.

3.1 kNNSVM - FastLSVM Comparison
In order to understand whether FastLSVM is a good approximation of kNNSVM, we compared the two methods on the 13 small datasets of [6], using the SVM results as references. We present the 10-fold cross validation (CV) accuracies obtained with the three methods using the LIN, RBF, HPOL and IPOL kernels. The model selection is performed internally to each fold, minimizing the empirical risk with 10-fold CV, choosing $C \in \{1, 5, 10, 25, 50, 75, 100, 150, 300, 500\}$ and the $\sigma$ parameter of the RBF kernel among $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$; the degree of the polynomial kernels is bounded to 5. The dimension of the neighborhood for the kNNSVM classifier, i.e. $k$, is chosen among the first 5 odd natural numbers followed by the ones obtained with a base-2 exponential increment from 9 and the cardinality of the training set, namely in $\{1, 3, 5, 7, 9, 11, 15, 23, 39, 71, 135, 263, 519, |\text{training set}|\}$. The $k'$ parameter of FastLSVM is fixed to $1/4 \cdot k$. To assess the statistical significance of the differences between the 10-fold CV results of kNNSVM and FastLSVM with respect to SVM, we use the two-tailed paired t-test ($\alpha = 0.05$) on the two sets of fold accuracies.

The results are reported in Table 1. We can notice that the generalization accuracies are generally a little worse for FastLSVM than for kNNSVM, but the overall advantage over SVM is maintained. In fact, FastLSVM demonstrates 12
³ Faster implementations of FastLSVM and kNNSVM, called FaLK-SVM and FkNNSVM, are freely available in FaLKM-lib [9].
Table 1. Evaluation of FastLSVM generalization power with respect to kNNSVM. The table reports the 10-fold CV accuracies of SVM, kNNSVM and FastLSVM with LIN, RBF, HPOL and IPOL kernels on 13 small datasets. The statistically significant differences of kNNSVM and FastLSVM with respect to SVM (two-tailed paired t-test with α = 0.05) are in bold.

                 LIN kernel                 RBF kernel                 HPOL kernel                IPOL kernel
dataset     SVM  kNNSVM FastLSVM      SVM  kNNSVM FastLSVM      SVM  kNNSVM FastLSVM      SVM  kNNSVM FastLSVM
iris        0.97  0.95   0.97         0.97  0.97   0.96         0.95  0.96   0.95         0.96  0.96   0.97
wine        0.97  0.99   0.97         0.97  0.99   0.98         0.97  0.99   0.99         0.99  0.98   0.97
leukemia    0.95  0.71   0.95         0.95  0.93   0.93         0.93  0.93   0.88         0.93  0.93   0.93
liver       0.68  0.72   0.71         0.70  0.73   0.74         0.73  0.73   0.72         0.74  0.74   0.72
svmguide2   0.82  0.84   0.82         0.83  0.86   0.86         0.84  0.84   0.84         0.84  0.82   0.85
vehicle     0.80  0.85   0.84         0.85  0.85   0.86         0.85  0.84   0.85         0.86  0.86   0.86
vowel       0.84  0.99   0.98         0.99  1.00   1.00         0.99  1.00   1.00         1.00  1.00   1.00
breast      0.97  0.97   0.97         0.97  0.96   0.97         0.96  0.97   0.97         0.96  0.96   0.96
fourclass   0.77  1.00   0.81         1.00  1.00   1.00         1.00  1.00   1.00         1.00  1.00   1.00
glass       0.62  0.69   0.72         0.70  0.71   0.69         0.69  0.67   0.70         0.72  0.72   0.70
heart       0.83  0.83   0.82         0.82  0.82   0.82         0.83  0.82   0.81         0.82  0.82   0.82
ionosphere  0.87  0.94   0.89         0.91  0.93   0.93         0.88  0.93   0.94         0.93  0.91   0.92
sonar       0.78  0.89   0.88         0.88  0.89   0.88         0.85  0.90   0.89         0.89  0.88   0.88
cases in which the classification accuracies are significantly different (according to the t-test) from the SVM ones, and all 12 cases are in favour of FastLSVM, with no cases in which it is significantly worse than SVM. In total, there are 7 cases in which the significant improvements of kNNSVM over SVM are not maintained by the FastLSVM algorithm; this may be due to the choice of $k'$: in fact, a lower value of $k'$ guarantees much smaller differences between FastLSVM and kNNSVM. However, since our final objective is the application of the approach to large and very large problems, on which it is reasonable to hypothesize that locality can assume an even more important role, and since FastLSVM is still better than SVM, the empirical comparison between FastLSVM and kNNSVM on small datasets lets us conclude that FastLSVM is a good approximation of kNNSVM.

3.2 The 2SPIRAL Dataset
The 2SPIRAL artificial dataset is a recurrent artificial benchmark problem in machine learning and large margin classification (see for example [28,29]). Here the two classes of the problem are defined as follows:

$$x^{(1)}(\tau) = c \cdot \tau^{d} \cdot \sin(\tau), \qquad x^{(2)}(\tau) = c \cdot \tau^{d} \cdot \cos(\tau), \qquad d = 2.5, \; \tau \in [0, 10\pi],$$

using $c = 1/500$ for the first class ($y_i = +1$) and $c = -1/500$ for the second class ($y_i = -1$). The points are sampled with intervals of $\pi/5000$ on the $\tau$ parameter, obtaining 50000 points for each class. A Gaussian noise with zero mean and variance proportional to the distance between the point and the nearest internal twist is added on both dimensions. With this procedure we generated two different datasets of 100000 points each, for training and testing.
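A sketch of this data generation (ours; since the proportionality constant and the exact "distance to the nearest internal twist" are not specified in the text, the noise standard deviation is simply taken proportional to the radius here):

    import numpy as np

    def two_spirals(n_per_class=50000, d=2.5, noise_scale=0.05, seed=0):
        rng = np.random.default_rng(seed)
        tau = np.arange(n_per_class) * (np.pi / 5000.0)          # tau sampled over [0, 10*pi)
        X, y = [], []
        for c, label in ((1.0 / 500.0, +1), (-1.0 / 500.0, -1)):
            r = c * tau ** d
            pts = np.column_stack((r * np.sin(tau), r * np.cos(tau)))
            # simplified noise model: std proportional to the radius (our assumption)
            pts += rng.normal(scale=noise_scale * np.abs(r)[:, None] + 1e-12, size=pts.shape)
            X.append(pts)
            y.append(np.full(n_per_class, label))
        return np.vstack(X), np.concatenate(y)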
Table 2. Percentage accuracy and computational (in seconds) results for SVM and FastLSVM on the 2SPIRAL dataset. The parameters reported are the ones permitting the lowest empirical risk, found with 5-fold CV.

Method           k     k'    C      σ       valid. acc.  test. acc.  # of SVMs  training time (s)  testing time (s)
RBF-SVM          -     -     2^6    2^-10    81.30        81.39          1           6185               392
LIN-FastLSVM     250   62    2^10   -        88.37        88.46       3202            222               484
RBF-FastLSVM     1000  250   2^8    2^-5     88.48        88.43        853            165               492
IPOL-FastLSVM    500   125   2^10   -        88.32        88.41       1657            240              3415
We compare FastLSVM and SVM using LIN, RBF and IPOL (with degree 2) kernels. Since LIN and IPOL kernels can only build linear and quadratic decision functions in the input space, they cannot give satisfactory results for global SVM, and thus we do not lose generality in presenting the SVM results with the RBF kernel only. For model selection we adopt grid search with 5-fold CV. For both methods, C and the σ of the RBF kernel are chosen in $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$. It is possible that values higher than $2^{10}$ for C and lower than $2^{-10}$ for σ could give higher validation accuracy results, but the computational overhead of SVM becomes too high to be suitable in practice (e.g. RBF-SVM with $C = 2^{11}$, $\sigma = 2^{-11}$ requires more than 24 hours). For FastLSVM we fix $k' = k/4$ (intuitively a good compromise between accuracy and performance), while $k$ is chosen among {0.25%, 0.5%, 1%, 2%, 4%, 8%, 16%, 32%} of the training set size.

Table 2 shows the results obtained for SVM and FastLSVM. The results highlight that RBF-FastLSVM improves over RBF-SVM in test accuracy by 8.65% (in relative terms), LIN-FastLSVM by 8.69% and IPOL-FastLSVM by 8.63%. The improvements in classification accuracy are accompanied by a dramatic increase in computational performance for the training phase: while the time needed to compute the global SVM on the training set is more than 100 minutes (6185 seconds), the training of FastLSVM requires no more than 4 minutes. The best prediction time, instead, is achieved by SVM, although RBF-FastLSVM and LIN-FastLSVM give comparable performances; the prediction time of IPOL-FastLSVM is about an order of magnitude higher than RBF-SVM.

3.3 The CoverType Dataset
The binary CoverType dataset (retrieved from LibSVM homepage [24]) has 581012 samples with 54 features. We randomly chose 25000 samples for testing, the others for training. Smaller training sets (from 1000 to 500000 samples) are obtained randomly sub-sampling the training data. We apply SVM and FastLSVM with RBF kernel using the same model selection strategy of the 2SPIRAL problem stopping it without giving the best parameters, and thus without performing the classification, if at least one fold of 5 fold CV does not terminate within 6 hours. The accuracy and performance results at increasing training set sizes are reported in Table 3. SVM accuracies are lower than FastLSVM ones for every training set size. From a computational viewpoint, FastLSVM can train the
Table 3. Percentage accuracies and performances (in seconds) of SVM and FastLSVM on CoverType data

                        SVM                                    FastLSVM
n (×1000)   test acc.  train. time  test. time     test acc.  train. time  test. time
  1          74.32          0            2          74.77          0            5
  2          76.25          1            3          76.28          0           10
  3          77.83          1            6          77.84          1           14
  4          78.83          3            9          79.34          2           20
  5          80.19          4           10          80.35         10           26
  7.5        82.36         11           16          82.47         43           42
 10          83.48         34           19          83.64         10           29
 15          85.56        148           32          85.78         15           73
 20          86.45        138           32          86.69         20           97
 30          88.14        588           51          88.25         59          146
 40          89.44        933           64          89.48         78          193
 50          90.22       1814           71          90.32        103          238
 75          91.78       4862          128          91.84        391          361
100          92.81       7583          152          92.84        439          476
150            -            -            -          94.07        505          716
200            -            -            -          94.63        813          952
250            -            -            -          95.35       1196         1188
300            -            -            -          95.52       1663         1427
350            -            -            -          95.83       3573         1663
400            -            -            -          95.95       2879         1980
450            -            -            -          96.18       3600         2140
500            -            -            -          96.36       4431         2370
556            -            -            -          96.47       5436         2633
Table 4. Training times of LaSVM (from [12]), CVM (from [11]) and FastLSVM, normalized with the training times of LibSVM on 100000 samples taken from the corresponding works (the LibSVM 100000-sample training time is 10310s in [12], about 20000s in [11], and 7583s in this work)

number of samples           LaSVM   CVM   FastLSVM
100k                         0.70    1      0.06
521k (556k for FastLSVM)     9.40    1.5    0.72
model on the whole training set faster than SVM on 100000 samples (less than 1/5 of the data); moreover, starting from n = 30000, FastLSVM training is at least one order of magnitude faster than SVM, and the difference becomes more and more relevant as n increases. It is important to underline that the whole dataset permits a much higher classification accuracy than the random sub-sampled sets, so it is highly desirable to consider all the data. Since we implemented FastLSVM without supporting data structures for nearest neighbors, we must compute, for all test points, the distances to all the training points, leading to a rather high testing time.

The binary CoverType dataset allows us to make some comparison with state-of-the-art SVM optimization approaches: CVM [11] and LaSVM [12]. CVM and LaSVM have been tested on different hardware systems, but since both have been compared with LibSVM on a reduced training set of 100000 samples, a fair comparison is possible by normalizing all training times by the training time of LibSVM on 100000 samples reported in the same work⁴. Table 4 reports the comparison, and it is clear that FastLSVM is considerably faster than LaSVM and CVM, both on the reduced training set of 100000 samples and on the complete dataset of more than 500000 samples. From the generalization accuracy viewpoint, taking as reference the 100000-sample training set, we can notice that FastLSVM is more accurate than LibSVM, whereas LaSVM is less accurate than LibSVM (see [12]) and CVM seems to be as accurate as LibSVM (see [11]).

⁴ The LibSVM version used here is more recent than the versions used in [11] and [12] and, since the latest version is the fastest, the comparison can be a little penalizing for FastLSVM.
4 Conclusions
Starting from the kNNSVM classifier, we presented FastLSVM, which is scalable to large datasets and maintains the advantages in terms of classification accuracy of the original formulation of Local SVM for non high-dimensional data. Differently from kNNSVM, FastLSVM precomputes the local models in the training set, trying to minimize the number of SVMs that need to be built by assigning each model not only to its central point but to its k' most central samples. Furthermore, the training of the local models can be very easily parallelized. The prediction is performed by applying to the query point the SVM model to which its nearest neighbor in the training set has been assigned.

FastLSVM was empirically demonstrated to be a good approximation of kNNSVM and to substantially overcome SVM both in terms of classification accuracy and training time performance on an artificial large dataset. On the real large dataset called CoverType, our experiments highlight that FastLSVM is not only faster and more accurate than standard SVM solvers, but has substantial advantages over state-of-the-art approximated fast SVM solvers. Moreover, we discussed some improvements to the method, such as kNN supporting data structures, which can further increase the training and testing performance of FastLSVM considerably. The source code of FastLSVM with these improvements is freely available in FaLKM-lib [9].
References

1. Blanzieri, E., Melgani, F.: An adaptive SVM nearest neighbor classifier for remotely sensed imagery. In: IEEE Int. Conf. on Geoscience and Remote Sensing Symposium (IGARSS 2006), pp. 3931–3934 (2006)
2. Bottou, L., Vapnik, V.: Local learning algorithms. Neural Computation 4(6), 888–900 (1992)
3. Vapnik, V.N., Bottou, L.: Local algorithms for pattern recognition and dependencies estimation. Neural Computation 5(6), 893–909 (1993)
4. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (2000)
5. Blanzieri, E., Melgani, F.: Nearest neighbor classification of remote sensing images with the maximal margin principle. IEEE Transactions on Geoscience and Remote Sensing 46(6), 1804–1811 (2008)
6. Segata, N., Blanzieri, E.: Empirical assessment of classification accuracy of Local SVM. In: Proc. of Benelearn, pp. 47–55 (2009)
7. Brailovsky, V.L., Barzilay, O., Shahave, R.: On global, local, mixed and neighborhood kernels for support vector machines. Pattern Recognition Letters 20(11-13), 1183–1190 (1999)
8. Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: Proc. of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2126–2136 (2006)
9. Segata, N.: FaLKM-lib v1.0: a Library for Fast Local Kernel Machines. Technical report, number DISI-09-025. DISI, University of Trento, Italy (2009), http://disi.unitn.it/~segata/FaLKM-lib
10. Cheng, H., Tan, P.N., Jin, R.: Localized Support Vector Machine and Its Efficient Algorithm. In: Proc. SIAM Intl. Conf. Data Mining (2007)
11. Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core Vector Machines: Fast SVM Training on Very Large Data Sets. The Journal of Machine Learning Research 6, 363–392 (2005)
12. Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with online and active learning. Journal of Machine Learning Research 6, 1579–1619 (2005)
13. Collobert, R., Bengio, S., Bengio, Y.: A parallel mixture of SVMs for very large scale problems. Neural Computation 14(5), 1105–1114 (2002)
14. Collobert, R., Bengio, Y., Bengio, S.: Scaling Large Learning Problems with Hard Parallel Mixtures. International Journal of Pattern Recognition and Artificial Intelligence 17(3), 349–365 (2003)
15. Yu, H., Yang, J., Han, J., Li, X.: Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing. Data Mining and Knowledge Discovery 11(3), 295–321 (2005)
16. Dong, M., Wu, J.: Localized Support Vector Machines for Classification. In: International Joint Conference on Neural Networks, IJCNN 2006, pp. 799–805 (2006)
17. Zanni, L., Serafini, T., Zanghirati, G.: Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems. The Journal of Machine Learning Research 7, 1467–1492 (2006)
18. Dong, J.X., Krzyzak, A., Suen, C.Y.: Fast SVM training algorithm with decomposition on very large data sets. IEEE Transaction on Pattern Analysis Machine Intelligence 27(4), 603–618 (2005)
19. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 217–226. ACM, New York (2006)
20. Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A Dual Coordinate Descent Method for Large-scale Linear SVM. In: Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML) (2008)
21. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
22. Schölkopf, B., Smola, A.J.: Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge (2002)
23. Bottou, L., Lin, C.J.: Support Vector Machine Solvers. Large-Scale Kernel Machines (2007)
24. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
25. Chang, Q., Chen, Q., Wang, X.: Scaling gaussian rbf kernel width to improve svm classification. In: International Conference on Neural Networks and Brain, ICNN&B 2005, October 13-15, vol. 1, pp. 19–22 (2005)
26. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Computing Surveys (CSUR) 33(3), 273–321 (2001)
27. Beygelzimer, A., Kakade, S., Langford, J.: Cover Trees for Nearest Neighbor. In: Proceedings of the 23rd International Conference on Machine learning, Pittsburgh, PA, pp. 97–104 (2006)
28. Ridella, S., Rovetta, S., Zunino, R.: Circular backpropagation networks for classification. IEEE Transactions on Neural Networks 8(1), 84–97 (1997)
29. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9(3), 293–300 (1999)
The Effect of Domain Knowledge on Rule Extraction from Support Vector Machines

Nahla Barakat¹ and Andrew P. Bradley²
¹ German University of Technology in Oman
² School of Information Technology and Electrical Engineering (ITEE), The University of Queensland, St Lucia, QLD 4072, Australia
[email protected], [email protected]
Abstract. Prior knowledge about a problem domain can be utilized to bias Support Vector Machines (SVMs) towards learning better hypothesis functions. To this end, a number of methods have been proposed that demonstrate improved generalization performance after the application of domain knowledge; especially in the case of scarce training data. In this paper, we propose an extension to the virtual support vectors (VSVs) technique where only a subset of the support vectors (SVs) is utilized. Unlike previous methods, the purpose here is to compensate for noise and uncertainty in the training data. Furthermore, we investigate the effect of domain knowledge not only on the quality of the SVM model, but also on rules extracted from it; hence the learned pattern by the SVM. Results on five benchmark and one real life data sets show that domain knowledge can significantly improve both the quality of the SVM and the rules extracted from it. Keywords: Data mining, Machine learning, Domain knowledge utilization, Rule learning, Support Vector Machines.
1 Introduction

The idea of utilizing domain knowledge arises from the need for a "good" model (classifier) to have both a logical justification (analytical learning) and a statistical justification (inductive learning) [1]. In the context of Support Vector Machines, there have been two main streams of research for utilizing domain knowledge. The first incorporates prior knowledge by modifying the SVM kernel to cater for domain knowledge, e.g., [2] and [3]. However, modifying the SVM kernel often requires domain specific SVM algorithms, which may not always be a feasible approach. The other stream utilizes domain knowledge by creating virtual examples from the available training data, which are then used together with the original data set to train another SVM [2]. Creating virtual examples from the SVM SVs in this way is called the virtual support vectors (VSVs) method [2]. Both streams of research build on the idea that the function being learned should be invariant with respect to certain transformations. These transformations have been extensively applied in image classification problems, where transformations such
as rotation, scaling and translation, which are known to keep the learned function unchanged, have been adopted. However, the main motivation behind the majority of studies into domain knowledge utilization has been the problem of a scarcity of training examples. Even though there have been a number of studies into rule extraction from SVMs in order to provide an explanation of the (black-box) SVM model, e.g., [4-6]; to date, there have been no studies investigating the additional benefit of utilizing domain knowledge on the quality of the extracted rules. Therefore, in this paper we: 1. Propose a new method for utilizing domain knowledge in SVM training; 2. Investigate the effect this domain knowledge has on the quality of the rules extracted from the SVMs. The main idea behind the proposed method for domain knowledge utilization is to create a set of positive VSVs by analyzing a specific subset of the SVM SVs. In particular, we target the specific case of potentially mislabeled positive examples that were incorrectly given a negative target class. That is, we specifically acknowledge that the ground truth may be imperfect and therefore the correct class is more likely to be positive. This type of “noisy” example would be an outlier in the data set and so would, most probably, appear in the SVM model as a SV. Therefore, if reclassified by the SVM model it would be apparently misclassified as positive. In this case, domain knowledge is then used to confirm the correct class label, and if it is found to be positive, a VSV is created from this example with the correct (positive) class label. In this way, these VSVs, in addition to the original training set, are used to retrain the SVM. The main contribution of the proposed approach here is that it tries to compensate for noise or uncertainty in the class labels of the training data; unlike other methods of domain knowledge utilization which were proposed to overcome the scarcity of training examples. Results are presented on five benchmark data sets and real life data set showing that the utilization of domain knowledge can significantly improve both the quality of the SVM model and the extracted rules. The paper is organized as follows: Section 2 provides a brief background to SVMs; Section 3 outlines the (eclectic) method of rule extraction utilized; Section 4 describes the proposed method for domain knowledge utilization; Section 5 summarizes the experimental methodology; Section 6 provides a summary and discussion of the results; and Section 7 offers a number of conclusions from the work.
2 Support Vector Machines

SVMs are based on the principle of structural risk minimization. They operate by finding a linear hyper-plane that separates the positive and negative examples with a maximum interclass distance or margin. In the case of non-separable data, a soft margin hyper-plane is defined to allow errors ξi (slack variables) in classification. Hence, the optimization problem is formulated as follows [7]:
$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \qquad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$

where C is a regularization parameter which defines the trade-off between the training errors and the margin [7]. In the case of non-linearly separable data, SVMs map the input data to a higher dimensional feature space using kernel functions. Introducing the Lagrange multipliers $\alpha_i$, the dual optimization problem is modified as follows:

$$\max_{\alpha} \;\; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{l} \alpha_i y_i \alpha_j y_j K(x_i, x_j) \qquad \text{subject to} \quad C \ge \alpha_i \ge 0 \;\; \forall i, \quad \sum_{i=1}^{l} \alpha_i y_i = 0.$$
Solving for α, the training examples with non-zero α are called support vectors (SVs), and the hyper-plane is completely defined by the SVs only [7].
3 Eclectic Rule Extraction from SVMs

In this paper, we use the eclectic rule extraction approach described in [4]. This approach uses a labeled data set to train an SVM to a level of acceptable accuracy, precision and recall. Next, a synthetic data set, composed of the training examples that became support vectors, is constructed with the target class for these examples replaced by the class predicted by the SVM. Rules representing the concepts learned by the SVM are then extracted from this synthetic data set using the C5 decision tree learner [8].
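A minimal sketch of this pipeline (ours; scikit-learn's SVC and DecisionTreeClassifier are substituted for the SVM implementation and the C5 learner used by the authors):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier, export_text

    def eclectic_rules(X_train, y_train, feature_names, gamma=0.1, C=10.0):
        svm = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_train, y_train)
        X_sv = svm.support_vectors_          # synthetic data set: the support vectors ...
        y_sv = svm.predict(X_sv)             # ... relabeled with the classes predicted by the SVM
        tree = DecisionTreeClassifier(max_depth=4).fit(X_sv, y_sv)
        return svm, export_text(tree, feature_names=feature_names)   # rules as if-then paths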
4 The Proposed Method

The proposed approach builds on the idea of VSVs [2] and borrows concepts from analytical learning, where only positive examples are analyzed [1]. The method, which we call Positive Virtual Support Vectors (PVSV), utilizes domain knowledge in the form of a positive virtual data set. This PVSV data set is an explicit input, which is used together with the original training set for SVM training. The VSVs are created from false positive SVs (FPSVs) only (as classified by the SVM model), hence the method does not significantly increase the training set size. The main steps of PVSV are as follows:

1. From a trained SVM, extract the model SVs;
2. Classify the extracted SVs with the trained SVM;
3. Extract rules from the SVs in step 2 using the eclectic approach described in [4];
4. Using domain knowledge, analyze all FPSVs and decide if their class label is correct. If the label is incorrect, i.e., the true class is positive, create a PVSV with the same feature values as the FPSV, but with a positive class label;
5. Train another SVM on the PVSVs plus the original training set;
6. Again, extract rules from this new SVM model using the eclectic approach.

It should be noted that the extension of this approach to false negative SVs (FNSVs) would be neither straightforward nor desirable. The creation of negative VSVs that represent domain knowledge would have to consider all rules for the negative class. This requires much broader domain knowledge and would significantly increase the number of VSVs.
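As an illustration only, the following sketch mirrors steps 1-6 above for a binary problem with labels in {-1, +1}. The function domain_oracle (which returns the corrected label of a candidate example) and the use of scikit-learn's SVC are our assumptions; the authors' implementation is based on SVMlight.

import numpy as np
from sklearn.svm import SVC

def train_with_pvsv(X, y, domain_oracle, n_copies=1, **svm_params):
    svm = SVC(**svm_params).fit(X, y)
    # Steps 1-2: extract the model SVs and reclassify them with the trained SVM.
    sv_idx = svm.support_
    sv_X, sv_y = X[sv_idx], y[sv_idx]
    sv_pred = svm.predict(sv_X)
    # Step 4: false positive SVs are labelled -1 but predicted +1;
    # keep only those whose label the domain knowledge corrects to +1.
    fp_mask = (sv_y == -1) & (sv_pred == 1)
    pvsv = [x for x in sv_X[fp_mask] if domain_oracle(x) == 1]
    if not pvsv:
        return svm
    pvsv = np.tile(np.array(pvsv), (n_copies, 1))     # optional duplicate copies
    X_aug = np.vstack([X, pvsv])
    y_aug = np.concatenate([y, np.ones(len(pvsv))])
    # Step 5: retrain the SVM on the PVSVs plus the original training set.
    return SVC(**svm_params).fit(X_aug, y_aug)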
5 Experimental Methodology

Experiments were performed using five benchmark data sets as follows:
─ US Postal Service (USPS) database [2]: 9298 handwritten digits. Each digit is a 16×16 image, represented as a 256-dimensional vector (with values normalized between −1 and 1) and a class label. The data set has 10 classes (0-9), so 10 binary SVM classifiers were constructed, each separating one digit from the other nine. No additional machine-printed characters were used. This data set was selected to allow a direct comparison with other VSV methods [2]. It is worth mentioning that the VSV method proposed in [2] obtained its best results using only the original USPS data set, without any additional machine-printed characters;
─ Pima Indians diabetes [9]: a sample of 438 examples was used, after removing all examples with a zero value for the attributes 2-hour OGTT plasma glucose, diastolic blood pressure and triceps skin fold thickness, which are clinically insignificant;
─ Heart disease [9]: the reduced Cleveland heart disease data set was used. All examples with missing values were discarded;
─ Breast cancer [9]: the Wisconsin breast cancer data set was used. All repeated examples were discarded to avoid the bias resulting from the effect of those examples;
─ Thyroid [9]: the experiment was executed as a binary classification task (i.e., normal vs. abnormal).

As shown in Table 1, each of the five data sets was randomly split into disjoint training and test sets. SVMlight was used in all experiments [10].

Table 1. Data Sets
Data set        Features  Training  Testing
USPS            256       7291      2007
Pima Indians    8         247       191
Breast cancer   9         208       208
Heart Disease   13        223       74
Thyroid         21        3428      3772
Since there were no domain experts available to provide “real” domain knowledge, the correct class label for FPSVs was identified for each data set using a
committee voting concept. Voting was performed by two different decision tree learners [8],[11], trained on the same training data as the SVM. If the two classifiers agreed that the class label should be positive, then the correct class was considered to be positive and a PVSV was created. Three sets of experiments were conducted with different numbers of PVSVs. In the first set of experiments, one PVSV was created for each FPSV. In the second set, a duplicate of each PVSV was added for each FPSV, while in the third set three copies of each PVSV were used. The aim was to observe the effect of adding more weight to the PVSVs. It should be noted that, even in the case of using three PVSVs, we added significantly fewer VSVs than the method described in [2]. In their approach, four VSVs were added for each model SV, one VSV for each one-pixel shift of the images in the four principal directions (up, down, left and right). In addition to the benchmark data sets, and to verify the validity and usefulness of the proposed approach for real-life problems, the proposed approach was also applied to a real-life data set for the diagnosis of diabetes mellitus. The input features are: Age, Sex, Family history of diabetes, Body mass index (BMI), Waist circumference, Hip circumference, Systolic blood pressure, Diastolic blood pressure (BPDIAS), Cholesterol and Fasting Blood Sugar (FBS). These features are then used to predict the two-hour post-glucose load (specifically, an oral glucose tolerance test (OGTT)), with class labels -1 for OGTT < 200 and 1 for OGTT >= 200. All attributes are continuously valued, except for sex and family history of diabetes, which are binary valued. Two independent data sets were used for training and testing, respectively. It should be noted that, unlike the benchmark data sets, the FPSVs were analyzed by relevant domain experts, who decided their correct class labels. The cutoff value used for diabetes diagnosis is FBS >= 110. For all the data sets, the accuracy of the SVM on the test set was measured both before and after the utilization of domain knowledge. Furthermore, to investigate the effect of domain knowledge on rule quality, rules were extracted from the SVMs both before and after the addition of domain knowledge, and their performance on the test sets was investigated.
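A possible reading of this committee-voting step is sketched below, with two differently configured decision trees standing in for the C5 [8] and CART [11] learners; the criterion settings are assumptions made only for illustration.

from sklearn.tree import DecisionTreeClassifier

def committee_confirms_positive(X_train, y_train, fpsv_candidates):
    committee = [
        DecisionTreeClassifier(criterion="entropy"),  # entropy-based, C5-like
        DecisionTreeClassifier(criterion="gini"),     # Gini-based, CART-like
    ]
    for clf in committee:
        clf.fit(X_train, y_train)
    votes = [clf.predict(fpsv_candidates) for clf in committee]
    # An FPSV is relabelled positive only if both committee members agree.
    return (votes[0] == 1) & (votes[1] == 1)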
6 Results and Discussion

6.1 Benchmark Data Sets Results

On the USPS data set, the only SVM with FPSVs was the one constructed for the digit 9. In this SVM, 18 FPSVs were found in total, of which 13 were found to belong to the positive class according to the voting committee classifiers. Therefore, for the USPS data set, 13 positive VSVs (PVSVs) were added to the digit 9 classifier and to each of the other nine classifiers (0-8), but with a negative class label. This process was then repeated with 26 (two duplicates) and 39 (three duplicates) PVSVs. On the USPS data set, the best results were obtained with 39 PVSVs, not only for the digit 9 SVM but for the other digits as well. Table 2 shows the SVM classification error on the USPS test set compared to the best results obtained on this data set using the VSV method reported in [2]. From this table, it can be seen
that the classification error rate improved significantly for all ten classifiers, even though only one classifier had any FPSVs. It would appear that the method in [2] obtained better absolute results because of its use of a higher-degree polynomial kernel (three compared to our two) and the introduction of a much larger virtual training data set (and hence a far greater number of SVs). We used the regularization parameter C = 10 for training, both before and after adding the VSVs. This value was found to be optimal in both cases for this data set, and is also the value used in [2].

Table 2. SVM error on USPS test set
Method     Error before  Error after  Average no. of SVs after
PVSVs      4.7%          3.7%         423
VSVs [2]   4.0%          3.2%         686
Figure 1 shows the improvement in SVM classification accuracy on the test sets for the remaining four benchmark data sets after the application of domain knowledge. This improvement is consistent, being independent of both the class priors and the size of the training set. Only the Heart disease data set shows an inconsistent trend as additional PVSVs are included.
Fig. 1. SVM accuracy on test sets
Figure 2 illustrates that the accuracy of the extracted rules also improved after adding domain knowledge, for all data sets. Again, only Heart disease shows an inconsistent improvement in accuracy as additional PVSVs are included in the training set.
Fig. 2. Rules accuracy on test sets
Given the improved performance of both the SVM and the extracted rule sets, we decided to investigate whether SVM learning (as defined by the extracted rules) has been explicitly biased towards the domain knowledge used. Therefore, we explicitly compared the rules extracted from the SVM before and after the application of domain knowledge. For illustrative purposes, we show the rules representing domain knowledge for the positive class for the Breast cancer and Heart disease data sets only, followed by the rules extracted before and after domain knowledge application, respectively.

Positive Class Domain Knowledge (common rules for the positive class learned by the voting committee [8],[11]):
Breast cancer:
1- If uniofcellshape >= 5 Then -> 1
Heart Disease:
1- If thal > 3 Then -> 1

Rules before (Breast cancer):
3- If uniofcellshape > 3 and blandChromatin > 1 and blandChromatin =< 4 and singleEpiCellSize =< 4 Then -> 1 (20, 0.85)
4- If uniofcellshape > 3 and blandChromatin > 1 and blandChromatin =< 4 and singleEpiCellSize > 5 Then -> 1 (13, 0.692)
5- If uniofcellshape > 3 and blandChromatin > 4 Then -> 1 (70, 0.929)
Rules after (Breast cancer):
3- If uniofcellshape > 3 Then -> 1 (132, 0.871)
Note: Rules 1 and 2 remained unchanged both before and after adding domain knowledge.

Rules before (Heart disease):
1- If thal > 3 Then -> 1 (44, 0.935)
2- If thal <= 3 Then -> 1 (43, 0.733)

Rules after (Heart disease):
1- If thal > 3 Then -> 1 (46, 0.979)
2- If sex > 0 Then -> 1 (72, 0.743)

It can be seen from these rules that SVM learning has been explicitly biased towards the domain knowledge rules. This is particularly evident in the Heart disease data set, where there were two contradicting rules for the positive class, which are then corrected by the domain knowledge rule. It can also be seen that a more concise rule set is extracted for the Breast cancer data set after domain knowledge utilization (3 rules after compared to 5 rules before).

6.2 Real-Life Diabetes Data Set Results

The SVM leave-one-out estimates for precision, recall and error rate are shown in Figures 3 and 4, respectively. From Figure 3, it can be seen that the recall increases with the addition of domain knowledge, which is expected as the number of positive training examples is increased. In fact, this increase in recall is also associated with an increase in precision, and a decrease in the error rate, as shown in Figures 3 and 4.
Fig. 3. SVM precision & recall before and after applying domain knowledge
Fig. 4. SVM error rate before and after applying domain knowledge
Considering the quality of the extracted rules, the situation here is similar to the benchmark data set results summarized in Section 6.1. Again, it is evident that the utilization of domain knowledge was also beneficial for the quality of the extracted rules. The improved performance of the rules on the test set in terms of fidelity and accuracy can be seen in Figure 5.
Fig. 5. Rules accuracy and fidelity before and after applying domain knowledge
The following are the rules extracted before and after applying domain knowledge:

Rules before
1. If FBS > 124.2 Then Diabetic (29, 1.0)
2. If FBS > 90 and FBS =< 124.2 and WAIST > 84 and BPDIAS > 70 Then Diabetic (24, 0.833)

Rules after
1. If BMI > 26.3 and BPDIAS > 90 Then -> Diabetic (20.7, 0.956)
2. If BMI > 26.3 and WAIST > 90 and FBS > 90 Then -> Diabetic (50.3, 0.903)
3. If FBS > 106.2 Then -> Diabetic (51.4, 0.834)
7 Conclusions

In this paper we have proposed a new method for domain knowledge utilization in SVMs, called positive virtual support vectors (PVSV). The method utilizes FPSVs to create a set of positive VSVs which are then used as an adjunct to the original training data. A voting committee is adopted to re-label potentially misclassified positive support vectors for the UCI benchmark data sets. For the real-life diabetes data set, the correct class labels for the FPSVs were decided by relevant domain experts. Results on both benchmark and real-life data sets demonstrate that the PVSV method is effective at enhancing the performance of both SVM models and the rules extracted from them. However, the significance of the improvements obtained by the proposed approach is domain dependent, and also depends on the amount of mislabeled examples in the training data.
References
1. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
2. Decoste, D., Schölkopf, B.: Training Invariant Support Vector Machines. Machine Learning 46, 161–197 (2002)
3. Pozdnoukhov, A., Bengio, S.: Tangent Vector Kernels for Invariant Image Classification with SVMs. In: 17th International Conference on Pattern Recognition (ICPR 2004), pp. 486–489. IEEE Press, New York (2004)
4. Barakat, N., Bradley, A.P.: Rule Extraction from Support Vector Machines: Measuring the Explanation Capability Using the Area under the ROC Curve. In: 18th International Conference on Pattern Recognition (ICPR 2006), pp. 812–815. IEEE Press, New York (2006)
5. Núñez, H., Angulo, C., Catala, A.: Rule-Extraction from Support Vector Machines. In: European Symposium on Artificial Neural Networks, pp. 107–112 (2002)
6. Barakat, N., Bradley, A.P.: Rule Extraction from Support Vector Machines: A Sequential Covering Approach. IEEE Transactions on Knowledge and Data Engineering 19, 729–741 (2007)
7. Burges, C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. Kluwer Academic Publishers, Dordrecht (1998)
8. Data Mining Tools See5 and C5.0, RuleQuest data mining tools, http://www.rulequest.com
9. Merz, C., Murphy, P.: UCI Machine Learning Repository, Irvine, http://www.ics.uci.edu
10. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999), http://svmlight.joachims.org
11. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks, Monterey (1984)
Towards B-Coloring of SOM

Haytham Elghazel and Khalid Benabdeslem

University of Lyon, F-69622 Lyon, France; University of Lyon 1, Villeurbanne, LIESP, EA4125
{elghazel,kbenabde}@bat710.univ-lyon1.fr
Abstract. The Self-Organizing Map (SOM) is one of the most popular neural network methods. It is a powerful tool for the visualization and analysis of high-dimensional data in various application domains such as Web analysis, information retrieval, and many other domains. The SOM maps the data onto a low-dimensional grid, which is generally followed by a clustering step over the referent vectors (neurons or units). Different clustering approaches for SOM are considered in the literature; in particular, the use of hierarchical clustering and traditional k-means clustering has been investigated. However, these approaches do not consider the topological organization provided by SOM. In this paper, we propose BcSOM, an extension of a recently proposed graph b-coloring clustering approach, for clustering the self-organizing map. It exhibits more desirable clustering features and makes it possible to build a fine partition of the referents by incorporating the neighborhood relations provided by SOM. The proposed approach is evaluated on benchmark data sets and its effectiveness is confirmed. Keywords: Self-organizing map, clustering, graph b-coloring, topological organization.
1 Introduction
Clustering is a fundamental data mining process that aims to group observations into classes or categories, such that all objects in the same group are similar to each other, while dissimilar from objects in other groups. Clustering plays an important role in data mining applications such as Web analysis, information retrieval, medical diagnosis, and many other domains. Clustering algorithms are generally based on two popular paradigms, namely partitioning and hierarchical clustering. While hierarchical algorithms build clusters gradually and then give a cluster hierarchy whose leaves are the instances and whose internal nodes represent nested clusters of various sizes, partitioning algorithms learn clusters directly. Many other clustering techniques are developed primarily for specific problems and/or specific data sets. These are density-based approaches, grid-based approaches, and also model-based approaches. Jain et al. [1] survey many of these efforts. In this paper, we consider the problem of clustering using the self-organizing map (SOM) [2]. This technique is a prominent tool for high-dimensional data analysis since it provides a substantial data reduction that can be used to visualize and
explore properties of the data. The clustering of these data is thus obtained using a two-level strategy, where the data set is first clustered using the SOM, and then the SOM itself is clustered. Several authors have investigated ways to cluster SOM. This step is generally realized using traditional k-means or hierarchical clustering [3]. However, these approaches do not take into account the topological neighborhood relations offered by SOM. Recently, we have proposed a new clustering framework [4,5] based on the concept of graph b-coloring [6]. This technique consists in the assignment of colors (clusters) to the vertices of the graph such that (i) no two adjacent vertices have the same color (proper coloring), and (ii) for each color there exists at least one dominating vertex which is adjacent to all the other colors. This specific vertex reflects the properties of the class and also guarantees that the class has a distinct separation from all other classes of the partitioning. The b-coloring based clustering framework in [4,5] makes it possible to build a fine partition of the data set (numeric or symbolic) when the number of clusters is not specified in advance. Such a partition possesses several properties that are desirable for clustering. In the sequel, we propose an extension of our b-coloring based clustering approach for clustering the self-organizing map. The purpose of this approach is to simultaneously use dissimilarities and topological relations (provided by SOM) between objects (i.e. referents) to find a cluster partition of the self-organizing map where internal cluster cohesion and separation among clusters are simultaneously effective. Experimental results on benchmark data sets demonstrate that the proposed clustering algorithm is able to cluster data in a better way than classical clustering algorithms for SOM, and indicate the effectiveness of SOM in offering real benefits (runtime) for the original b-coloring based clustering approach. The rest of the paper is organized as follows: topological clustering using the self-organizing map is discussed in Section 2. In Section 3, we briefly describe the b-coloring based clustering framework. Section 4 is devoted to the modified graph b-coloring approach for clustering SOM, which we will refer to as BcSOM (for B-coloring of Self-Organizing Map). Some experiments using relevant benchmark data sets are presented in Section 5. This section includes the evaluation method and the experimental results. Section 6 summarizes our contribution with useful propositions of extensions.
2 Topological Clustering Using Self-Organizing Map
SOM is nowadays used in numerous domains and has been successfully applied in numerous applications. It is a very popular tool for visualizing high-dimensional data spaces. SOM can be considered as doing vector quantization and/or clustering while preserving the spatial ordering of the input data, reflected by implementing an ordering of the codebook vectors (also called prototype vectors, cluster centroids or referent vectors) in a one- or two-dimensional output space. The SOM consists of neurons organized on a regular low-dimensional grid, called the map. More formally, the map is described by a graph (V, E). V is a set of m interconnected neurons having a discrete topology defined by E. For each
Fig. 1. Two dimensional topological map with 1-neighborhood of a neuron c. Rectangular (red) with 8 neighbors and diamond (blue) with 4 neighbors.
pair of neurons (c, r) on the map, the distance δ(c, r) is defined as the shortest path between c and r on the graph. This distance imposes a neighborhood relation between neurons (Fig. 1). Each neuron c is represented by a p-dimensional referent vector w_c = (w_c^1, ..., w_c^p) from W (the set of all the map's neurons), where p is equal to the dimension of the input vectors. The number of neurons may vary from a few dozen to several thousand depending on the application. The SOM training algorithm resembles k-means [7]. The important distinction is that, in addition to the best matching referent vector, its neighbors on the map are updated: the region around the best matching vector is stretched towards the training sample presented. The end result is that the neurons on the grid become ordered: neighboring neurons have similar referent vectors. The SOM takes as its input a set of labeled sample vectors and gives as output an array of neurons with the input vector labels attached to these neurons. Let n be the number of sample vectors z_i ∈ R^p, i = 1, 2, ..., n, where each sample vector z_i is identified by a label. The batch version of SOM is an iterative algorithm in which the whole data set (noted Ω) is presented to the map before any adjustments are made. In each training step, the data set is partitioned according to the Voronoi regions of the map referent vectors. More formally, we define an affectation function f from R^p (the input space) to C, that associates each element z_i of R^p with the neuron whose referent vector is "closest" to z_i (for the Euclidean distance). This function induces a partition P = {P_c; c = 1...m} of the set of individuals, where each part P_c is defined by P_c = {z_i ∈ Ω; f(z_i) = c}. This is the affectation step. After affectation, a representation step is performed. The algorithm updates the referent vectors by minimizing a cost function, noted E(f, W). This function has to take into account the inertia of the partition P, while ensuring the topology-preserving property. To achieve these two goals, it is necessary to generalize the inertia function of P by introducing the neighborhood notion attached to the map. In the case of individuals belonging to R^p, this minimization can be done in a straightforward way. Indeed, the new referent vectors are calculated as:
w_{r}^{t+1} = \frac{\sum_{i=1}^{n} h_{rc}(t)\, z_{i}}{\sum_{i=1}^{n} h_{rc}(t)}   (1)
where c = arg min_r ||z_i − w_r|| is the index of the best matching unit of the data sample z_i, ||·|| is the distance measure (typically the Euclidean distance), and t denotes the time. h_rc(t) is the neighborhood kernel around the winner unit c. This function is a non-increasing function of time and of the distance of unit r from the winner unit c. The new referent vector is a weighted average of the data samples, where the weight of each data sample is the neighborhood function value h_rc(t) at its winner c. In the batch version of the k-means algorithm, the new referent vectors are simply averages of the Voronoi data sets.
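A compact NumPy sketch of one batch training step as described by Eq. (1) is given below; grid_dist (the map distances δ between units) and sigma are assumed inputs and the kernel is taken to be Gaussian, so this is only an illustration, not Kohonen's reference implementation.

import numpy as np

def batch_som_step(Z, W, grid_dist, sigma):
    # Z: (n, p) data samples, W: (m, p) referent vectors, grid_dist: (m, m) map distances.
    # Affectation step: index of the best matching unit for every sample.
    d = ((Z[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)       # (n, m)
    bmu = d.argmin(axis=1)                                        # winner c for each z_i
    # Representation step: neighborhood kernel h_rc(t) and weighted averages (Eq. 1).
    h = np.exp(-grid_dist[:, bmu] ** 2 / (2 * sigma ** 2))        # (m, n), h[r, i] = h_{r c(i)}
    return (h @ Z) / h.sum(axis=1, keepdims=True)                 # new referent vectors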
3 Clustering Using Graph B-Coloring
In this section, we provide some background on the b-coloring based clustering framework that was recently introduced in [4,5]. Consider the data to be clustered, W = {w_1, w_2, ..., w_m} (in our case w_i is a p-dimensional referent vector corresponding to the SOM neuron i, w_i = (w_i^1, w_i^2, ..., w_i^p)), as an undirected complete edge-weighted graph G(V, E), where V = {v_1, v_2, ..., v_m} is the vertex set and E = V × V is the edge set. Vertices in G correspond to data objects (i.e. SOM neurons, v_i for w_i) and edge weights reflect the dissimilarity between pairs of linked vertices. The graph G is traditionally represented by the corresponding weighted dissimilarity matrix, which is the m × m symmetric matrix D = {d_{i,j} | v_i, v_j ∈ V}. A widely adopted definition of optimal clustering is a partitioning that minimizes dissimilarities within and maximizes dissimilarities between clusters. These two conditions amount to saying that edges between two vertices within one cluster should have small weights (denoting high similarity), and those between vertices from two clusters should have large weights (weak similarity). The clustering problem is hence formulated as a graph b-coloring problem. The b-coloring of such a complete graph is not interesting for the clustering problem: the trivial partition is returned, where each cluster (color) contains one and only one instance (vertex). Consequently, our clustering approach requires the construction of a superior threshold graph, which is a partial graph of the original one G(V, E). Let G_{>θ}(V, E_{>θ}) be the superior threshold graph associated with a threshold value θ chosen from the dissimilarity table D. In other words, G_{>θ} is given by V = {v_1, v_2, ..., v_m} as vertex set and {(v_i, v_j) | D(w_i, w_j) = d_{i,j} > θ} as edge set. The data to be clustered are now depicted by a non-complete edge-weighted graph G(V, E_{>θ}). The goal is to divide the vertex set V into a partition P_k = {C_1, C_2, ..., C_k}, where C_i ∩ C_j = ∅ for all C_i, C_j ∈ P_k with i ≠ j (the number of clusters k is not predefined). The notation P_k is used to represent both a set of clusters and a set of colors, because each cluster C_i ∈ P_k corresponds to a color in our b-coloring based clustering framework, which consists of two steps: 1) generating an initial proper coloring of vertices using a maximum number of
Table 1. A weighted dissimilarity table

vi    A     B     C     D     E     F     G     H     I
A     0
B     0.20  0
C     0.10  0.30  0
D     0.10  0.20  0.25  0
E     0.20  0.20  0.10  0.40  0
F     0.20  0.20  0.20  0.25  0.65  0
G     0.15  0.10  0.15  0.10  0.10  0.75  0
H     0.10  0.20  0.10  0.10  0.05  0.05  0.05  0
I     0.40  0.075 0.15  0.15  0.15  0.15  0.15  0.15  0
colors, and 2) removing each color that has no dominating vertex yet, using a greedy procedure. For illustration purposes, Figure 2 gives the superior threshold graph G_{>0.15} (θ = 0.15) for the dissimilarity Table 1, obtained from a data set {A, B, C, D, E, F, G, H, I}. In G_{>0.15}, edges are labeled with the corresponding dissimilarities. Figures 3 and 4 illustrate the two stages of the b-coloring algorithm performed on the superior threshold graph G_{>0.15}. The algorithm associated with the threshold 0.15 therefore splits the graph into the following classes: C1 = {B}, C2 = {A, D}, C3 = {C, E, G, H, I} and C4 = {F}. The vertices with the same color (shape) are grouped into the same cluster, and the nodes with bold letters are the dominating vertices of the classes. This means that these vertices are linked to at least one vertex in each other color class. The clustering algorithm is iterative and performs multiple runs, each of them increasing the value of the dissimilarity threshold θ. Once all threshold values have been tried, the algorithm provides the optimal partitioning (corresponding to one threshold value θ_o) which maximizes the Generalized Dunn's index (DunnG) [8]. DunnG is designed to offer a compromise between the intercluster separation and the intracluster cohesion, and is therefore well suited to partitioning the data set into compact and well-separated clusters.
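For illustration, a small sketch of the two ingredients used here, the superior threshold graph and the dominating-vertex test, is given below; it assumes a dissimilarity dictionary D keyed by unordered vertex pairs (an assumed data layout) and is not the two-step b-coloring procedure of [4,5] itself.

import itertools

def superior_threshold_graph(D, vertices, theta):
    # Keep only the edges whose dissimilarity exceeds the threshold θ.
    return {frozenset((u, v)) for u, v in itertools.combinations(vertices, 2)
            if D[frozenset((u, v))] > theta}

def is_dominating(vertex, color_of, edges, all_colors):
    # A vertex dominates its color if it is adjacent to every other color.
    neighbours = {w for e in edges if vertex in e for w in e if w != vertex}
    return (all_colors - {color_of[vertex]}) <= {color_of[w] for w in neighbours}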
Fig. 2. The superior threshold graph G>0.15 (θ = 0.15)
Fig. 3. Initial coloring of graph G>0.15 with maximal colors
Fig. 4. The b-coloring of the graph G>0.15 . Four colors are identified.
4 BcSOM: B-Coloring of SOM
This section is devoted to discussing our modification of the b-coloring approach for clustering the self-organizing map by considering the SOM neighborhood relations. In the sequel, we show how we incorporate this kind of topological information into the b-coloring clustering algorithm. The main idea is to use the SOM neighborhood relations to constrain the possible selections of vertex colors when building the b-coloring of the threshold graph. As mentioned above, the b-coloring clustering approach requires a non-complete edge-weighted graph G(V, E_{>θ}) to return a partition P_k of the set of neurons (referent vectors) W = {w_1, w_2, ..., w_m}. The b-coloring of G is carried out using a two-stage procedure, where an initial proper coloring of the vertices is first generated using the maximum number of colors available for G (i.e. Δ + 1, where Δ is the maximal degree of the graph), and then modified by removing colors without any dominating vertex using a greedy procedure. In order to take into account the topological organization provided by SOM in the b-coloring of G, our changes concern the second stage of the b-coloring approach. In fact, after performing the first procedure (initial coloring of G), some assigned colors remain without any dominating vertex. The objective of the second procedure (find the b-coloring of G) is to find a b-coloring of the graph G where all colors are dominating. The idea is the following: each non-dominating color l can be changed. In fact, after removing l from the graph G, for each vertex v_i colored with l, a new color is assigned to v_i which is different from those of its neighborhood (proper coloring condition). The main problem is to find the appropriate color to assign to v_i when there is a choice between several colors for it. We note that the b-coloring result relies on the manner in which colors are selected for such vertices. This choice is constrained to maximize the intracluster homogeneity of the returned partition P_k. To maximize intracluster homogeneity, the sum of vertex dissimilarities within each color has to be minimized.
So the color whose distance to v_i is minimal will be selected if there is a choice between several colors for v_i. In our original b-coloring algorithm [4,5], the distance between the vertex v_i and a color c is defined by the single-linkage distance, i.e., the distance between v_i and the nearest vertex having color c. For the BcSOM approach, this distance relies only on the neighborhood relations between neurons (cf. Figure 1). Hence, the distance between the vertex v_i and a color c is given by the number of neighbors colored with c in the 1-neighborhood of the neuron i related to the vertex v_i (Eq. (2)). In our case, the 1-neighborhood of i is taken to be the set of its 8 nearest neighbors (NN^8(i)) in the two-dimensional topological map (i.e. the red rectangle in Figure 1). By considering the topological organization offered by SOM, the proposed distance is more general and robust than the original one (single-linkage), since it considers the set of 8 nearest neighbors and not only the closest one.

d_8(v_i, c) = \left|\{\, v_j \mid 1 \le j \le n,\; j \in NN^8(i),\; c(v_j) = c \,\}\right|   (2)
where |·| denotes the cardinality. Our idea is summarized in the procedure BcSOM_find_b-coloring(). For that, assuming that the vertices of G are colored, the following notations and routines will be used:
– Δ: the maximum degree of G.
– c(vi): the color (integer value) of the vertex vi in G.
– Nc(vi): the neighborhood colors of vertex vi.
– C: the color set used in the graph (a set of integer values). Initially, since the vertices of G are colored with the maximum number of colors available for G (i.e. Δ + 1), C = {1, 2, ..., Δ + 1}.
– Dm: the set of colors which have dominating vertices.
– NDm: the set of colors that have no dominating vertex.
– Update(Nc(vi)): the method which updates the neighborhood colors of the vertex vi when the color of at least one of its neighbors has changed.
– Enqueue(c, Dm): the method which adds the color c to the color set Dm.
The BcSOM_find_b-coloring procedure reaches a fixed point when the set of colors with no dominating vertices is empty. As a consequence, there exists at least one dominating vertex for each color class. To conclude our proposal, for each dissimilarity threshold θ selected from the dissimilarity table between neurons, the BcSOM approach consists of the following three steps:
1. Construction of the threshold graph G.
2. Initial coloring of G (the same as for the original b-coloring framework).
3. BcSOM_find_b-coloring().
Algorithm 1. Procedure BcSOM_find_b-coloring()
1: NDm := C \ Dm;
2: repeat
3:   c := argmax_{h ∈ NDm};
4:   C := C \ {c};
5:   NDm := C \ Dm;
6:   for each vertex vi such that c(vi) = c do
7:     H := C \ Nc(vi);
8:     c(vi) := argmin_{h ∈ H} (d8(vi, h));
9:   end for
10:  for each vertex vj such that c(vj) ∈ NDm do
11:    Update(Nc(vj))
12:    if (Nc(vj) = C \ {c(vj)}) then
13:      Enqueue(c(vj), Dm);
14:    end if
15:  end for
16: until NDm = ∅
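A literal Python transcription of the recoloring choice in lines 7-8 of Algorithm 1 is shown below, where nn8(i) is assumed to return the indices of the 8 map neighbours of neuron i and color maps each vertex to its current color; it mirrors the argmin written in the algorithm and is only a sketch.

def d8(i, c, color, nn8):
    # Eq. (2): number of 1-neighbourhood neurons of i currently colored with c.
    return sum(1 for j in nn8(i) if color.get(j) == c)

def recolor_vertex(i, allowed_colors, color, nn8):
    # Line 8 of Algorithm 1: pick, among the colors allowed by proper coloring,
    # the one minimising the d8 distance to vertex i.
    return min(allowed_colors, key=lambda c: d8(i, c, color, nn8))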
Proposition 1. The two-level approach using the self-organizing map and our proposed BcSOM approach considerably reduces the runtime of the original b-coloring clustering approach (which we will refer to as Orig-b-col in the remainder of the paper).

Proof. Consider the clustering of n instances (z_i ∈ R^p, i = 1, 2, ..., n) using the original b-coloring clustering approach (Orig-b-col) in [4,5]. This approach generates the b-coloring of any graph G (associated with a threshold value θ) in O(n^2 Δ). When the data set (n instances) is first clustered using the self-organizing map, BcSOM is then applied, during a second step, to cluster the set of m neurons returned by SOM (i.e. W = {w_1, w_2, ..., w_m} where m = 5√n, a heuristic proposed by T. Kohonen for automatically providing the number of neurons in the map). Therefore, BcSOM uses at most ((5√n)^2 × Δ) instructions. The clustering using a two-level approach (self-organizing map and the modified b-coloring approach, i.e. BcSOM) thus decreases the complexity of the b-coloring based clustering approach to O(nΔ).
5 Experiments on Benchmark Data Sets
In this section, we illustrate our algorithm's performance on several relevant benchmark data sets, viz., Hepta, Tetra, Two-Diamonds, Pima, Lsun, Rings and Engytime [9] (cf. Table 2). For each trained SOM, a visual inspection was performed. Then, BcSOM, Orig-b-col (the original b-coloring approach without any neighborhood information) in [4,5], Agglomerative Hierarchical (AHC, in the form of a Ward-based
Table 2. Characteristics of used data sets

Data sets      n     p  #labels
Hepta          212   3  7
Tetra          400   3  4
Two-Diamonds   800   2  2
Pima           768   8  2
Lsun           400   2  3
Rings          1000  3  2
Engytime       4096  2  2
approach) and k-means clustering SOM algorithms were applied (cf. examples of results in Figure 5). A comparison is made based on the quality of the partitioning obtained from them. We recall that the Euclidean distance is used to define the dissimilarity level D between two p-dimensional referent vectors. Moreover, the AHC and k-means approaches were performed for several partitions over the SOM neurons, and for each of them the optimal partition was returned using the Davies-Bouldin quality index [8]. For an objective assessment of the results gained with the different clustering approaches, the following performance indices are used:
– Two dissimilarity-based validity schemes called the Davies-Bouldin and Generalized Dunn's indices [8]. Considered as internal criteria, they offer an idea about both the intracluster similarity (tightness) and the intercluster dissimilarity (separation) of the returned partition.
– Two statistical-matching schemes called Purity and the Adjusted Rand index [10], which concern the clustering accuracy. They are used in order to compare the clustering results against external criteria.
Fig. 5. Results of compared clustering SOM approaches on Tetra and Lsun data sets
5.1 Clustering Validity Indices
Different cluster validity indices have been proposed in the literature [8]. Suppose a data set Ω = {z_1, z_2, ..., z_n} is clustered into a partition P_k = {C_1, C_2, ..., C_k} (i.e. obtained from clustering the SOM neuron set W = {w_1, w_2, ..., w_m}), where C_i ∩ C_j = ∅ for all C_i, C_j ∈ P_k with i ≠ j, and Σ_i η_i = m, where η_i is the cardinality of the cluster C_i. We consider the following validity indices.

The Davies-Bouldin index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The best clustering minimizes the following value:

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_a(C_i) + s_a(C_j)}{d_a(C_i, C_j)}   (3)

The Generalized Dunn's index was proposed to offer a compromise between the intercluster separation and the intracluster cohesion, and is recognized as a good assessment of a data set partition. It is therefore well suited to partitioning the data set into compact and well-separated clusters. Basically, we would like to maximize the generalized Dunn's index to achieve high-quality clustering.

DunnG = \frac{\min_{i,j,\, i \neq j} d_a(C_i, C_j)}{\max_{h} s_a(C_h)}   (4)
where
– C_h, C_i, C_j ∈ P_k,
– s_a(C_i) is the average distance within the cluster C_i corresponding to the dissimilarity measure D. It is given by the following formula:

s_a(C_i) = \frac{1}{\eta_i(\eta_i - 1)} \sum_{u=1}^{\eta_i} \sum_{v=1}^{\eta_i} D(w_u, w_v)   (5)

– d_a(C_i, C_j) is the between-cluster separation, expressed as:

d_a(C_i, C_j) = \frac{1}{\eta_i \eta_j} \sum_{u=1}^{\eta_i} \sum_{v=1}^{\eta_j} D(w_u, w_v)   (6)
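A direct NumPy transcription of Eqs. (3)-(6) is sketched below, assuming clusters is a list of (η_i, p) arrays of referent vectors (each with at least two members) and that D is the Euclidean distance; it is not the authors' implementation.

import numpy as np

def s_a(Ci):                                    # Eq. (5): average within-cluster distance
    n = len(Ci)
    d = np.linalg.norm(Ci[:, None, :] - Ci[None, :, :], axis=-1)
    return d.sum() / (n * (n - 1))

def d_a(Ci, Cj):                                # Eq. (6): between-cluster separation
    return np.mean(np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1))

def davies_bouldin(clusters):                   # Eq. (3)
    return np.mean([max((s_a(Ci) + s_a(Cj)) / d_a(Ci, Cj)
                        for j, Cj in enumerate(clusters) if j != i)
                    for i, Ci in enumerate(clusters)])

def generalized_dunn(clusters):                 # Eq. (4)
    seps = [d_a(Ci, Cj) for i, Ci in enumerate(clusters)
            for j, Cj in enumerate(clusters) if i != j]
    return min(seps) / max(s_a(C) for C in clusters)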
5.2 Clustering Accuracy Indices
In our case, the UCI data sets used include class information (a label) for each data instance. These labels are available for evaluation purposes but are not visible to the clustering algorithm. Remember that the objective was to perform unsupervised classification that correctly identifies the underlying classes when the number of clusters is not predefined. Consequently, evaluation is based on
two statistical-matching schemes called Purity and the Adjusted Rand index, which concern clustering accuracy.

Purity is a simple and transparent evaluation measure. It can be expressed as the percentage of elements of the assigned label in a cluster. To compute this measure, each cluster is assigned to the label which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned instances and dividing by n (the total number of instances).

Purity(P_k) = \frac{1}{n} \sum_{i=1}^{k} \max_{j \in \{1,2,\ldots,k'\}} |C_i \cap L_j|   (7)

where
– C_i is one cluster given from the partition obtained with the clustering algorithm, P_k = {C_1, C_2, ..., C_k}.
– L_j is one class label given from the correct predefined partition L = {L_1, L_2, ..., L_{k'}} (k' is the total number of labels).

On the other hand, the partition obtained using one clustering approach will be assessed by considering relations upon instances: for each pair of instances, either they have the same label or different ones. For a data set with n instances, there are n × (n − 1)/2 unique pairs of instances (v_i, v_j), and thus there are n × (n − 1)/2 pairwise decisions reflected in the obtained partition. Let a be the number of pairs of instances that are placed in the same cluster in P_k and having the same label, b be the number of pairs of instances in the same cluster in P_k but with different labels, c be the number of pairs of instances in different clusters in P_k but with the same label, and d be the number of pairs of instances in different clusters in P_k and with different labels. The quantities a and d can be interpreted as agreements, and b and c as disagreements. The Rand index [11] is defined as:

Rand(P_k) = \frac{a+d}{n(n-1)/2} = \frac{a+d}{a+b+c+d}   (8)

However, since the expected value of the Rand index of two random partitions does not take a constant value (close to 0), this index is biased and no longer recommended in many empirical studies [12]. Consequently, we adopt the Adjusted Rand index proposed by Hubert and Arabie [10] to assess the degree of agreement between two partitions (the one obtained with the clustering algorithm (clusters) and the correct predefined one (labels)). It is measured as:

Rand_a(P_k) = \frac{(a+b+c+d)(a+d) - ((a+b)(a+c) + (c+d)(b+d))}{(a+b+c+d)^2 - ((a+b)(a+c) + (c+d)(b+d))}   (9)
When comparing two clustering algorithms, the one that produces the greater Adjusted Rand index and Purity should be preferred since the partition correctly identifies the underlying classes in the data set.
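Possible implementations of Eqs. (7) and (9) are sketched below for integer label arrays; the pair counts a, b, c, d follow the definitions given above, and the code is illustrative rather than the authors' implementation.

import numpy as np

def purity(labels_true, labels_pred):
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()        # most frequent label in the cluster
    return total / len(labels_true)

def adjusted_rand(labels_true, labels_pred):
    n = len(labels_true)
    same_true = labels_true[:, None] == labels_true[None, :]
    same_pred = labels_pred[:, None] == labels_pred[None, :]
    iu = np.triu_indices(n, k=1)                   # all n(n-1)/2 unique pairs
    a = np.sum(same_true[iu] & same_pred[iu])      # same cluster, same label
    b = np.sum(~same_true[iu] & same_pred[iu])     # same cluster, different labels
    c = np.sum(same_true[iu] & ~same_pred[iu])     # different clusters, same label
    d = np.sum(~same_true[iu] & ~same_pred[iu])    # different clusters, different labels
    m = a + b + c + d
    expected = (a + b) * (a + c) + (c + d) * (b + d)
    return (m * (a + d) - expected) / (m * m - expected)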
Table 3. Evaluation of clustering SOM approaches on Generalized Dunn's index. The number of clusters returned in each case is given between parentheses.

Data sets      k-means      AHC          Orig-b-col   BcSOM
Hepta          2.7030(7)    2.7030(7)    2.7030(7)    2.7030(7)
Tetra          0.7816(4)    1.1784(6)    2.1135(4)    2.1662(4)
Two-Diamonds   1.9678(2)    1.9753(2)    1.9753(2)    1.9754(2)
Pima           0.8499(4)    1.0458(3)    1.2660(2)    1.2697(2)
Lsun           2.1456(5)    1.4380(8)    1.8860(5)    2.2395(5)
Rings          0.9639(10)   1.3313(10)   1.2531(15)   1.3629(13)
Engytime       0.9466(10)   0.7588(10)   1.7220(3)    1.7301(4)

5.3 Results
We report here our experiments using seven relevant benchmark data sets (cf. Table 2) chosen from the UCI database [9]. Tables 3 and 4 provide the clustering results according to the Generalized Dunn's and Davies-Bouldin indices. Both the DunnG and DB measures indicate better clustering for all partitions generated by the proposed BcSOM approach, except for the Hepta data set where we obtain identical results with all approaches. The clusters given by BcSOM are thus compact and well separated. This confirms the pertinence of the graph b-coloring technique associated with the neighborhood information provided by SOM (1) to offer a compromise between the intercluster separation and the intracluster homogeneity, and (2) to improve the results obtained by the original b-coloring approach (without any neighborhood information). Furthermore, in order to better assess the results obtained from the traditional clustering approaches (AHC and k-means), DunnG and DB are evaluated when these approaches generate the same number of clusters that BcSOM provides. Even in this case, the traditional approaches do not reach the results provided by BcSOM. For k-means, the ⟨DunnG, DB⟩ values are computed as ⟨1.1510, 1.7025⟩ (Pima), ⟨1.1427, 0.7979⟩ (Rings), and ⟨1.4409, 1.0907⟩ (Engytime). For AHC, they are computed as ⟨1.9675, 0.9015⟩ (Tetra), ⟨1.1241, 1.7391⟩ (Pima), ⟨1.9120, 0.8610⟩ (Lsun), ⟨1.1427, 0.7979⟩ (Rings), and ⟨1.4410, 1.0384⟩ (Engytime). We note that, for the data sets not cited, the traditional approaches already provide partitions with the same number of clusters as BcSOM. Tables 5 and 6 list the clustering results according to the Adjusted Rand index and Purity rate. Additionally, the purity rates provided by our approach are compared with those returned by two recently proposed clustering SOM techniques [13]: AT-Neigh-W (hierarchical clustering based on artificial ants and a new dissimilarity measure which takes into account the topological order of the referent vectors) and AHC-Neigh-W (an agglomerative hierarchical clustering approach based on the same dissimilarity measure); the results of these two algorithms are taken from [13] and are not reproduced in this paper.
Table 4. Evaluation of clustering SOM approaches on Davies-Bouldin index

Data sets      k-means      AHC          Orig-b-col   BcSOM
Hepta          0.4840(7)    0.4840(7)    0.4840(7)    0.4840(7)
Tetra          1.4261(4)    1.2315(6)    0.8963(4)    0.8808(4)
Two-Diamonds   1.0093(2)    1.0088(2)    1.0088(2)    1.0082(2)
Pima           1.8147(4)    1.7892(3)    1.5690(2)    1.5665(2)
Lsun           0.8412(5)    0.9646(8)    0.8347(5)    0.8316(5)
Rings          0.9682(10)   0.8435(10)   0.7887(15)   0.7839(13)
Engytime       1.1573(10)   1.1594(10)   1.0622(3)    1.0019(4)
According to the Adjusted Rand index measures, we observe that BcSOM always provides the highest values, except for the Hepta data set where we obtain identical results with all approaches. Considering the purity rate measures, BcSOM generally provides better clustering results, except for:
– the Hepta data set, where we obtain identical results with all approaches, but better than AT-Neigh-W (i.e. 43.4%). In fact, all approaches (except the AT-Neigh-W one) correctly identify the correct predefined partition (labels) of this data set;
– Rings, where BcSOM provides the same purity rate as the AHC, AHC-Neigh-W and original b-coloring approaches. However, it is the highest rate;
– Two-Diamonds and Pima, where BcSOM provides the highest purity value of all approaches except the AT-Neigh-W one. These results are expected since AT-Neigh-W provides a larger number of clusters (7 for Two-Diamonds and 5 for Pima) than BcSOM (2 for Two-Diamonds and 2 for Pima). Indeed, high purity is easy to achieve when the number of clusters is large - in particular, purity is 100% if each instance gets its own cluster (this is the main problem of the purity rate scheme). Moreover, we note that BcSOM identifies the same number of clusters as the correct predefined partition (labels), which is respectively 2 for Two-Diamonds and 2 for Pima;
– Engytime, where we observe that k-means, AHC, and AHC-Neigh-W achieve the best purity rates. This result is also expected since the number of clusters returned by these approaches is greater than the one provided by BcSOM.
Consequently, it is observed that BcSOM generally achieves purity rates close to 100% with the smallest number of clusters (generally the same as for the correct predefined partition). It can be concluded that BcSOM generates meaningful clusters by correctly identifying the underlying classes in each data set used. On the other hand, by looking at the results provided by the original b-coloring approach, we deduce that incorporating the neighborhood information (offered by SOM) in our approach increases the clustering accuracy related to the previously discussed measures.
Table 5. Evaluation of clustering SOM approaches on Adjusted Rand index rate (%)

Data sets      k-means      AHC          Orig-b-col   BcSOM
Hepta          100(7)       100(7)       100(7)       100(7)
Tetra          82.69(4)     96.70(6)     98.82(4)     99.76(4)
Two-Diamonds   98.68(2)     99.34(2)     99.34(2)     99.67(2)
Pima           34.56(4)     35.20(3)     34.69(2)     35.50(2)
Lsun           85.57(5)     81.81(8)     86.31(5)     86.85(5)
Rings          56.93(10)    58.63(10)    55.31(15)    59.56(13)
Engytime       57.02(10)    57.07(10)    66.94(3)     72.58(4)
Table 6. Evaluation of clustering SOM approaches on Purity rate (%)

Data sets      k-means     AHC         Orig-b-col  AT-Neigh-W  AHC-Neigh-W  BcSOM
Hepta          100(7)      100(7)      100(7)      43.4(4)     100(7)       100(7)
Tetra          74.75(4)    99.25(6)    98.75(4)    81.75(5)    98.5(4)      99.75(4)
Two-Diamonds   99.50(2)    99.75(2)    99.75(2)    100(7)      96.88(5)     99.86(2)
Pima           66.28(4)    66.28(3)    65.76(2)    72.4(5)     65.10(2)     67.63(2)
Lsun           99.25(5)    99.75(8)    99.75(5)    93(5)       99.25(6)     100(5)
Rings          95.70(10)   100(10)     100(15)     81.5(5)     100(11)      100(13)
Engytime       94.46(10)   94.48(10)   88.16(3)    88.04(7)    93.90(5)     92.75(4)
Using the same number of clusters as BcSOM as an input parameter for k-means, the ⟨Adjusted Rand index, Purity⟩ values are computed as ⟨28.54, 66.02⟩ (Pima), ⟨56.44, 100⟩ (Rings), and ⟨69.62, 91.09⟩ (Engytime). For the same purpose, these measures are computed for AHC as ⟨97.70, 97.50⟩ (Tetra), ⟨34.97, 66.28⟩ (Pima), ⟨85.16, 99⟩ (Lsun), ⟨56.44, 100⟩ (Rings), and ⟨72.20, 93.25⟩ (Engytime). Once again it is BcSOM that generates the best clustering, except for the Engytime data set where AHC generates the highest purity rate.
6 Conclusion
This paper has proposed BcSOM, an extension of the b-coloring based clustering approach to cluster the self-organizing map. The proposed approach uses simultaneously the dissimilarities and the topological relations (provided by SOM) between referent vectors to find a cluster partition of the self-organizing map, where a compromise between the intercluster separation and the intracluster homogeneity is achieved. We have implemented our method, performed experiments, and compared it to other clustering SOM approaches. We have shown significant improvements in clustering quality and runtime, as demonstrated by the results obtained over seven UCI data sets in the form of internal and external criteria, and concluded that combining the power of clustering with the topological organization provided by SOM achieves better performance than either in isolation. Many interesting issues were raised by this work and some are currently being developed: (1) conducting more experiments and comparisons for our algorithm on a
larger image data set, and (2) extending the definition of the distance concept between a referent vector and a color to the N-neighborhood (in the current paper, we only consider the 1-neighborhood) in order to further improve the clustering quality, to name a few.
References
1. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Computing Surveys 31, 264–323 (1999)
2. Kohonen, T.: Self-organizing Maps, vol. 30. Springer, Heidelberg (2001)
3. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11(3), 586–600 (2000)
4. Elghazel, H., Deslandres, V., Hacid, M.S., Dussauchoy, A., Kheddouci, H.: A new clustering approach for symbolic data and its validation: Application to the healthcare data. In: Esposito, F., Raś, Z.W., Malerba, D., Semeraro, G. (eds.) ISMIS 2006. LNCS (LNAI), vol. 4203, pp. 473–482. Springer, Heidelberg (2006)
5. Elghazel, H., Kheddouci, H., Deslandres, V., Dussauchoy, A.: A graph b-coloring framework for data clustering. Journal of Mathematical Modelling and Algorithms 7(4), 389–423 (2008)
6. Irving, R.W., Manlove, D.F.: The b-chromatic number of a graph. Discrete Applied Mathematics 91, 127–141 (1999)
7. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
8. Kalyani, M., Sushmita, M.: Clustering and its validation in a symbolic framework. Pattern Recognition Letters 24(14), 2367–2376 (2003)
9. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases (1998)
10. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)
11. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
12. Milligan, G.W., Cooper, M.C.: A study of comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21(4), 441–458 (1986)
13. Azzag, H., Lebbah, M.: Clustering of self-organizing map. In: European Symposium on Artificial Neural Networks (ESANN 2008), pp. 209–214 (2008)
CSBIterKmeans: A New Clustering Algorithm Based on Quantitative Assessment of the Clustering Quality

Tarek Smaoui, Sascha Müller, and Christian Müller-Schloer

Leibniz Universität Hannover - Institute of Systems Engineering, Appelstraße 4, 30167 Hannover, Germany
{smaoui,mueller,cms}@sra.uni-hannover.de
Abstract. In this paper we introduce a clustering algorithm, CSBIterKmeans (Compactness and Separation Based Iterative k-means), based on the well-known k-means algorithm. Our approach is based on the validation of the clustering result by combining two "antipodal" validation metrics, cluster separation and cluster compactness, to determine autonomously the "best" number of clusters and hence dispense with the number of clusters as an input parameter. We report our first results with a collection of audio features extracted from songs and discuss the performance of the algorithm with different numbers of features and objects. Keywords: cluster analysis, k-means, clustering validation.
1 Introduction
Machine learning is a sub-field of artificial intelligence that deals with techniques to allow computers to learn. Cluster analysis belongs to the procedures of unsupervised machine learning, where no a-priori knowledge for the learning process is needed. The goal is to gain useful information about the structure of a given (complex) pattern set. Patterns (observations, data items, or feature vectors) are divided into natural groups (clusters). According to [1], for objects of the same cluster the following conditions should be fulfilled as far as possible:
1. Share the same or closely related properties
2. Show small mutual distances or dissimilarities
3. Have "contact" or "relations" with at least one other object in the group, or
4. Be clearly distinguishable from the complement, i.e., the rest of the objects in the data set.
Cluster analysis has been continuously developed and is used in many scientific disciplines such as biology, psychology, statistics, pattern recognition, economics and finance.
The clustering process itself involves two major steps: defining a distance measure and choosing an appropriate grouping algorithm. For this work the Euclidean distance is applied. Clustering algorithms can be divided into five major classes: partitioning, hierarchical, density-based, grid-based, and model-based methods [2].

Partitioning methods start with an initial partition of the object set, which they try to improve iteratively. For assessing the quality of interim results, appropriate heuristics may be used. Partitioning algorithms are quite easy to implement and they show good run time, which makes them popular in practice. The crucial drawback is their sensitivity to the start partition. The major issue is to set the "right" number of clusters k as an input parameter. The most popular algorithm of this class is k-means. Further examples are PAM [3] and CLARA [3].

With hierarchical techniques the assignment of patterns to clusters is likewise achieved iteratively, but without an initial partition. This can be done divisively, as a single cluster to which all the objects belong is gradually subdivided into sub-clusters. Alternatively, with agglomerative methods each object is initially considered as a separate cluster. These clusters are then iteratively merged together to eventually build one cluster. The advantage of hierarchical algorithms is that they do not need a start partition to work. SLINK [4], BIRCH [5], CURE [6] and DIANA [3] are well-known algorithms of this class.

The underlying idea of density-based clustering is to regard a cluster as a concentration of related patterns. There exist two different approaches within this class. The first approach looks at the density of the environment and the connectivity to other objects for each object individually. The second approach uses density functions to model a distribution of patterns. Density-based clustering procedures have the advantage that they can detect any cluster shape. DBSCAN [7] and DENCLUE [8] are two examples of density-based methods.

Grid-based methods such as STING [9] and WaveClust [10] subdivide the object space into cells, where the clustering operations are done. This reduces the computing complexity, especially for large object sets. The challenge with these algorithms is how to set the size of the cells optimally. For the fifth class, model-based clustering, a (predetermined) model is assumed, after which clusters are constructed that best fit the model. COBWEB [11] and SOM [12] are two examples to mention.

In this study we focus on partitioning algorithms, in particular the k-means algorithm. We try to improve the effectiveness of the algorithm by deploying clustering validation metrics. Our algorithm works more autonomously, without needing the number of clusters as an input parameter; the clustering result can instead be controlled by the user. In addition, it is more likely to find the natural groups in the data. This is explained in detail in Chapter 2. Chapter 3 demonstrates our approach by means of empirical studies. Finally, we summarize and conclude our work in Chapter 4.
2 The CSBIterKmeans Algorithm
The paramount goal of the CSBIterKmeans algorithm is naturally to generate a clustering with the best possible structure. As the name suggests, it is based on the well-known k-means algorithm. k-means tries to find the optimal subdivision
of the data set for a given number of clusters. The clusters are represented by their centroids. The objects are assigned to their nearest centroid. This is carried out iteratively until no reassignment takes place or the maximum number of iterations is reached. The major issue thereby is whether the predetermined number of clusters corresponds to the natural one. The basic idea of CSBIterKmeans is to apply k-means multiple times with different parameters and pick the best result (with respect to the number of clusters). To decide which result is the best, an internal evaluation criterion is necessary. For this criterion it is sufficient to use quantities and features inherited from the object set [1]. In order to ensure the comparability between two clustering results, we have to make sure that the output of k-means does not additionally depend on the initial partition of the object set.

2.1 Evaluation Function
Evaluation functions or measures may be divided into external and internal measures. External measures, such as the one proposed in [13], try to assess how well the clustering results match some prior knowledge found in the so-called ground-truth set. Due to the lack of such prior knowledge, only internal measures are considered here. Internal measures are based solely on the data set and the clustering schema. Many of these (internal) evaluation measures are designed to compare partitions with the same number of clusters. That is why most of them are unable to compare two partitions with different sizes objectively; they tend to prefer a specific extreme value for the number of clusters. To avoid this problem, CSBIterKmeans uses a composition of two antipodal functions, the weighted sum of Cluster Compactness and Cluster Separation presented in [14] (arguably the terms Discompactness and Deseparation would be more intuitive, since a high value means low compactness and separation, respectively), and tries to find a 'trade-off' clustering (deliberately 'clustering' and not 'number of clusters') in terms of these measures. The compactness is a measure of the cohesion of the objects in every single cluster. It indicates how well the clustering algorithm partitions the object set into distinct clusters in terms of object homogeneity, reflected by the mean normalized variance (v(Ci)/v(O)). The higher this value is, the lower the average cohesion of the clusters. Since this measure alone is not sufficient, as the 'best' compactness is obtained iff every object is regarded as a single cluster, a complementary measure, the Cluster Separation, had to be introduced. The Cluster Separation, as given by formula (2) below, analyzes how distinct every pair of cluster centroids is and therefore how dissimilar (separated) the two clusters themselves are. The Gaussian function and the L1-normalization also normalize its value to between 0 and 1, as stated in [14]. A smaller value means a larger dissimilarity. For a set of clusters C, compactness and separation are defined as:
Compactness 2 3 4
1 v(Ci ) Cmp = |C| i v(O)
(1)
With respect to the number of clusters. Probably the terms Discompactness and Deseparation would be more intuitive, since a high value means a low compactness and separation respectively. We use deliberately at this point the term ’clustering’ and not ’number of clusters’.
340
T. Smaoui, S. Müller, and C. Müller-Schloer
with
Separation
Sep =
|X| 1 v(X) = d2 (xi , x ¯) |X| i=1 |C| |C| 1 d2 (oCi , oCl ) exp(− ) |C|(|C| − 1) i=1 2σ 2
(2)
l=1,l =i
X denotes a set of objects, O the whole object set, σ is the standard deviation and d(x, y) the Euclidean distance. Finally the overall cluster quality is defined as: Ocq(β) = βCmp + (1 − β)Sep
(3)
where β ∈ [0, 1] is the weight. 2.2
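A minimal sketch of these evaluation measures, assuming NumPy arrays, Euclidean distances, and that `clusters` is a list of per-cluster arrays while `data` holds the whole object set; `sigma` corresponds to the standard deviation used in (2). This is an illustration, not the authors' implementation.

```python
import numpy as np

def variance(X):
    # v(X) = (1/|X|) * sum_i d^2(x_i, x_bar)
    centroid = X.mean(axis=0)
    return np.mean(np.sum((X - centroid) ** 2, axis=1))

def compactness(clusters, data):
    # Cmp, formula (1): mean normalized cluster variance
    v_all = variance(data)
    return np.mean([variance(C) / v_all for C in clusters])

def separation(clusters, sigma):
    # Sep, formula (2): Gaussian similarity of all centroid pairs
    cents = [C.mean(axis=0) for C in clusters]
    k = len(cents)
    total = sum(np.exp(-np.sum((cents[i] - cents[l]) ** 2) / (2 * sigma ** 2))
                for i in range(k) for l in range(k) if l != i)
    return total / (k * (k - 1))

def ocq(clusters, data, sigma, beta):
    # Ocq(beta), formula (3): weighted sum of both measures
    return beta * compactness(clusters, data) + (1 - beta) * separation(clusters, sigma)
```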
2.2
K-Means Initialization
It is quite obvious that the compactness Cmp will be minimal if every cluster consists of only one object. The separation Sep would theoretically be minimal if there were only one cluster. So we expect Cmp to decrease and Sep to increase if the number of clusters is incremented from k to k + 1. This behavior is quite common but not always given. The overall cluster quality, however, depends on the clusters (partition) created by k-means, which itself depends heavily on the initialization. To avoid the jitter caused by a random initialization, k-means should be initialized by a deterministic method. Such a deterministic method was presented by Kaufman in [3]. A study [15] showed that this approach performs very well. Originally it was developed to initialize the PAM (Partitioning Around Medoids) algorithm [3]. The basic idea is to select k representative objects. The first one is the most centrally located object. Each further representative is chosen depending on its neighborhood: it should be the object to which the largest number of the remaining objects would be assigned, and so on. The pseudocode in Algorithm 1 illustrates the initialization with k centroids according to Kaufman.

Algorithm 1. Kaufman initialization method
1: The first centroid is the most centrally located object
2: The next centroid is the object i which maximizes Σ_l C_li, where C_li = max{min_s d_sl − d_li, 0}, s is a centroid, l is a non-centroid object, min_s d_sl is the Euclidean distance of l to its nearest centroid and d_li is the Euclidean distance between the two objects
3: While not k centroids have been selected, repeat step 2
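A hedged sketch of this initialization, assuming a NumPy array X of objects; the function name and loop structure are illustrative, not the authors' code.

```python
import numpy as np

def kaufman_init(X, k):
    # pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # first centroid: the most centrally located object
    centroids = [int(np.argmin(D.sum(axis=1)))]
    while len(centroids) < k:
        rest = [i for i in range(len(X)) if i not in centroids]
        best, best_gain = rest[0], -1.0
        for i in rest:
            # C_li = max(min_s d_sl - d_li, 0): objects l that would rather join i
            gain = sum(max(D[l, centroids].min() - D[l, i], 0.0)
                       for l in rest if l != i)
            if gain > best_gain:
                best, best_gain = i, gain
        centroids.append(best)
    return X[centroids]
```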
2.3
The Clustering Process
CSBIterKmeans has to be applied to an object set with n ≥ 4 objects. The idea is to find the best number of clusters to use with respect to the evaluation
measures Cmp and Sep. We proceed according to the nested-intervals principle, where in each step we halve the size of an interval whose endpoints are a minimum and a maximum number of clusters. In each step we check for which of the two new half-intervals the difference of the measures Cmp and Sep at its endpoints is larger. For the next step (iteration) we then consider the corresponding half-interval. Algorithm 2 determines the cluster structure. The operation mode of CSBIterKmeans is further depicted by the flowchart in Figure 1.

Algorithm 2. CSBIterKmeans
1: CSBIterKmeans(O, β) {O is the object set. The functions cmp(), sep() and ocq() are used to calculate formulas (1), (2) and (3)}
2: min_cmp ← 0
3: min_sep ← 0
4: max_ocq ← +∞
5: max_c ← {}
6: k ← |O|/2
7: step_width ← |O|/4
8: while step_width ≥ 1 do
9:   C ← k_means(O, k) {run k-means using Kaufman initialization}
10:  if ocq(C, β) < max_ocq then {new best clustering found}
11:    max_ocq ← ocq(C, β)
12:    max_c ← C
13:  end if
14:  if ((1 − β)·sep(C) − min_sep) > (β·cmp(C) − min_cmp) then
15:    min_cmp ← β·cmp(C)
16:    k ← k + round(step_width) {assumption: more clusters might improve the overall quality}
17:  else
18:    min_sep ← (1 − β)·sep(C)
19:    k ← k − round(step_width) {assumption: fewer clusters might improve the overall quality}
20:  end if
21:  step_width ← step_width/2
22: end while
23: return max_c
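The same search loop as a runnable sketch. It reuses the compactness/separation/ocq helpers sketched in Section 2.1 and assumes a k_means(O, k) routine with Kaufman initialization that returns a list of clusters; all names are illustrative.

```python
def csb_iter_kmeans(O, beta, sigma, k_means):
    min_cmp, min_sep = 0.0, 0.0
    best_ocq, best_C = float("inf"), None
    k, step = len(O) // 2, len(O) / 4
    while step >= 1:
        C = k_means(O, max(2, int(round(k))))      # clustering for the current k
        q = ocq(C, O, sigma, beta)
        if q < best_ocq:                           # new best clustering found
            best_ocq, best_C = q, C
        if (1 - beta) * separation(C, sigma) - min_sep > beta * compactness(C, O) - min_cmp:
            min_cmp = beta * compactness(C, O)
            k += round(step)                       # assume more clusters help
        else:
            min_sep = (1 - beta) * separation(C, sigma)
            k -= round(step)                       # assume fewer clusters help
        step /= 2
    return best_C
```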
3
Empirical Results
Our implementation of CSBIterKmeans is based on the RapidMiner framework [16], which provides useful data structures and a fully implemented k-means algorithm. To handle a real-world data set, we had to modify the evaluation function at some points. The first problem is that compactness and separation usually do not fall into the same range. That is why we scaled them to an interval from 0 to 1 in an initial step. The other and more difficult problem was that compactness and separation generally do not have the same curve progression either.

Fig. 1. Operation mode of CSBIterKmeans

Fig. 2. Modified compactness and separation as function of the number of clusters
So we applied some transformations to make them look more linear. We achieved the best results by extracting the square root of the compactness and multiplying the separation by the number of clusters (actually we just omitted the division in equation (2)). The transformed values will be denoted as Cmp' and Sep'. We analyzed several sets of objects and figured out that these transformations generally yielded the most linear curves (it still remains to be examined whether this works for all object sets). Our test data came from the Technical University of Dortmund. It is a collection of numerical audio features (like peak values, average frequencies, loudness and others) extracted from songs from garageband.com. The entire data set contains 1886 elements with more than 50 attributes. The feature extraction is presented in [17]. Figure 2 shows the graphs of the compactness and separation measurement of an object set with 8 features and 200 objects after the manipulation. The clustering result (including the number of clusters) can be controlled by the weight β, as shown in Figure 3.
Fig. 3. Influence of β on a two-dimensional data set (panels for β = 0.7 with 38 clusters, β = 0.6 with 37 clusters, β = 0.5 with 26 clusters, β = 0.4 with 14 clusters, β = 0.3 with 14 clusters, and β = 0.2 with 5 clusters)
(Plot: time in seconds versus number of objects (0-500), with separate curves for 6, 8, 10 and 12 attributes.)
Fig. 4. Time complexity of CSBIterKmeans with different number of attributes and β = 0.5
Table 1. Comparison of various clustering algorithms with CSBIterKmeans, with n = number of objects, k = number of clusters, s = size of sample, I = number of iterations [18]

Algorithm     | Input Parameters                                                                  | Optimized For                                             | Cluster Shape                      | Time Complexity
k-means       | Number of clusters                                                                | Separated clusters                                        | Spherical                          | O(Ikn)
PAM           | Number of clusters                                                                | Separated clusters, small data sets                       | Spherical                          | O(Ik(n − k)^2)
CLARA         | Number of clusters                                                                | Relatively large data sets                                | Spherical                          | O(ks^2 + k(n − k))
BIRCH         | Branching factor, diameter threshold                                              | Large data sets                                           | Spherical                          | O(n)
CURE          | Number of clusters, number of cluster representatives                             | Arbitrary shapes of clusters, relatively large data sets  | Arbitrary                          | O(n^2 log n)
DBSCAN        | Radius of clusters, minimum number of points in clusters                          | Arbitrary shapes of clusters, large data sets             | Arbitrary                          | O(n log n)
DENCLUE       | Radius of clusters, minimum number of objects                                     | Arbitrary shapes of clusters, large data sets             | Arbitrary                          | O(n log n)
STING         | Number of cells in lowest level, number of objects in cell                        | Large spatial data sets                                   | Vertical and horizontal boundaries | O(n)
WaveCluster   | Number of cells for each dimension, wavelet, number of applications of transform  | Arbitrary shapes of clusters, large data sets             | Arbitrary                          | O(n)
CSBIterKmeans | Optionally the weight β                                                           | Separated clusters, small data sets                       | Spherical                          | O(n^2 log n)
We applied 6 different values for β on a two-dimensional data set with 100 objects. In each step we changed β by one tenth. Nevertheless, the resulting number of clusters changed irregularly. From β = 0.4 to β = 0.3 the number of clusters did not change at all; obviously the location of the local optimum does not change in some cases. Even more important is the fact that the changes never go in the wrong direction, because that would confuse the user very much. The CSBIterKmeans algorithm shows a non-linear run-time behavior, so its application is rather appropriate for small data sets. Figure 4 gives an overview of some performance measurements. All measurements were taken on an Intel Core 2 Duo processor [email protected] with 2 GB RAM. As can be gleaned once more from Table 1, the advantage of our algorithm consists in the abandonment of any kind of input parameters, whereas the quality of the clustering can be intuitively influenced by the user. This is consistent with the subjective nature of the clustering task as such.
4
Conclusion and Discussion
We have proposed a new clustering algorithm based on the popular k-means algorithm and notions of cluster evaluation found in the literature. Our algorithm is characterized by the fact that it can operate autonomously, without needing the number of clusters as an input parameter. Furthermore, the user can (optionally) control the overall clustering quality by intuitively setting one simple parameter and thereby express a personal preference without needing technical understanding of the clustering process. This makes our algorithm very user-friendly. Furthermore, the algorithm is stable and not sensitive to the initialization issue. However, there is still room for improvement concerning performance, as a first empirical analysis has shown that the algorithm does not scale well with increasing data space complexity (number of objects and dimensionality). As the algorithm relies on applying k-means iteratively, it stands to reason to avoid running k-means from scratch in each iteration and instead derive the new clustering from the previous results. Future work will also consist of integrating CSBIterKmeans into the InfoKiosk project [19], which deals with user-friendly representation of complex information spaces. There it should replace a commonly used clustering algorithm, with more user acceptance and satisfaction in mind.
References 1. Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications, pp. 6–7, 303. SIAM, Philadelphia (2007) 2. Han, J.W., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco (2001) 3. Kaufman, L., Rousseeuw, P.: Finding Groups in Data - An introduction to Cluster Analysis (1990) 4. Sibson, R.: SLINK: An optimally efficient algorithm for the single link cluster method. The Computer Journal (1973) 5. Zhang, T., Ramakrishnan, R., Livney, M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: SIGMOD 1996. ACM, New York (1996) 6. Sudipto, G., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proc. of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 73–84 (1998) 7. Ester, M., Krigel, H.P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, WA, pp. 226–231 (1996) 8. Hinneburg, A., Keim, D.A.: An Efficient Approach to Clustering in Multimedia Databases with Noise. In: Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, New York, pp. 58–65 (1998) 9. Wang, W., Yang, J., Muntz, M.: STING: A statistical information grid approach to spatial data mining. In: Proc. Int. Conf. on Very Large Data Bases (VLDB 1997), pp. 186–195 (1997) 10. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In: Proc. of the 24th International Conference on Very Large Databases, San Francisco, CA, pp. 428–439 (1998) 11. Fisher, D.H.: Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning 2, 139–172 (1987) 12. Kohonen, T.: Self-Organization and Associative Memory. Springer, New York (1988)
13. Dom, B.E.: An Information-Theoretic External Cluster-Validity Measure. Research Report. IBM T.J. Watson Research Center RJ 10219 (2001) 14. Wu, W., He, J., Tan, A., Tan, C.: On Quantitative Evaluation of Clustering Systems. Kluwer Academic Publishers, Dordrecht (2002) 15. Pena, J., Lozano, J., Larranaga, P.: An empirical comparison of four initialization methods for the K-Means algorithm (1999) 16. http://rapid-i.com/ (last update: 2008) 17. Mierswa, I., Morik, K.: Automatic Feature Extraction for Classifying Audio Data (2005) 18. Andritsos, P.: Data Clustering Techniques Qualifying Oral Examination Paper (2002) 19. Smaoui, T., Müller-Schloer, C.: InfoKiosk: Towards User Adaptivity in Complex Information Spaces (to be published, 2009)
Agent-Based Non-distributed and Distributed Clustering

Ireneusz Czarnowski and Piotr Jędrzejowicz
Department of Information Systems, Gdynia Maritime University, Morska 83, 81-225 Gdynia, Poland
{irek, pj}@am.gdynia.pl

Abstract. The paper deals with non-distributed and distributed clustering and proposes an agent-based approach to solving clustering problem instances. The approach is an implementation of the specialized A-Team architecture called JABAT. The paper includes an overview of JABAT and a description of the agent-based algorithms solving the non-distributed and distributed clustering problems. To evaluate the approach, a computational experiment involving several well-known benchmark instances has been carried out. The results obtained by the JABAT-based algorithms are compared with the results produced by the non-distributed and distributed k-means algorithm. It has been shown that the proposed approach produces, as a rule, better results and has the advantage of being scalable, mobile and parallel.
1
Introduction
During the last decade, a number of agent-based solutions have been proposed to solve different types of optimization problems [2],[13],[14]. One of the successful approaches to agent-based optimization is the concept of A-Teams. An A-Team is composed of simple agents that demonstrate complex collective behavior. The A-Team architecture was originally proposed by Talukdar [21] as a set of objects including multiple agents and memories which through interactions produce solutions of optimization problems. The paper proposes applying the A-Team concept to obtain solutions to the non-distributed and distributed clustering problems. Clustering can be considered the most important unsupervised learning problem. A loose definition of clustering could be "the process of organizing objects into clusters, which are groups whose members are similar in some way". More precisely, clustering is the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis, marketing, multimedia and bioinformatics. While "classic" data mining algorithms invariably operate on centralized data, in practice related information is often acquired and stored at physically distributed locations due to organizational or operational constraints. Centralization of such data before analysis may not be desirable because of computational
or bandwidth costs. In this paper, clustering of centralized data is referred to as the non-distributed clustering problem, as opposed to distributed clustering, where the relevant data are stored at distributed locations. There are a number of problems with clustering. Dealing with a large number of dimensions and a large number of data items can be problematic because of the time complexity. The effectiveness of a clustering method depends on the definition of "distance". The result of the clustering algorithm, which in many cases can be arbitrary itself, can be interpreted in different ways. The distributed clustering case renders the problem even more difficult. It assumes that the objects to be clustered reside on different sites. The data are clustered independently at the different locations. In a subsequent step, the central site tries to establish a global clustering based on the local models, i.e. the representatives. This is not an easy step, as there might exist dependencies between objects located on different sites which are not taken into consideration during the construction of the local models. Hence, distributed clustering is carried out on two different levels, i.e. the local level and the global level. On the local level, all sites carry out a clustering independently from each other. After the clustering has been completed, a global model is determined which should reflect an optimum trade-off between complexity and accuracy [8],[15],[5]. To cope with the above listed problems it is proposed to use a set of agents cooperating within an A-Team. Such an approach could be helpful in solving some of the complexity problems by allowing parallel activities of agents carried out at the local level. It is also expected that the agent-based approach can be helpful in easing the remaining difficulties, by taking advantage of agent-based system features including scalability, the peer-to-peer processing paradigm and distributed computational intelligence. In this paper a middleware environment developed by the authors and referred to as JABAT (JADE-based A-Team) is used to implement the proposed approach. The paper contains a short overview of the functionality and structure of JABAT as well as the formulation of the non-distributed and distributed clustering problems. The main part of the paper focuses on the proposed JABAT architecture, which has been designed with a view to solving both kinds of the clustering problem. To validate the approach, a computational experiment has been carried out. Its results are compared with the results produced by the non-distributed and distributed k-means algorithm. Conclusions focus on evaluation of the proposed approach and on suggestions for further research.
2
Overview of the JABAT
The JADE-based A-Team environment (JABAT), described in more detail in [3], is a middleware supporting the construction of dedicated A-Team architectures used for solving a variety of computationally hard optimization problems. JADE is an enabling technology for the development and run-time execution of peer-to-peer applications which are based on the agent paradigm and which can seamlessly work and interoperate both in wired and
wireless environment [4]. JADE allows each agent to dynamically discover other agents and to communicate with them according to the peer-to-peer paradigm. The central problem in the design of the multi-agent system is how much intelligence to place in the system and at what level. As it was observed in [11], the vast majority of the work in this field has focused on making agents more knowledgeable and able. This has been achieved by giving the deliberative agent a deeper knowledge base and ability to reason about data, giving it the ability to plan actions, negotiate with other agents, or change its strategies in response to actions of other agents. At the opposite end of the spectrum lie agent-based systems that demonstrate complex group behavior, but whose individual elements are rather simple. The JABAT belongs to the latter class. It does not provide ready answers to questions on how population of agents should be selected, which agents work best in combination, how should agents decide when to act and on which solutions or what should be the strategy for destroying unwanted solutions? Instead it offers tools to easily implement variety of strategies and solutions when dealing with the above listed questions, through providing a flexible framework and a set of predefined classes. The environment, when properly used, is expected to be able to produce solutions to difficult optimization problems through applying the following general rules: - To solve difficult optimization problems use a set of agents, each representing an improvement algorithm. - To escape getting trapped into a local optimum generate or construct an initial population of solutions called individuals, which, during computations will be improved by agents, thus increasing chances for reaching a global optimum. Main functionality of the proposed environment is searching for the optimum solution of a given problem instance through employing a variety of the solution improvement algorithms including, for example random and local search techniques, greedy construction algorithms, genetic algorithms etc. The search involves a sequence of the following steps: - Generating an initial population of solutions. - Applying solution improvement algorithms which draw individuals from the common memory and store them back after attempted improvement, using some user defined replacement strategy. - Continuing reading-improving-replacing cycle until a stopping criterion is met. To perform the above two classes of agents are used. The first class includes OptiAgents, which are implementations of the improvement algorithms. The second class includes SolutionManagers, which are agents responsible for maintenance and updating of individuals in the common memory. All agents act in parallel. Each OptiAgent is representing a single improvement algorithm (simulated annealing, tabu search, genetic algorithm, local search heuristics etc.). An OptiAgent has two basic behaviors defined. The first is sending around messages on
readiness to undertake an improvement action. The message also includes information on the number of solutions required to enable such an improvement action. The second is activated upon receiving a message from some SolutionManager containing the problem instance description together with the required number of individuals. This behavior involves improving the fitness of individuals and resending the improved ones to the sender. A SolutionManager is brought to life for each problem instance. Its behavior involves sending individuals to OptiAgents and updating the common memory. The main assumption behind the proposed approach is its independence from the problem definition and the solution algorithms. Hence, the main classes Task and Solution, upon which agents act, have been defined at a rather general level. Interfaces of both classes include the function ontology(), which returns JADE's ontology designed for the classes Task and Solution, respectively. An ontology in JADE is a class enabling definition of the vocabulary and semantics for the content of message exchange between agents. More precisely, an ontology defines how the class is transformed into the text message exchanged between agents and how the text message is used to construct the class (here either Task or Solution).
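The read-improve-replace cycle described above can be reduced to a framework-agnostic sketch. JABAT realizes it with JADE agents exchanging messages; here OptiAgents are represented by plain improvement functions and the SolutionManager's common memory by a list. All names are illustrative assumptions, not JABAT's API.

```python
import random

def a_team_search(initial_population, improvers, evaluate, iterations=1000):
    memory = list(initial_population)          # common memory of individuals
    for _ in range(iterations):
        improve = random.choice(improvers)     # an "OptiAgent" announces readiness
        idx = random.randrange(len(memory))    # the manager hands out an individual
        candidate = improve(memory[idx])       # attempted improvement
        if evaluate(candidate) <= evaluate(memory[idx]):
            memory[idx] = candidate            # simple replacement strategy
    return min(memory, key=evaluate)           # best solution found so far
```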
3
JABAT-Based Approach to Clustering Problem
3.1
The Clustering Problem
The clustering problem (CP) can be defined as follows. Given the set of N data objects, partition the data set into k clusters, such that similar objects are grouped together and objects with different features belong to different groups. Clustering arbitrary data into clusters of similar items presents the difficulty of deciding what similarity criterion should be used to obtain a good clustering. It can be shown that there is no absolute ,,best” criterion which would be independent of the final aim of the clustering. Euclidean distance and Manhattan distance are probably the most commonly chosen measures of similarity [23], [10]. Partition defines the clustering by giving for each data object the cluster index of the group to which it is assigned. The goal is to find such a partition that minimizes an objective function, which, for instance, is the sum of squared distances of the data objects to their cluster representatives. K -means [12] is one of the best known and simplest unsupervised learning algorithms that solves the clustering problem. In this paper the k -means is used as the reference algorithm. The procedure follows a simple and easy way to partition a given data set into a certain number of clusters (assume k clusters) fixed a priori. The algorithm is composed of the following steps: - Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids. - Assign each object to the group that has the closest centroid. - When all objects have been assigned, recalculate the positions of the k centroids.
- Repeat Steps 2 and 3 until the centroids no longer move. This produces an allocation of objects into groups from which the metric to be minimized can be calculated.

To validate the solution produced by the k-means algorithm (as well as any other distance-based approach) the silhouette technique can be used (see the sketch at the end of this section). The silhouette validation technique [20] calculates the silhouette width for each sample, the average silhouette width for each cluster and the overall average silhouette width for the total data set. Using this approach, each cluster can be represented by the so-called silhouette, which is based on the comparison of its tightness and separation. The average silhouette width can be used to evaluate the clustering validity and to decide whether the number of clusters has been selected properly.

In the traditional approach, clustering algorithms are used based on the assumption that all data can be pooled together in a centralized data repository. In real life there are, however, numerous cases where the data have to be physically distributed due to some constraints. As a consequence, distributed clustering has recently been attracting a lot of attention, as there are many cases where pooling distributed data for clustering is not feasible, due to huge data volume, data privacy or other reasons, and applying the traditional clustering tools might not be possible [9]. Hence, clustering and knowledge discovery from multiple databases has become an important research field and is considered a more complex and difficult task than clustering and knowledge discovery from mono-databases [17],[22]. Generally, distributed clustering is a dynamically growing area, but many algorithms for distributed data mining have their roots in algorithms originally designed and developed for parallel data mining, where an algorithm starts with the complete data set and then distributes the data among the different resources [8]. In this way the parallel version of k-means was introduced in [19].

Distributed clustering assumes that the instances to be clustered are located on different sites and the clustering process is of the two-stage type: first performing local data analysis and then combining the local results to form the global one [8]. In the first stage, instances are clustered at the local sites applying standard algorithms. Next, the local models, i.e. the so-called representatives, are transmitted to the central site, where the global clustering process is run and the global model is created from the local representatives [8]. For creating the global model from the local representatives a specialized technique is needed. The main aim of such a technique is a suitable aggregation of the local results. Sophisticated approaches to the combination of different clustering results include the ensemble methods [18]. In general, using ensembles for clustering can improve the quality of the results. However, aggregation strategies are very often based on the idea of minimizing "average" dissimilarity, and when only the local cluster representatives are used this leads to optimization problems which in general are computationally hard [6]. Although a variety of methods for aggregating local results have been proposed in the literature so far (see, for example, [6], [8], [20], [23]), no single approach can be considered superior or guaranteed to give satisfactory results.
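A hedged sketch of the silhouette width computation referred to above, assuming a precomputed pairwise distance matrix D and at least two clusters; it illustrates the standard definition from [20], not the routine used in JABAT.

```python
import numpy as np

def silhouette_widths(D, labels):
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0        # tightness (cohesion)
        b = min(D[i, labels == c].mean()                    # separation from other clusters
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# overall average silhouette width used as the global quality measure:
# silhouette_widths(D, labels).mean()
```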
3.2
The Agent-Based Non-distributed Clustering
In this paper we propose a JABAT-based architecture with a view to solving non-distributed and distributed clustering problems. All the required classes are defined and placed in the package called CP (Clustering Problem). The CP is represented by the following classes:
- CP_Task, inheriting from the Task class.
- CP_Solution, inheriting from the Solution class.
Both classes have been designed by overriding the default JABAT class constructors and other functions to assure compatibility between the agents' actions and the problem instance requirements. To assure appropriate communication between optimization agents and the solution manager, the CP_TaskOntology, inheriting from the TaskOntology, has been defined. In fact, it has been possible to simplify the default JABAT ontology by restricting the communication process to sending, from the task manager agent to the optimization agents, the basic task parameters including the number of clusters, attributes and instances, as well as the location of the respective dataset. An optimization agent, after having received such a message, reads the data from the respective file specified in the message. Communication in the opposite direction, from the optimization agents, involves resending an improved solution. The role of CP_Solution is to store and maintain the population of solutions to the clustering problem instance. A solution of the problem is represented as a permutation of vector numbers allocated to each cluster. Such a representation also includes the number of elements in each cluster and the value of the objective function. The proposed approach is based on employing three kinds of optimization agents: random local search, hill-climbing local search and tabu search [7] agents. All optimization agents have been implemented in the form of classes inheriting from the OptiAgent class. The role of an optimization agent is to try to improve solutions represented by individuals in the population maintained by the respective solution manager. More precisely, optimization agents try to improve the fitness function value of those individuals which have been forwarded to them by the solution manager. Pseudo-codes for all three categories of optimization agents are shown in Fig. 1, 2 and 3. Each optimization agent carries out a fixed number of iterations. If, during this search, a better solution has been found, the search stops and the improved solution is transmitted to the respective solution manager. Otherwise, agents stop searching for an improvement after having completed the prescribed number of iterations.
3.3
Determination of the Number of Clusters
Fig. 1. Pseudo-code showing the random local search agent

Fig. 2. Pseudo-code showing the hill climbing local search agent

The proposed JABAT architecture has been designed to determine both the number of clusters and the allocation of objects to clusters. This is achieved
through repeatedly searching for the best allocation of objects to clusters for different variants of the number of clusters. Allocation generated by JABAT for each variant of the cluster number is then evaluated using the overall average silhouette width as the global quality measure with a constraint on the allowed number of clusters with the negative value of the silhouette factor [20]. It should be noted that the overall average silhouette width is used to evaluate whether
the number of clusters has been properly selected and not how well a particular clustering problem has been solved. JABAT agents, acting within a predefined time slot, attempt to solve in parallel k* independent clustering problems, where k* is the arbitrarily determined maximum number of clusters. After the allowed time has elapsed, the solutions obtained for different numbers of clusters can be evaluated and compared with solutions produced by the k-means algorithm. This comparison is based on the square error of the cluster dispersion used as a common quality criterion.

Fig. 3. Pseudo-code showing the tabu search agent
<problemname>Distributed Clustering
<solutionmanager>CP_SolutionManager
http://RemoteSite#1/data.html
http://RemoteSite#2/data.html
http://RemoteSite#3/data.html
http://RemoteSite#n/data.html
OptiAgentLocalSearch
OptiAgentLocalSearchHillClimbing
OptiAgentTabuSearch
Fig. 4. Example definition of the distributed clustering problem in JABAT
3.4
The Agent-Based Distributed Clustering
The above described population-based approach to solving instances of the non-distributed clustering problem has been extended to cover the distributed clustering case. It has been assumed that an instance of the non-distributed clustering problem can be identified and defined on the site where JABAT is physically implemented. Such a definition requires identifying and describing the location of the data, using a language understandable to JABAT. Figure 4 depicts an example XML file defining an instance of the distributed clustering problem. The file should contain the name of the respective solution manager, as well as the URL addresses of the distributed data sets. The search for a solution to the distributed problem is carried out on two levels, the local and the global one, as shown in Figure 5.
Fig. 5. The JABAT architecture for the distributed clustering
Fig. 6. Example distributed clustering problem at the global level (objects from two locations labeled with a cross and triangle, respectively)
The local clustering aims at finding the optimum solution at the local level. The optimization criterion would be some clustering quality measure such as, for instance, the square error of the cluster dispersion (SE). Using JABAT allows for agent migration to the locations where the distributed data are physically stored. While such a migration is not always necessary, since the URL addresses of the distributed data sets are known at the problem definition stage, migrating agents may help to make better use of the available computational resources. Besides, direct access to the data may prove advantageous from the data safety point of view. A solution to the distributed clustering problem at the local level is represented by the following data: the number of clusters; for each cluster a set of object numbers; for each cluster the value of its representative (its centroid); for each cluster the value of its e-range (the maximum distance between the centroid and an object belonging to the cluster); and the value of the clustering quality measure. Solutions from the local level are used to obtain the global solution. This is done by merging the local level solutions. Whenever a cluster e-range calculated at the local level is fully covered by the e-range of another cluster from the local level and a different location, the former is absorbed by the latter, as shown in Fig. 6, where the cluster with the smallest e-range from the location marked by a triangle is absorbed by the cluster with the largest e-range from the location marked by a cross. In all other cases the proposed procedure for non-distributed clustering is run with all centroids, excluding the absorbed ones, being treated as objects which need to be allocated to clusters, the number of which is determined by using the average silhouette width technique.
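A minimal sketch of the absorption test at the global level, under the assumption that "fully covered" means the e-range ball of one local cluster lies entirely inside the e-range ball of a cluster from another site; centroids are assumed to be NumPy arrays and the helper names are illustrative.

```python
import numpy as np

def is_absorbed(centroid_a, e_a, centroid_b, e_b):
    # cluster a is absorbed by cluster b if the ball of radius e_a around a's
    # centroid is fully covered by the ball of radius e_b around b's centroid
    return np.linalg.norm(centroid_a - centroid_b) + e_a <= e_b

def surviving_representatives(local_models):
    # local_models: list of (centroid, e_range) pairs collected from all sites
    kept = []
    for i, (c_i, e_i) in enumerate(local_models):
        absorbed = any(is_absorbed(c_i, e_i, c_j, e_j)
                       for j, (c_j, e_j) in enumerate(local_models) if j != i)
        if not absorbed:
            kept.append((c_i, e_i))   # these centroids are re-clustered globally
    return kept
```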
4
Computational Experiment Results
To validate the proposed approach to the clustering problem, several benchmark instances have been solved.
Table 1. Comparison of the results - JABAT A-Team versus k-means algorithm (value of the square error of the cluster dispersion)

k  | Ruspini           | Iris          | Heart         | Credit        | ImgSeg
   | k-means   JABAT   | k-means JABAT | k-means JABAT | k-means JABAT | k-means JABAT
2  | 89337.8   89337.8 | 152.4   152.4 | 343.0   338.2 | 548.8   528.6 | 2049.7  2058.5
3  | 51063.5   51155.4 | 78.9    74.4  | 316.7   311.2 | 480.9   468.7 | 1552.6  1490.6
4  | 12881.1   12881.1 | 57.3    52.4  | 297.4   287.5 | 449.0   420.8 | 1500.6  1073.8
5  | 12138.3   11425.2 | 49.7    46.8  | 275.4   245.4 | 366.5   326.1 | 1022.6  947.8
6  | 9379.6    8530.8  | 41.8    33.9  | 256.0   226.0 | 333.8   313.9 | 971     857.5
7  | 7509.4    5022.0  | 31.6    25.4  | 230.9   198.7 | 302.6   278.5 | 859.9   773.9
8  | 7189.5    5022.0  | 31.6    25.4  | 230.9   198.7 | 302.6   278.5 | 765.6   727
9  | 6469.7    5194.0  | 29.3    28.1  | 201.2   186.2 | 296.0   275.8 | 763.3   688.1
10 | 5990.1    4813.5  | 28.0    23.2  | 185.1   175.2 | 289.0   268.9 | 736.6   662.1
The proposed JABAT-based architecture has been used to solve four instances of the five popular benchmark datasets, including the Ruspini (75 objects with 2 attributes; the natural dataset originates from 4 clusters), Iris (150, 4, 3), Credit approval (690, 15, 2), Cleveland heart disease (303, 13, 2) and Image segmentation (ImgSeg) (2310, 19, 7) problems [16],[1]. The results obtained by JABAT have been compared with the solutions produced by the k-means algorithm. In Table 1, values of the objective function for the non-distributed case and for different numbers of clusters are shown. It can be seen that the proposed agent-based approach outperforms the k-means algorithm. A two-sample paired t-test for means shows that, when comparing the results obtained by both analysed algorithms for each of the analysed instances, the hypothesis that the difference between the means is equal to 0 should be rejected. In Tables 2 and 3 the results for the non-distributed and distributed cases produced by JABAT and the k-means algorithm are, respectively, shown.

Table 2. The clustering results as produced by the JABAT A-Team

                              | Ruspini | Iris | Heart | Credit | ImgSeg
Non-distributed
  Clustering accuracy         | 100%    | 100% | 99%   | 90%    | 100%
  Number of clusters produced | 4       | 3    | 2     | 2      | 7
  SE                          | 12881.1 | 74.4 | 338.2 | 528.6  | 773.9
Distributed
  Clustering accuracy         | 91%     | 83%  | 74%   | 82%    | 78%
  Number of clusters produced | 4       | 3    | 3     | 2      | 7
  SE                          | 29401.3 | 77.9 | 373.0 | 586.3  | 906.2
Table 3. The clustering results as produced by the k-means algorithm

                              | Ruspini  | Iris   | Heart   | Credit | ImgSeg
Non-distributed
  Clustering accuracy         | 100%     | 89%    | 93%     | 94%    | 82%
  Number of clusters produced | 4        | 3      | 2       | 2      | 7
  SE                          | 12881.1  | 78.9   | 343.0   | 548.8  | 859.9
Distributed
  Clustering accuracy         | 90%      | 71%    | 73%     | 78%    | 71%
  Number of clusters produced | 4        | 3      | 3       | 2      | 7
  SE                          | 36164.56 | 142.65 | 461.187 | 802.62 | 1131.4

The distributed clustering instances have been produced by a random partition of each of the non-distributed instances into two datasets representing two locations. Clustering accuracy has been calculated through comparison with a known solution for the benchmark instance. Such a criterion was proposed in [8]. The results, shown in Tables 1, 2 and 3, are the best results selected from 10 runs of each compared algorithm.
5
Conclusion
The main contribution of the paper is seen as proposing and implementing an agent-based architecture which is scalable, interoperable, distributed and parallel, and which can produce good quality results when solving non-distributed and distributed instances of the clustering problem. The approach has been validated experimentally through comparison with the reference k-means algorithm. Agent-based clustering seems to be well suited to dealing with distributed clustering problems. Migration of JABAT agents is an easy and natural process and the local level clustering can easily be carried out on the sites where the data are stored. This would eliminate the necessity of transferring data between sites and help in more effective use of the available computational resources. Future research will focus on improving the clustering quality at the global level and on designing a user-friendly interface to the system.
References 1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2007), http://www.ics.uci.edu/~ mlearn/MLRepository.html 2. Aydin, M.E., Fogarty, T.C.: Teams of autonomous agents for job-shop scheduling problems: An Experimental Study. Journal of Intelligent Manufacturing 15(4), 455– 462 (2004)
3. Barbucha, D., Czarnowski, I., J¸edrzejowicz, P., Ratajczak-Ropel, E., Wierzbowska, I.: An Implementation of the JADE-base A-Team Environment. International Transactions on Systems Science and Applications 3(4), 319–328 (2008) 4. Bellifemine, F., Caire, G., Poggi, A., Rimassa, G.: JADE. A White Paper, Exp. 3(3), 6–20 (2003) 5. Chan, P.K., Fan, W., Prodromidis, A., Stolfo, S.J.: Distributed Data Mining in Credit Card Fraud Detection. IEEE Intelligent Systems 1094, 67–74 (1999) 6. Dimitriadou, E., Weingessel, A., Hornik, K.: A Cluster Ensembles Framework. In: Proceedings of the third International conference on Hybrid Intelligent Systems (HIS 2003), pp. 528–534 (2003) 7. Glover, F.: Tabu Search - Part I. ORSA Journal of Computing 1, 190–206 (1990) 8. Januzaj, E., Kriegel, H.P., Pfeifle, M.: Towards Effective and Efficient Distributed Clustering. In: Proceedings of International Workshop on Clustering Large Data Sets, 3rd International Conference on Data Mining (ICDM), pp. 49–58 (2003) 9. Kargupta, H., Park, B.H., Hershberger, D., Johnson, E.: Collective Data Mining: A New Perspective Toward Distributed Data Analysis. In: Kargupta, H., Chan, P. (eds.) Accepted in The Advances in Distributed Data Mining. AAAI/MIT Press (1999) 10. Leeser, M., Theiler, J., Estlick, M., Szymanski, J.J.: Design tradeoffs in a hardware implementation of the k-means clustering algorithm. In: Sensor Array and Multichannel Signal Processing Workshop, Proceedings of the IEEE, pp. 520–524 (2000) 11. Lerman, K.: Design and Mathematical Analysis of Agent-Based Systems. In: Rash, J.L., et al. (eds.) FAABS 2000. LNCS (LNAI), vol. 1871, pp. 222–234. Springer, Heidelberg (2001) 12. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, vol. 1, pp. 281–297. University of California Press (1967) 13. Marinescu, D.C., Boloni, L.: A component-based architecture for problem solving environments. Mathematics and Computers in Simulation 54, 279–293 (2000) 14. Parunak, H.V.D.: Agents in Overalls: Experiences and Issues in the Development and Deployment of Industrial Agent-Based Systems. International Journal of Cooperative Information Systems 9(3), 209–228 (2000) 15. Prodromidis, A., Chan, P.K., Stolfo, S.J.: Meta-learning in Distributed Data Mining Systems: Issues and Approaches. In: Kargupta, H., Chan, P. (eds.) Book on Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press (2000) 16. Ruspini, E.H.: Numerical method for fuzzy clustering. Inform. Sci. 2(3), 19–150 (1970) 17. Ahang, S., Wu, X., Zhang, C.: Multi-Database Mining. IEEE Computational Intelligence Bulletin 2(1) (2003) 18. Strehl, A., Ghosh, J.: Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal on Machine Learning Research (JMLR) 3, 583–617 (2002) 19. Stoffel, K., Belkoniene, A.: Parallel k/h-means Clustering for Large Data Sets. In: Proceedings of EuroPar. (1999) 20. Struyf, A., Hubert, M., Rousseeuw, P.J.: Clustering in Object-Oriented Environment. Journal of Statistical Software 1(4), 1–30 (1996)
21. Talukdar, S., Baerentzen, L., Gove, A., de Souza, P.: Asynchronous Teams: Cooperation Schemes for Autonomous, Computer-Based Agents, Technical Report EDRC 18-59-96, Carnegie Mellon University, Pittsburgh (1996) 22. Zhang, X.-F., Lam, C.-M., Cheung, W.K.: Mining Local Data Sources For Learning Global Cluster Model Via Local Model Exchange. IEEE Intelligence Informatics Bulletine 4(2) (2004) 23. Haixun, W., Wei, W., Jiong, Y., Yu, P.S.: Clustering by Pattern Similarity in Large Data Sets. In: The ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, USA (2002)
An Evidence Accumulation Approach to Constrained Clustering Combination

Daniel Duarte Abdala and Xiaoyi Jiang
Department of Mathematics and Computer Science, University of Münster, Münster, Germany
{abdalad, xjiang}@uni-muenster.de
Abstract. Constrained clustering has received substantial attention recently. This framework proposes to support the clustering process by prior knowledge in terms of constraints (on data items, cluster size, etc.). In this work we introduce clustering combination into the constrained clustering framework. It is argued that even if all clusterings of an ensemble satisfy the constraints, there is still a need of carefully considering the constraints in the combination method in order to avoid a violation in the final combined clustering. We propose an evidence accumulation approach for this purpose, which is quantitatively compared with constrained algorithms and unconstrained combination methods.
1
Introduction
Recently, new efforts have been made to improve clustering results. Constrained clustering [4] takes advantage of known information about the data set to aid the clustering process. Partially labeled data, expected maximum and minimum cluster sizes, and pairwise linkage relationships between patterns are examples of information used by a constrained clustering algorithm. Clustering combination [2,9,14], on the other hand, combines multiple clusterings into a new, more reliable solution. In this paper we introduce the clustering combination approach to the domain of constrained clustering. Given an ensemble of clustering results which satisfy all constraints, we argue that the combination algorithm must explicitly consider the same constraints as well. Otherwise, the constraints may be violated in the final clustering result after the combination. Based on this consideration we extend the clustering combination method [9] to handle constraints, thus proposing a complete chain of constrained clustering combination. The remainder of this paper is organized as follows. In the next section we discuss some recent work on constrained clustering and introduce the main concepts which are needed to understand our work. Section 3 reviews clustering combination and addresses details of the method proposed in [9], which builds the basis for our work. Thereafter, we present a motivational example where constraint violation actually happens because of no consideration of constraints
in the combination step (Section 4). Section 5 presents the proposed model to handle constrained clustering combination. Finally, Section 6 presents an experimental evaluation of our approach. Some conclusions and future developments finalize our work.

Daniel D. Abdala thanks the CNPq, Brazil-Brasilia, for granting him a Ph.D. scholarship under the process number 290101-2006-9.
2
Constrained Clustering
Constrained clustering [4] uses side information to aid the clustering process. There are a number of ways to constrain a clustering procedure. Perhaps the three most investigated methods refer to: (a) cardinality, (b) linkage, and (c) topological constraints. Cardinality constraints refer to methods that limit the maximum and minimum number of patterns assigned to each cluster. For instance, a constrained version of k-means is presented in [5], which specifies the minimum number of patterns assigned to each cluster. Linkage, or instance-based, methods refer to constraints that restrict the way patterns can be collocated in the same cluster. There are basically two types of constraints: (i) must-link (ML), where a pair of patterns must be placed into the same cluster, and (ii) cannot-link (CL), where they must be in different clusters. In [3] a complete graph, labeled with + (ML constraint) and - (CL constraint), was used to derive an algorithm to maximize the number of agreements, as well as the complementary minimization version. The number of clusters is required to be specified a priori in order to improve the algorithm's performance. Error bounds for the problem of correlation clustering were investigated in [12] based on the method proposed in [3]. In [15], an updated version of k-means is presented where CL and ML constraints are incorporated by the algorithm. This is a suitable method for dealing with large data sets, since the consideration of constraints does not necessarily increase the algorithm's computational time. An extension of hierarchical clustering was investigated in [7,8]. It was proved that a complete dendrogram cannot be generated in all cases when constraints are considered. Finally, graph-based approaches were investigated in [13]. A distance transformation is executed by a shortest path computation over the constrained graph. Although also an extension of k-means, this work presents an elegant solution to the problem of the hyperspherical shape limitation by computing subcenters driven by the ML-CL constraints. The last method to constrain clustering procedures refers to topological constraints. It dictates that the minimum/maximum distance between patterns must not be violated in order to allow them to be collocated in the same cluster. In [6] constrained k-means is further generalized to accommodate two new kinds of constraints, δ and ε. The ε-constraint enforces that a pair of patterns can be classified into the same cluster only if the distance between them is at most ε. Similarly, the δ-constraint enforces that two patterns found in different clusters must be separated by a distance of at least δ. One can regard such constraints as a generalization of ML and CL.
3
Clustering Combination
Clustering combination relies on the idea that evidence gathered by a number of partitions can be combined to produce a final, more precise result. Combination methods receive as input an ensemble of clusterings that can be generated by different clustering methods. In addition they are also generated using random initialization and different numbers of target clusters. A popular class of clustering combination methods are median-based, solving the median partition (MP) problem [16]. It can be formally stated as follows: given M partitions m_1, ..., m_M and d(., .), which is a symmetric distance measure between clusterings, try to find m* such that:

    m^* = \arg\min_{m} \sum_{i=1}^{M} d(m_i, m)    (1)
This problem is known to be NP-complete [16], directing the research towards developing heuristics to approximate it. Among the relevant works in this area, [10] presents a collection of six heuristics. In [14], the problem of clustering combination was investigated using a probabilistic model of consensus and finite mixtures. It generates a final partition using the EM (Expectation Maximization) algorithm. The authors also propose a consensus function based on mutual information. Finally, in [9] the authors explore the idea of evidence accumulation by combining the clustering results of M k-means runs into a co-association matrix. This matrix is later used as a new similarity measure for a standard agglomerative hierarchical clustering algorithm. In this paper, we focus our attention on investigating the need of considering constraints in clustering combination. More specifically, we are interested in the following question: "If all clusterings of an ensemble satisfy the constraints, is there still a need to consider them in the combination step?" Our answer is definitively positive. We will demonstrate this point later and propose an approach to dealing with constraints in clustering combination.
3.1
Evidence Accumulation Based Clustering Combination
The remainder of this section discusses the three main steps of the method proposed in [9], which is fundamental to our constrained clustering combination approach. Ensemble Generation. The first step refers to the creation of an ensemble of clustering results. A total of M clusterings is produced using the k-means algorithm with random initialization. It is important to note that any clustering algorithm can be used to generate the ensemble, provided slightly different clusterings are produced. The clusterings can also be performed with variable number of target clusters. Evidence Accumulation. The underlying assumption is based on the fact that patterns belonging to the same ”natural” cluster are very likely to be collocated
in the same cluster among different clusterings. A co-occurrence matrix is computed by counting how many times pairs of patterns occur in the same cluster among different clusterings. We normalize it in relation to the number M of clusterings produced. A co-association matrix C of size N × N with values ranging from 0 (meaning no association) to 1 (maximum association) is then generated:

    C(i, j) = \frac{m_{i,j}}{M}    (2)
where mi,j refers to how many times the pair (i, j) of patterns occurs in the same cluster among the M clusterings and N is the number of patterns in the data set. By combining the results gathered in the ensemble, a new similarity measure C is created. In fact, a nonlinear transformation of the original feature space is performed. Combination Step. An agglomerative hierarchical clustering algorithm was chosen to be used in the combination step. It takes the co-association matrix as input and produces the final clustering result. In addition to hierarchical methods, there exist other algorithms, which cluster patterns based on a similarity matrix, for instance, spectral clustering. All such clustering methods could be applied for the combination step.
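A minimal sketch of this evidence accumulation step, assuming the ensemble is given as a list of label vectors (one per clustering) over the same N patterns; the function name is illustrative, not the authors' implementation.

```python
import numpy as np

def co_association(ensemble):
    # ensemble: list of M label vectors of length N
    ensemble = [np.asarray(labels) for labels in ensemble]
    N, M = len(ensemble[0]), len(ensemble)
    C = np.zeros((N, N))
    for labels in ensemble:
        C += (labels[:, None] == labels[None, :])   # 1 where a pair shares a cluster
    return C / M                                    # C(i, j) = m_ij / M, formula (2)
```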
4
Motivational Example
Fig. 1. Misclassification due to non-consideration of constraints in the combination step

In order to understand the reasons why constraints need to be considered in the combination step, we devised a didactic example to show that they can actually
be violated. It follows the same steps described above, except that a constrained version of k-means [15] was used to generate the ensemble. We will show that even if all clusterings of an ensemble satisfy the constraints, there is still need of carefully considering the constraints in the combination step in order to avoid violations in the final combined clustering. Consider the example presented in Figure 1. Here, we have used the original data set (A) consisting of six patterns to produce four clustering results (B; number K of clusters = 2). The columns represent different clusterings while the rows index the patterns. Each table cell contains a label assigned to that pattern by the clustering algorithm. All clusterings of this ensemble satisfy the two ML constraints (between patterns 1 and 2, and 5 and 6, respectively) and the CL constraint (between patterns 1 and 5). In (C), we have used the ensemble to compute the co-association matrix. The dendrogram produced by the standard single-link (SL) algorithm is shown in (D). Finally, (E) exemplifies an erroneous combination result due to the non-consideration of constraints in the combination process. If we inspect the dendrogram, it is easy to see that both ML constraints are satisfied. When merging the clusters {1, 2, 3} and {5, 6} in the dendrogram, the resultant cluster clearly violates the CL constraint between patterns 1 and 5.
5 Proposed Model
The simple example above motivates our research in finding ways to avoid such violations. We propose to extend the combination approach to accommodate constraints.
Fig. 2. Proposed model for constrained clustering combination
The proposed model is depicted in Figure 2. It receives as input the number K of desired clusters, the number M of clustering results in the ensemble, the original data set and the sets of must-link and cannot-link constraints. It starts
with computing M clustering results by running a constrained clustering algorithm. It is required that the chosen algorithm generates different results, e.g., by random initialization. Once the co-association matrix is available, another constrained clustering algorithm is used to produce the final combined result.

5.1 Ensemble Generation
We have applied a constrained version of k-means to produce the ensemble. A fixed or variable number of target clusters can be used. The constrained version of k-means introduced in [15] is shown in Table 1. From this point on, this algorithm will be referenced as cop-KM. The main change to the original version can be found at line 2. A conditional statement defines that a pattern can be assigned to the closest cluster if and only if no constraint is violated; see [6] for details of the constraint violation test. If the closest cluster cannot be chosen as target cluster, the algorithm proceeds by checking all remaining clusters until none is left. If no allowable cluster is found, it returns an empty label for that pattern.

Table 1. Constrained k-means algorithm cop-KM

cop-KM (D, ML, CL, K)
D : Data set
ML: Set of must-link constraints
CL: Set of cannot-link constraints
K : number of clusters
1. randomly initialize the cluster centers C_1, . . . , C_K
2. ∀ d_i ∈ D assign d_i to the closest cluster center C_j if no constraint is violated
3. ∀ C_j compute the new cluster center by averaging all d_i ∈ C_j
4. iterate over (2–3) until the clustering converges
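As a concrete reference, a cop-KM run along the lines of Table 1 could look as follows; this is our own illustrative sketch (the function names and tie-handling details are assumptions, not taken from [15]).

```python
import numpy as np

def violates(i, cluster, assign, ML, CL):
    """True if putting pattern i into `cluster` breaks a constraint,
    given the (partial) assignment computed so far; -1 means unassigned."""
    for a, b in ML:
        j = b if a == i else a if b == i else None
        if j is not None and assign[j] not in (-1, cluster):
            return True
    for a, b in CL:
        j = b if a == i else a if b == i else None
        if j is not None and assign[j] == cluster:
            return True
    return False

def cop_kmeans(X, k, ML, CL, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.full(len(X), -1)
        for i, x in enumerate(X):
            # try clusters from closest to farthest until none is left
            for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
                if not violates(i, c, assign, ML, CL):
                    assign[i] = c
                    break
        new_centers = np.array([X[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers       # assign[i] == -1 means "empty label"
```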
5.2 Clustering Combination
Given the ensemble of constrained clusterings, we apply the clustering combination method from [9] (see Section 3.1) to produce a combined clustering result. In doing so, we need to make one extension: the agglomerative clustering for the combination step must be adapted to work with constraints. For the combination step, we have chosen a constrained version of the hierarchical single-link algorithm. Table 2 describes the constrained agglomerative (single-link) algorithm adapted from [8].

Table 2. Constrained agglomerative (single-link) algorithm cop-SL

cop-SL (SIM, ML, CL)
repeat steps 1–3
1. find the minimum entry (x_l, x_m) in SIM
2. check if merging x_l and x_m will violate CL
   if yes, find the next minimum entry in SIM
   if no, Merge(x_l, x_m) and update SIM
3. if there are no more mergeable clusters, return the dendrogram
The algorithm expects as input a similarity matrix (SIM) and two sets of ML and CL constraints. It starts with all patterns assigned to singleton clusters, namely {C_1, . . . , C_N}. The only change to the original single-link algorithm refers to the test for mergeable clusters, meaning that the algorithm must stop if no mergeable clusters are found. In this case, the final state reached by the dendrogram also represents the minimum obtainable solution in the presence of constraints. In [8] a check is presented to verify the minimum number of clusters obtainable in the presence of CL constraints. The first line finds the minimum entry in SIM and merges the clusters it refers to if no CL constraint is violated; otherwise, it selects the next minimum entry in SIM. Since the must-link constraints are known to be transitive, it is also possible to compute the transitive closure of the ML set as a pre-processing step, as done in [8]. In our case this is not necessary: since the computation of the co-association matrix ensures maximum similarity (i.e., 1) for any pair of ML-constrained patterns, these patterns will be merged during the initial iterations of cop-SL. This is the reason why algorithm cop-SL has no explicit handling for must-links.
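A minimal sketch of the cop-SL idea from Table 2 is given below (our own code, not the authors'); it greedily merges the most similar cluster pair on the co-association matrix, skips merges that would violate a cannot-link constraint, and stops when no mergeable pair remains.

```python
import numpy as np

def cop_single_link(C, CL, k_final):
    """C: co-association (similarity) matrix, CL: cannot-link pairs.
    Stops at k_final clusters or when no mergeable pair is left."""
    n = C.shape[0]
    clusters = [{i} for i in range(n)]
    sim = C.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)

    def cl_conflict(a, b):
        return any((i in clusters[a] and j in clusters[b]) or
                   (j in clusters[a] and i in clusters[b]) for i, j in CL)

    while len(clusters) > k_final:
        # candidate merges ordered by decreasing single-link similarity
        order = np.dstack(np.unravel_index(
            np.argsort(sim, axis=None)[::-1], sim.shape))[0]
        merged = False
        for a, b in order:
            if a >= b or sim[a, b] == -np.inf:
                continue
            if not cl_conflict(a, b):
                clusters[a] |= clusters[b]
                del clusters[b]
                # single link: similarity of the union is the pairwise maximum
                sim[a, :] = np.maximum(sim[a, :], sim[b, :])
                sim[:, a] = sim[a, :]
                sim = np.delete(np.delete(sim, b, axis=0), b, axis=1)
                np.fill_diagonal(sim, -np.inf)
                merged = True
                break
        if not merged:        # no mergeable clusters: return the current state
            break
    labels = np.empty(n, dtype=int)
    for lbl, members in enumerate(clusters):
        labels[list(members)] = lbl
    return labels
```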
6 Experimental Results

6.1 Test Data
We have evaluated our method using 4 synthetic (depicted in Figure 3) and 3 well-known UCI data sets (iris, wine and breast-cancer) [1]. Half-rings is a classical problem. Our data set is composed of 269 patterns distributed in 2 classes (160-class 1; 109-class 2). Two-rings is known to be a hard problem for hyperspherical algorithms such as k-means. Our data set has 326 patterns (165-class 1; 161-class 2). C-ellipse is a data set where a C-shaped cluster embraces another elliptic cluster. It contains 225 patterns (85-class 1; 140-class 2). Finally, the scattered data set contains 132 patterns (38-class 1; 94-class 2). The UCI data sets are: (I) Iris - 150 patterns and 4 attributes, 50 in each of three classes; (II) Breast cancer - 683 patterns and 9 attributes. There are two classes, one with 444 and the second with 239 patterns; and (III) Wine - 178 patterns distributed in 3 classes, containing 59, 71, and 58 patterns. Constraints were generated randomly using the available ground truth. We chose two patterns at random. If they share the same label, a ML-constraint was generated and placed into the ML set. A CL-constraint was generated otherwise. Constraint sets with 5, 10, 20, 50, 100, and 200 constraints, respectively, were generated for each data set. Previously generated constraints are retained in order to allow an incremental structure of the constraint sets. A fixed number of target clusters, namely the number specified by the ground truth, was used during all tests.
6.2 Constraint Violation Test
In Section 4 we presented a didactic example showing that constraints can be violated in the combination step if they are not properly considered.
Fig. 3. Four synthetic data sets: (A) half-rings; (B) C-ellipse; (C) two-rings; and (D) scattered
We have conducted experiments to demonstrate that the same also happens with real data. Eight ensembles were generated for each data set using the incremental constraint sets, each containing 50 clusterings of the original data set. Afterwards, each ensemble was combined by the standard single-link algorithm as in [9], without considering the constraints at all. We checked how many instances of constraint violations occurred in the combined clustering. The result is shown in Figure 4 for the synthetic (A) and UCI data sets (B). It clearly demonstrates that constraints are violated if not considered properly in the combination step, confirming the need for careful handling of this a priori information.
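Counting such violations on a combined clustering is straightforward; a possible implementation (our own, hypothetical helper) is:

```python
def count_violations(labels, ML, CL):
    """Count must-link pairs split across clusters and cannot-link
    pairs placed in the same cluster (labels: pattern -> cluster id)."""
    ml_viol = sum(1 for i, j in ML if labels[i] != labels[j])
    cl_viol = sum(1 for i, j in CL if labels[i] == labels[j])
    return ml_viol + cl_viol
```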
6.3 Experimental Validation
The error ratio was used as evaluation measure. It is computed by matching the available ground truth information with the obtained clustering result. All permutations of labels were tested and we retained the maximum correspondence (a code sketch of this matching is given after the following list). All results are presented as percentage values. In particular, for the algorithms KM and cop-KM a tuple (mean, standard deviation) is shown. Four groups of experiments have been performed:
– No constraints, no combination. In this group we executed the standard k-means and single-link algorithms. To ensure a smooth result for k-means, 50 runs were done and the average error ratio was reported.
Fig. 4. Number of violations due to non-consideration of constraints in the combination step: Synthetic data sets (top); UCI data sets (bottom)

Table 3. Error ratios (in percentage) for (unconstrained) KM: k-means; SL: Single-Link; cop-KM: constrained k-means; cop-SL: constrained Single-Link; EAC-SL: Evidence Accumulation Single-Link; and cop-EAC-SL: constrained Evidence Accumulation Single-Link

                 Group 1               Group 2               Group 3   Group 4
Data Set         KM           SL       cop-KM       cop-SL   EAC-SL    cop-EAC-SL
half-rings       13.9/2.07    23.7     16.14/2.01   10.2     0         0
two-rings        47.5/0.18    0        47.2/1.30    0        46.0      43.25
c-ellipse        27.3/0.22    0        25.9/2.37    0        27.6      18.4
scattered        45.3/1.79    46.21    44.3/6.10    29.6     43.18     11.1
Iris             18.1/13.40   33.0     17.5/7.35    31.3     25.3      10.8
Breast Cancer    4.0/0.0      44.9     4.2/0.28     34.7     42.0      2.2
Wine             31.3/5.97    57.3     32.2/3.17    57.3     29.8      21.3
– Constraints, no combination. This group presents the results for the constrained versions of k-means and single-link. In this test, cop-KM was also averaged from 50 runs for the same reason.
– No constraints, combination. This group summarizes the results of the EAC-SL algorithm [9]. EAC-SL stands for Evidence Accumulation Single-Link algorithm.
– Constraints, combination. Our approach cop-EAC-SL described in Section 5.
The results are reported in Table 3. All constrained algorithms were executed using a set of 20 constraints.
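For reference, the error ratio by label-permutation matching described above can be computed as in the following sketch (our own code; it assumes the number of predicted clusters does not exceed the number of ground-truth classes).

```python
from itertools import permutations
import numpy as np

def error_ratio(pred, truth):
    """Percentage of misassigned patterns under the best one-to-one
    relabeling of the predicted clusters (brute force over permutations,
    which is fine for the small cluster counts used here)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    pred_ids, truth_ids = np.unique(pred), np.unique(truth)
    best_correct = 0
    for perm in permutations(truth_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = np.sum([mapping[p] == t for p, t in zip(pred, truth)])
        best_correct = max(best_correct, correct)
    return 100.0 * (1.0 - best_correct / len(truth))
```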
We can see that our method (cop-EAC-SL) achieved smaller error ratios for all data sets except for the two-rings and C-ellipse data sets. These two cases occur because some patterns from different clusters are closer to each other than the patterns actually belonging to each individual cluster. This problem can easily be circumvented by placing CL constraints between those patterns. An additional test was made, where 5 CL constraints were inserted between those closest patterns, and cop-EAC-SL then performed with a 0% error rate. For the scattered data set the overall performance of KM and SL was expected to be low. The two clusters cannot be linearly separated and they are also not connected. EAC-SL does not perform well in this case because there is little accumulated evidence, leading to a merging of separated sub-clusters. This can, however, be improved by using a larger number of target clusters. For all UCI, half-rings and scattered data sets, cop-EAC-SL outperforms the constrained cop-SL and cop-KM. This fact is an indication that, even in the presence of constraints, it is possible to take advantage of different clusterings to gather additional information about the underlying structure of the data and to improve the final clustering via a combination procedure. However, two-rings and C-ellipse were better classified using cop-SL alone. Since both data sets are not linearly separable but exhibit a strong connectedness component, it is natural for a hierarchical method to perform well. The poorer performance of cop-EAC-SL occurs because it uses the ensemble of cop-KM runs to create its co-association matrix. Due to the little evidence gathered among the clusterings produced by cop-KM, the transformation of the feature space into a co-association matrix accumulates too little evidence to guide the combination step.
7 Conclusion and Future Work
In this work we have introduced clustering combination into the constrained clustering framework. It has been demonstrated that constraints have to be treated explicitly in the combination step; otherwise, there is a danger of constraint violations in the final combined clustering. We have presented an extension of the evidence accumulation method proposed in [9] to solve the constrained clustering combination problem. Experimental results have been reported showing the improvement made possible by our combination scheme. Despite the promising experimental results, there is still much room for further improvement of our approach. The general framework can be realized by many other variants. Comparative studies will be conducted in future work. In addition, we will investigate other established constrained clustering methods for their potential for combination.
References

1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
2. Ayad, H.G., Kamel, M.S.: Cumulative Voting Consensus Method for Partitions with Variable Number of Clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 160–173 (2007)
3. Bansal, N., Blum, A., Chawla, S.: Correlation Clustering. Machine Learning, 89–113 (2004)
4. Basu, S., Davidson, I., Wagstaff, K.L.: Constrained Clustering - Advances in Algorithms, Theory, and Applications. In: Kumar, V. (ed.). Chapman & Hall / CRC Press (2009)
5. Bennett, K.P., Bradley, P.S., Demiriz, A.: Constrained K-Means Clustering. Technical Report, Microsoft Corp. (2000)
6. Davidson, I., Ravi, S.S.: Clustering With Constraints: Feasibility Issues and the k-Means Algorithm. In: Proceedings of the 5th SIAM International Conference on Data Mining, pp. 138–149 (2005)
7. Davidson, I., Ravi, S.S.: Towards Efficient and Improved Hierarchical Clustering with Instance and Cluster Level Constraints. Technical Report, University of Albany (2005)
8. Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS, vol. 3721, pp. 59–70. Springer, Heidelberg (2005)
9. Fred, A.L.N., Jain, A.K.: Combining Multiple Clusterings Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 835–850 (2005)
10. Goder, A., Filkov, V.: Consensus Clustering Algorithms: Comparison and Refinement. In: Proceedings of the 9th Workshop on Algorithm Engineering and Experiments, pp. 109–117 (2008)
11. Jain, A.K.: Data Clustering: User's Dilemma. In: Proceedings of the 5th Int. Conference on Machine Learning and Data Mining in Pattern Recognition, Leipzig, p. 1 (2007)
12. Joachims, T., Hopcroft, J.: Error Bounds for Correlation Clustering. In: Proc. 22nd Int. Conf. on Machine Learning, Bonn, Germany (2005)
13. Rothaus, K., Jiang, X.: Constrained Clustering by a Novel Graph-Based Distance Transformation. In: 19th International Conference on Pattern Recognition, Tampa (2008)
14. Topchy, A., Jain, A.K., Punch, W.: Clustering Ensembles: Models of Consensus and Weak Partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1866–1881 (2005)
15. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained K-Means Clustering with Background Knowledge. In: Proceedings of the 18th Conference on Machine Learning, pp. 577–584 (2001)
16. Filkov, V., Skiena, S.: Integrating Microarray Data by Consensus Clustering. International Journal on Artificial Intelligence Tools 13, 863–880 (2004)
Fast Spectral Clustering with Random Projection and Sampling

Tomoya Sakai and Atsushi Imiya
Institute of Media and Information Technology, Chiba University, Japan
{tsakai,imiya}@faculty.chiba-u.jp
Abstract. This paper proposes a fast spectral clustering method for large-scale data. In the present method, random projection and random sampling techniques are adopted for reducing the data dimensionality and cardinality. The computation time of the present method is quasilinear with respect to the data cardinality. The clustering result can be updated with a small computational cost when data samples or random samples are appended or removed.
1 Introduction
We introduce a fast spectral clustering method accelerated by random projection and random sampling of data. Spectral clustering [1–4] is an advanced technique in machine learning and data mining based on graph spectral analysis. Spectral clustering can stably detect non-convex patterns and linearly non-separable clusters. Reduction in the computational cost of spectral clustering contributes to practical applications on large-scale data. Fast computation schemes for spectral clustering have been proposed by different authors [5–9]. They focus on the eigenvector computation of a graph Laplacian defined by a matrix of data similarities. The Krylov subspace methods, e.g., the Lanczos method [10], are iterative algorithms for finding leading eigencomponents of a sparse matrix, and are used in spectral clustering [7–9]. Nyström approximation [5, 6, 11] is another approach to fast computation. Although these techniques are practical, it would be preferable to accelerate the clustering algorithm not only by elaborating fast eigenvalue decomposition and matrix approximation but also by revising the clustering scheme itself. In this paper, we first review the derivation of spectral clustering. We also introduce the random projection for reducing data dimensionality. Second, we present a fast spectral clustering method. Our method consists of fast computation of similarities using random projection, and estimation of eigenvectors by the singular value decomposition from random samples. Additionally, we present an algorithm for updating clusters when we want to append random samples or new data samples. Finally, we demonstrate our method to show its cost effectiveness.
2 Preliminaries

2.1 Spectral Clustering
For a given dataset P = {p_1 , . . . , p_n ∈ R^d }, spectral clustering [2, 4, 12] finds a set of data clusters, {C_1 , . . . , C_k ⊂ P }, on the basis of spectral analysis of a similarity graph. Suppose a similarity graph G, whose vertices correspond to the data points of P , and the edge between p_i ∈ P and p_j ∈ P is weighted by the similarity w_ij = w_ji ≥ 0 between them. The weight matrix W := [w_ij] ∈ R^{n×n} is also called the affinity matrix or the similarity matrix. Cutting edges of G, we obtain disjoint subgraphs of G as the clusters of P . Such a graph partitioning scheme has been developed by many authors [1–4, 13–16]. The resulting clusters should be as compact and isolated as possible. The graph partitioning for data clustering can be interpreted as a minimisation problem of an objective function, in which the compactness and isolation are quantified by the subset sums of edge weights. The ratio cut (Rcut) [1], normalised cut (Ncut) [2] and min-max cut (Mcut) [3] are common objective functions. They are expressed as

Rcut(C_1, . . . , C_k) := Σ_{l=1}^{k} cut(C_l, P\C_l) / card C_l,    (1)

Ncut(C_1, . . . , C_k) := Σ_{l=1}^{k} cut(C_l, P\C_l) / cut(C_l, P),    (2)

and

Mcut(C_1, . . . , C_k) := Σ_{l=1}^{k} cut(C_l, P\C_l) / cut(C_l, C_l).    (3)
Here, cut(X, Y ) is the sum of the edge weights between ∀p ∈ X and ∀p ∈ Y . P\C_l is the complement of C_l ⊂ P , and card C_l denotes the number of points in C_l . Let h_l be an n-dimensional vector indicating the members of the cluster C_l by its binary components. The minimisation problem of any objective function in Eqs. (1), (2) and (3) can be rewritten as a trace minimisation problem under a constraint on a matrix H := [h_1 · · · h_k] [12]:

min_H tr(H^⊤ L H)   subject to   H^⊤ N^{-1} H = I.    (4)
Here, the matrix L ∈ R^{n×n} is called the graph Laplacian, defined as

L := D − W    (5)
where D is a diagonal matrix with the row sums on its diagonal:

D := diag(W 1_n) = diag( Σ_{j=1}^{n} w_{1j} , . . . , Σ_{j=1}^{n} w_{nj} ).    (6)
The matrix N depends on the definition: N := I, D and W for Rcut, Ncut and Mcut, respectively. The spectral clustering algorithms were derived from the minimisation problem in Eq. (4) by relaxing the binary constraint on h_l . The relaxed trace minimisation for H ∈ R^{n×k} is the generalised eigenvalue problem [1–3, 12–14]

LH = NHΛ.    (7)

The eigenvectors for Ncut and Mcut are identical due to this relaxation [3]. In the case of Ncut, Eq. (7) can be converted into a normal eigenvalue problem:

SZ = ZΔ    (8)

where

S = S_sym := D^{-1/2} W D^{-1/2},   Z = D^{1/2} H   and   Δ = I − Λ,    (9)

or

S = S_rw := D^{-1} W,   Z = H   and   Δ = I − Λ.    (10)
Thus, data clustering by graph-cut boils down to the eigenvalue decomposition problem of S for finding the cluster indicators h_1 , . . . , h_k . The solution is the matrix of the first k eigenvectors of S. The matrix W is normalised by D in Eqs. (9) and (10). The former expression is convenient because S_sym is symmetric. It is suggested that the latter one is related to random walk [12].

Algorithm 1. Spectral clustering (by Ng, Jordan and Weiss, 2001)
Input: dataset P = {p_1 , . . . , p_n }, the cluster number k < n, and scale σ;
Output: cluster set C = {C_1 , . . . , C_k };
1. construct the affinity matrix W ∈ R^{n×n} whose ij-entry is the similarity measured as K(p_i , p_j , σ);
2. compute the diagonal matrix D ∈ R^{n×n} whose ii-th entry is the sum of the i-th row of W;
3. form the matrix S ← D^{-1/2} W D^{-1/2};
4. compute the k leading eigenvectors Z ∈ R^{n×k} of S;
5. form the matrix X ∈ R^{n×k} by normalising each row vector of Z;
6. execute k-means clustering for the n row vectors of X;
7. assign p_i to the cluster C_l iff the i-th row vector of X is in the l-th cluster.

Algorithm 1 shows a spectral clustering method [4] based on Ncut. Normalisation at Step 5 and k-means clustering at Step 6 translate the cluster indicators of the relaxed solution into binary ones. Therefore, the essential part of the spectral clustering algorithm is the eigenvalue decomposition (EVD), which reconstructs, from the affinity matrix, a set of tight and convex clusters in k-dimensional space. The computational burden lies in the EVD of S of size n at Step 4, which requires O(n^3) if S is dense. In Algorithm 1, the similarity between p_i and p_j is usually measured by a Gaussian kernel function as

K(p_i , p_j , σ) := exp( − ||p_j − p_i||_2^2 / (2σ^2) ).    (11)
Construction of the affinity matrix W at Step 1 takes O(dn^2) flops, which is also computationally intensive if the data cardinality n or the dimensionality d is large.
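For orientation, Algorithm 1 can be written down compactly as below (our own NumPy/scikit-learn sketch, not a reference implementation); the dense n × n affinity matrix and the full eigendecomposition are exactly the two bottlenecks addressed in the remainder of the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(P, k, sigma):
    """Ng-Jordan-Weiss style spectral clustering (Algorithm 1 sketch)."""
    W = np.exp(-cdist(P, P, 'sqeuclidean') / (2.0 * sigma ** 2))   # Eq. (11)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                 # D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(S)            # ascending eigenvalues
    Z = eigvecs[:, -k:]                             # k leading eigenvectors
    X = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # row-normalise
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```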
2.2 Random Projection

Random projection is a linear transform from x ∈ R^d to x̂ ∈ R^{d̂} by a random matrix R ∈ R^{d̂×d} as

x̂ = Rx.    (12)

The entries of the random matrix, r_ij ∈ R, are independent random variable values obeying a specific distribution such as the normal distribution. The concept of random projection has emerged from the metric embeddings by Lipschitz mapping, or Lipschitz embeddings. The Lipschitz mapping is known as the mapping by a Lipschitz function f : R^d → R^{d̂} that satisfies

||f(x_i) − f(x_j)|| ≤ c ||x_i − x_j||   ∀ x_i, x_j ∈ R^d    (13)
where c is a constant independent of x_i and x_j . According to the Johnson-Lindenstrauss lemma in [17] as described below, one can find a Lipschitz function that embeds a set of points in R^d onto a low-dimensional space R^{d̂} (d̂ < d) while approximately preserving their pairwise distances.

Theorem 1 (Johnson-Lindenstrauss lemma). Given 0 < ε < 1, a set X of n points {x_1 , . . . , x_n ∈ R^d } and a number d̂ ≥ d̂_0 = O(ε^{-2} log n), there is a Lipschitz function f : R^d → R^{d̂} that satisfies

(1 − ε) ||x_j − x_i||^2 ≤ ||f(x_j) − f(x_i)||^2 ≤ (1 + ε) ||x_j − x_i||^2    (14)

for all x_i, x_j ∈ X with probability (1 − e^{−O(d̂ ε^2)}).

In fact, random projection by a matrix with i.i.d. random variables satisfies the above theorem with overwhelming probability. The best lower bound is d̂_0 = (4 + 2β)(ε^2/2 − ε^3/3)^{-1} log n for guaranteeing the condition (14) with probability at least (1 − n^{−β}) [18]. Achlioptas [19] proved that random projection by a sparse matrix, with entries r_ij = +√3/√d̂, 0, and −√3/√d̂ with probability 1/6, 2/3 and 1/6, respectively, works fast without any sacrifice in the quality of the low-dimensional embedding. For data samples represented as n points in the R^d feature space, the similarities between them are often quantified using their inner products or metric distances. Since reduction of dimensionality is of crucial importance in machine learning techniques, the performance of random projection with respect to the reduced dimensionality d̂ has been experimentally investigated in prior work [20, 21].
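As an illustration, a sparse Achlioptas-style projection matrix with the ±√3 entries mentioned above can be generated as follows (our own sketch; the target dimension uses the (4 + 2β)(ε²/2 − ε³/3)⁻¹ log n bound of [18]).

```python
import numpy as np

def achlioptas_matrix(d, eps, n, beta=1.0, seed=0):
    """Sparse random projection matrix R of shape (d_hat, d) with entries
    in {+sqrt(3), 0, -sqrt(3)} / sqrt(d_hat), probabilities 1/6, 2/3, 1/6."""
    d_hat = int(np.ceil((4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * np.log(n)))
    rng = np.random.default_rng(seed)
    entries = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                         size=(d_hat, d), p=[1 / 6, 2 / 3, 1 / 6])
    return entries / np.sqrt(d_hat)

# usage: x_hat = R @ x approximately preserves pairwise distances (Theorem 1)
```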
3 Fast Spectral Clustering

3.1 Basic Idea
We address the two major causes of high computational complexity in the spectral clustering algorithm: construction of the affinity matrix W and EVD of S.
Fig. 1. Illustration of our spectral clustering scheme. (a) Graph-cut explanation. The dataset P̂ consists of three clusters. Ten randomly chosen 'examples' are indicated by circles. The greylevel of the edges indicates the similarity (i.e., the darker the higher) between the sample and example points. (b) Size reduction of the affinity matrix. W_P̂Q̂ is composed of randomly chosen columns of the approximation of W.
Our approach is to simply reduce the dimensionality d and the cardinality n of the given dataset P . Random projection is useful for fast computation of the kernel function in Eq. (11). Furthermore, we can drastically reduce the size of the affinity matrix by random sampling of the data points. We illustrate our basic idea in Fig. 1. We first embed the n data points in a lower-dimensional space to obtain P̂ if d is large. Then we randomly choose r data samples as the examples of the data, and measure n × r similarities between the samples and examples. Note that we assume the similarities between any pairs of unchosen samples to be zero, and so we do not measure them. If the set of example points, Q̂, roughly exhibits the data clusters of P̂, one can observe in Fig. 1(a) that the samples in any cluster are strongly connected to each other by way of the examples in it. This implies that one can obtain the clusters by graph-cut in the same manner as spectral clustering, e.g., Ncut or Mcut. In spectral clustering, the clusters are obtained by the EVD of S. Recall that S is the normalised version of the affinity matrix W as in Eq. (9) or (10). The subspace spanned by the column vectors of S is wholly managed by W. We will replace W with W_P̂Q̂ ∈ R^{n×r}, which contains the r columns corresponding to the chosen examples, as shown in Fig. 1(b). It is remarkable that the subspace is approximately preserved even if the column vectors are decimated, as illustrated in Fig. 2(a) and (b). The principal components spanning this subspace can be estimated by the singular value decomposition (SVD) instead of the EVD. Therefore, we can obtain from W_P̂Q̂ the eigenvectors indicating the members of the clusters. Figure 2(c) shows that different matrices can span the same subspace as W. This suggests that we do not have to compensate for the missing entries of W, but estimate its subspace or the principal components.

3.2 Approximate Affinity Matrix
We first propose a fast algorithm for affinity matrix construction. The entries of the affinity matrix are calculated with the dataset P̂ embedded in a lower, d̂-dimensional space.
Fig. 2. Approximation of subspace. (a) Subspace spanned by the principal components of W. The dots represent the n column vectors of W. (b) Subspace spanned by the leading left singular vectors of W_P̂Q̂. (c) The same subspace as (a) but spanned by columns of another matrix.
Algorithm 2 measures the similarities between the n samples in P̂ and the r examples in Q̂ ⊂ P̂, so it computes only an n × r part of the affinity matrix W, approximately. Since the kernel function in Eq. (11) is defined by the distance between the given pair of points, Theorem 1 and related work guarantee the accuracy. After the random projection, the matrix construction costs O(nr ε^{-2} log n) time, which is independent of the dimensionality d of the original space. Our approach is very similar to a fast batch approximation [22] as a sampling technique for kernel methods, except for the partial construction of W.

Algorithm 2. Approximate affinity matrix construction by random projection
Input: datasets P = {p_1 , . . . , p_n ∈ R^d }, Q = {q_1 , . . . , q_r ∈ R^d }, scale σ > 0 and distortion ε > 0;
Output: approximate affinity matrix W_P̂Q̂ := [ŵ_ij] ∈ R^{n×r}
1. set d̂ = O(ε^{-2} log n) and create a random matrix R ∈ R^{d̂×d} which defines a linear map from R^d to R^{d̂} guaranteeing a distance distortion within ε;
2. for each pair of p_i ∈ P and q_j ∈ Q do
3.   ŵ_ij ← K(Rp_i , Rq_j , σ);
4. end for
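In NumPy terms, Algorithm 2 amounts to roughly the following sketch (our own code; a dense Gaussian random matrix is used for R here, which is one of several valid choices).

```python
import numpy as np
from scipy.spatial.distance import cdist

def approx_affinity(P, Q, sigma, eps, seed=0):
    """n x r part of the affinity matrix after random projection
    (Algorithm 2 sketch)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    n, d = P.shape
    d_hat = max(1, int(np.ceil(4.0 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))))
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(d_hat), size=(d_hat, d))
    P_hat, Q_hat = P @ R.T, Q @ R.T
    sq = cdist(P_hat, Q_hat, 'sqeuclidean')
    return np.exp(-sq / (2.0 * sigma ** 2))   # w_hat_ij = K(R p_i, R q_j, sigma)
```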
In practice, it is also possible to perform the random projection first, and then compute the n × r approximate affinity matrix.

3.3 Spectral Clustering by SVD
Analogous to the EVD in Eq. (8), we define a normalised affinity matrix as

S_P̂Q̂ := D_P̂^{-1/2} W_P̂Q̂ D_Q̂^{-1/2}.    (15)
Here, D_P̂ and D_Q̂ are diagonal matrices with the row and column sums of W_P̂Q̂ = [ŵ_ij] on the diagonals, respectively:

D_P̂ := diag(W_P̂Q̂ 1_r) = diag( Σ_{j=1}^{r} ŵ_{1j} , . . . , Σ_{j=1}^{r} ŵ_{nj} ),    (16)

D_Q̂ := diag(1_n^⊤ W_P̂Q̂) = diag( Σ_{i=1}^{n} ŵ_{i1} , . . . , Σ_{i=1}^{n} ŵ_{ir} ).    (17)
Equation (15) corresponds to Eq. (9), and S_P̂Q̂ = S_sym holds, in particular, if r = n. Furthermore, the singular vectors of the rectangular affinity matrix S_P̂Q̂ are equivalent to those of the normalised affinity matrix used in the so-called "co-clustering" [23]. In fact, our similarity graph G can be considered as a complete bipartite graph between P̂ and Q̂. The SVD of S_P̂Q̂ provides us with the relaxed cluster indicators for P̂ and Q̂ as the leading left and right singular vectors, respectively. We show a fast spectral clustering algorithm in Algorithm 3. Most parts of this algorithm are based on Algorithm 1. The major differences are in the construction of the affinity matrix W_P̂Q̂ at Step 2 and the eigendecomposition at Step 6. We note that the number of examples, r, is independent of the data cardinality n. One may set r ≥ r_0 = αk where α is a constant independent of n, because the role of the example points is to sketch the k clusters roughly. Therefore, the required memory space is reduced to O(nr) ≈ O(αkn) throughout the algorithm. The construction of the n × r matrix W_P̂Q̂ at Step 2 costs O(nr ε^{-2} log n) ≈ O(α ε^{-2} k n log n) flops, which is considerably smaller than the O(dn^2) of the naive arithmetic in Algorithm 1 for large d. At Step 6, the left singular vectors of S_P̂Q̂ are computed by SVD. The full SVD of the n × r matrix S_P̂Q̂ costs O((n + r) min^2(n, r)) ≈ O(α^2 k^2 n) time [10, 24]. This cost is very low compared to the EVD of S by the Krylov subspace-based methods for the leading k eigenvectors. Consequently, Algorithm 3 works in quasi-linear time and space for a fixed number of clusters even if we calculate all r left singular vectors. Besides, if S_P̂Q̂ is sparse, state-of-the-art SVD driver routines, e.g., [25], are helpful for further acceleration.

Algorithm 3. Fast spectral clustering
Input: dataset P = {p_1 , . . . , p_n }, the cluster number k < n, the number of examples r ≤ n and distortion ε > 0;
Output: cluster set C = {C_1 , . . . , C_k };
1. set Q to be a set of r examples randomly chosen from P ;
2. construct the approximate affinity matrix W_P̂Q̂ ∈ R^{n×r} by Algorithm 2;
3. compute the diagonal matrix D_P̂ ∈ R^{n×n} whose ii-th entry is the sum of the i-th row of W_P̂Q̂;
4. compute the diagonal matrix D_Q̂ ∈ R^{r×r} whose ii-th entry is the sum of the i-th column of W_P̂Q̂;
5. form the matrix S_P̂Q̂ ← D_P̂^{-1/2} W_P̂Q̂ D_Q̂^{-1/2};
6. compute the k leading left singular vectors Z_P̂ ∈ R^{n×k};
7. form the matrix X ∈ R^{n×k} by normalising each row vector of Z_P̂;
8. execute k-means clustering for the n row vectors of X;
9. assign p_i to the cluster C_l iff the i-th row vector of X is in the l-th cluster.
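A compact rendering of Algorithm 3 is sketched below (our own code, not the authors' implementation; for brevity, the affinity in Step 2 is computed in the original space, which is exactly where the random projection of Algorithm 2 would be inserted for high-dimensional data).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def fast_spectral_clustering(P, k, r, sigma, seed=0):
    """Algorithm 3 sketch: SVD of the n x r normalised affinity instead of
    EVD of the full n x n matrix."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, float)
    Q = P[rng.choice(len(P), size=r, replace=False)]           # Step 1: examples
    W = np.exp(-cdist(P, Q, 'sqeuclidean') / (2 * sigma ** 2)) # Step 2 (n x r)
    d_P = W.sum(axis=1)                                        # Step 3
    d_Q = W.sum(axis=0)                                        # Step 4
    S = W / np.sqrt(np.outer(d_P, d_Q))                        # Step 5, Eq. (15)
    U, _, _ = np.linalg.svd(S, full_matrices=False)            # Step 6
    Z = U[:, :k]                                    # k leading left singular vectors
    X = Z / np.linalg.norm(Z, axis=1, keepdims=True)           # Step 7
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)      # Steps 8-9
```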
Our algorithm seems similar to the Nyström approximation of a Gram matrix or affinity matrix [5, 6, 11], because they also use an n × r part of W. The Nyström methods compute r-dimensional eigenvectors of an r × r submatrix of W, and then interpolate them to estimate the n-dimensional eigenvectors using the n × r part. On the other hand, our method directly estimates the principal components from the n × r part, which is fairly simple and faster.

3.4 Updating Clusters
We can update the clusters when we want to add new samples and examples. Since the matrix S_P̂Q̂ is extended by adding rows and columns for new samples and examples, the updating procedure requires adjustment of the principal components to those of the extended matrix. If we want to revise the clusters by adding r′ examples, we can update the singular values and vectors by the column-incremental SVD (CiSVD) algorithm [26–33]. Likewise, we can add new data samples to P and update the clusters by the row-incremental SVD (RiSVD). For details of the CiSVD algorithm, see Appendix A. We present the algorithm for revising the clusters by appending r′ examples to Q in Algorithm 4. This algorithm evaluates the kernel function to construct only W_a and S_a, i.e., the column parts to append to W_P̂Q̂ and S_P̂Q̂. This costs O(nr′ ε^{-2} log n) time. The SVD components are updated by CiSVD using S_a. For the CiSVD algorithm, we refer to [29] in detail. The CiSVD takes O((n + r)r^2 + nr^2) ≈ O(α^2 k^2 n) time if r′ ≪ n. Therefore, we can update the clusters faster than clustering from scratch by Algorithm 3.

Algorithm 4. Updating clusters by increment of examples
Input: dataset P = {p_1 , . . . , p_n }, a set of r′ examples Q_a = {q_1 , . . . , q_{r′} } to add, the cluster number k, distortion ε, SVD triplet (Z_P̂ , Δ_P̂Q̂ , Z_Q̂), and diagonal matrices D_P̂ and D_Q̂;
Output: updated SVD triplet (Z_P̂^new , Δ_P̂Q̂^new , Z_Q̂^new), updated diagonal matrices D_P̂^new and D_Q̂^new, and new cluster set C = {C_1 , . . . , C_k };
1. construct the matrix W_a ∈ R^{n×r′}, whose ij-th entry is the approximate similarity between p_i ∈ P and q_j ∈ Q_a, calculated by Algorithm 2;
2. update the diagonal matrices as D_P̂^new ← D_P̂ + diag(W_a 1_{r′}) and D_Q̂^new ← [ D_Q̂  O ; O  diag(1_n^⊤ W_a) ];
3. S_a ← (D_P̂^new)^{-1/2} W_a (D_Q̂^new)^{-1/2} ∈ R^{n×r′};
4. update the SVD triplet (Z_P̂ , Δ_P̂Q̂ , Z_Q̂) to the new triplet (Z_P̂^new , Δ_P̂Q̂^new , Z_Q̂^new) by CiSVD using S_a;
5. form the matrix X ∈ R^{n×k} by normalising each row vector of Z_P̂^new;
6. execute k-means clustering for the n row vectors of X;
7. assign p_i to the cluster C_l iff the i-th row vector of X is in the l-th cluster.
Algorithm 4 also has the advantage of lower memory consumption because it does not reconstruct the affinity matrix W_P̂Q̂. The RiSVD and CiSVD are dual to each other, since updating the SVD triplet (Z_P̂ , Δ_P̂Q̂ , Z_Q̂) by appending rows to S_P̂Q̂ is equivalent to updating (Z_Q̂ , Δ_P̂Q̂^⊤ , Z_P̂) by appending columns to S_P̂Q̂^⊤. If one wants to add n′ data samples and update the clusters, exchange Z_P̂ and Z_Q̂, and call Algorithm 4 with P and Q_a being the set of r examples and the n′ new samples, respectively. One can also remove any samples and examples by downdating the SVD triplet [30] in a similar way, although we do not show the detailed algorithms in this paper.
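For readers who prefer code, a generic column-incremental SVD update in the style of [26–30] is sketched below (our own simplification of Algorithm 5: only the branch with a non-zero residual is shown, and no rank truncation is performed).

```python
import numpy as np

def cisvd_append_columns(U, s, Vt, C):
    """Given a thin SVD  A ~ U @ diag(s) @ Vt  (U: n x k, Vt: k x m),
    return an SVD of [A, C] after appending the columns C (n x r')."""
    n, k = U.shape
    m = Vt.shape[1]
    r_new = C.shape[1]
    Y = U.T @ C                        # coordinates of C in the current subspace
    P = C - U @ Y                      # residual orthogonal to span(U)
    G, T = np.linalg.qr(P)             # P = G T, G orthonormal (n x r')
    K = np.block([[np.diag(s), Y],
                  [np.zeros((r_new, k)), T]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, G]) @ Uk
    V_ext = np.block([[Vt.T, np.zeros((m, r_new))],
                      [np.zeros((r_new, k)), np.eye(r_new)]])
    Vt_new = (V_ext @ Vkt.T).T
    return U_new, sk, Vt_new           # [A, C] ~ U_new @ diag(sk) @ Vt_new
```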
4 Test Experiments
We tested our fast spectral clustering on synthetic datasets. We assume the number of clusters, k, is known in our experiments, and set the number of examples r = 100k. As suggested in [4], spectral clustering with a suitable value of σ yields tight clusters in k-dimensional space. We manually search and set σ suitable for each dataset, and examine the performance especially in computation time with respect to data cardinality. Figure 3 shows clustering results of our fast spectral clustering for a toy example. We fixed the number of examples as r = 300 for three clusters regardless of the cardinality n. Nevertheless, we obtained the same results as those by normal spectral clustering by Algorithm 1. As we suggested about the role of the example points, r = 300 examples were sufficient to figure out the structures of these datasets. We evaluate the computation time over a wide range of data cardinality. Our clustering algorithm actually runs in O(n) time for low-dimensional data as shown in Fig. 4. Even if we implement our algorithm using the direct method of SVD for a full n × r matrix, it is faster than the normal algorithm accelerated by the Krylov subspace-based methods e.g., the Lanczos method of EVD. This outperformance is apparent unless the matrix is sparse. In fact, if we employ the Gaussian kernel function in Eq. (11), the affinity matrix is generally dense.
Fig. 3. Clustering with different cardinalities. Randomly chosen r = 300 points indicated by circles are used as the examples. (a) n = 700, (b) n = 5,000 and (c) n = 20,000.
Fig. 4. Computation time with respect to cardinality n
Another advantage of our algorithm is its low memory consumption. The experiments could be performed on large datasets with up to O(10^5) cardinality even on a 32-bit system. Such a machine can hardly run the EVD of a full matrix of O(10^4) size with in-memory algorithms.
5 Concluding Remarks
We achieved a substantial improvement in the computation of spectral clustering. The random projection for the approximate affinity matrix computation and random sampling for the subspace approximation are the two major techniques used in our fast clustering algorithm. These techniques are concrete and practical, comparable to the Nyström matrix approximation [5, 6] and the Krylov subspace methods [7–9]. We also present a scheme for updating clusters with low computational expense. The algorithm for updating clusters will be helpful for dynamic clustering of large-scale data. These algorithms should be tested on practical datasets to confirm our clustering scheme. Further theoretical research should focus on the problems of selecting the scale parameter and finding the number of clusters. These problems can be related to the error bound and the sampling number in our spectral clustering. The eigenvectors indicating clusters will be helpful for resolving these issues in practice [15, 16].
References

1. Hagen, L., Kahng, A.: New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design 11(9), 1074–1085 (1992)
2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
3. Ding, C.H.Q., He, X., Zha, H., Gu, M., Simon, H.D.: A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of ICDM 2001, pp. 107–114 (2001)
4. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2001)
5. Fowlkes, C., Belongie, S., Chung, F., Malik, J.: Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 214–225 (2004)
6. Drineas, P., Mahoney, M.W.: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6, 2153–2175 (2005)
7. Freitas, N.D., Wang, Y., Mahdaviani, M., Lang, D.: Fast Krylov methods for N-body learning. In: Advances in Neural Information Processing Systems, vol. 18, pp. 251–258. MIT Press, Cambridge (2006)
8. Song, Y., Chen, W.Y., Bai, H., Lin, C.J., Chang, E.Y.: Parallel spectral clustering. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part II. LNCS, vol. 5212, pp. 374–389. Springer, Heidelberg (2008)
9. Mahadevan, S.: Fast spectral learning using Lanczos eigenspace projections. In: AAAI, pp. 1472–1475 (2008)
10. Golub, G.H., Loan, C.F.V.: Matrix Computations, 3rd edn. Johns Hopkins University Press (1996)
11. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems, vol. 13, pp. 682–688. MIT Press, Cambridge (2001)
12. Von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
13. Fiedler, M.: Algebraic connectivity of graphs. Czechoslovak Mathematical Journal 23, 298–305 (1973)
14. Fiedler, M.: A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Mathematical Journal 25, 619–633 (1975)
15. Scott, G.L., Longuet-Higgins, H.C.: Feature grouping by relocalisation of eigenvectors of the proximity matrix. In: British Machine Vision Conference, pp. 103–108 (1990)
16. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems, vol. 17, pp. 1601–1608. MIT Press, Cambridge (2004)
17. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26, 189–206 (1984)
18. Dasgupta, S., Gupta, A.: An elementary proof of the Johnson-Lindenstrauss lemma. Technical report, UC Berkeley (1999)
19. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 671–687 (2003)
20. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: ACM SIGKDD ICKDDM, pp. 245–250 (2001)
21. Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: ACM SIGKDD ICKDDM, pp. 517–522 (2003)
22. Achlioptas, D., McSherry, F., Schölkopf, B.: Sampling techniques for kernel methods. In: Annual Advances in Neural Information Processing Systems, vol. 14, pp. 335–342. MIT Press, Cambridge (2002)
23. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD, pp. 269–274. ACM, New York (2001)
24. Gu, M., Eisenstat, S.C.: A stable and fast algorithm for updating the singular value decomposition. Tech. Rep. YALEU/DCS/RR-966, Yale University (1994)
25. Berry, M.W.: Large scale sparse singular value computations. International Journal of Supercomputer Applications 6, 13–49 (1992)
26. Bunch, J.R., Nielsen, C.P.: Updating the singular value decomposition. Numerische Mathematik 31, 111–129 (1978)
27. Gu, M., Eisenstat, S.C.: A stable and fast algorithm for updating the singular value decomposition. Tech. Rep. YALEU/DCS/RR-966, Yale University (1994)
28. Chandrasekaran, S., Manjunath, B.S., Wang, Y.F., Winkeler, J., Zhang, H.: An eigenspace update algorithm for image analysis. Graphical Models and Image Processing 59(5), 321–332 (1997)
29. Brand, M.: Incremental singular value decomposition of uncertain data with missing values. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 707–720. Springer, Heidelberg (2002)
30. Brand, M.: Fast online SVD revisions for lightweight recommender systems. In: SIAM International Conference on Data Mining, pp. 37–46 (2003)
31. Skocaj, D., Leonardis, A.: Weighted and robust incremental method for subspace learning. In: Proc. ICCV 2003, vol. 2, p. 1494 (2003)
32. Davies, P.I., Smith, M.I.: Updating the singular value decomposition. Journal of Computational and Applied Mathematics 170, 145–167 (2004)
33. Skocaj, D., Leonardis, A.: Incremental and robust learning of subspace representations. Image and Vision Computing 26, 27–38 (2008)
A Incremental Singular Value Decomposition

We describe in Algorithm 5 the incremental singular value decomposition for appending columns to a decomposed matrix. This algorithm ensures reconstructivity,

[ S_P̂Q̂  S_a ] = [ Z_P̂ Δ_P̂Q̂ Z_Q̂^⊤  S_a ] = Z_P̂^new Δ_P̂Q̂^new (Z_Q̂^new)^⊤,    (18)

and orthonormality,

(Z_P̂^new)^⊤ Z_P̂^new = (Z_Q̂^new)^⊤ Z_Q̂^new = I   if   Z_P̂^⊤ Z_P̂ = Z_Q̂^⊤ Z_Q̂ = I.    (19)

The contribution of S_a to relieving the degeneration of the subspace spanned by Z_P̂ is explicitly quantified by the Frobenius norm at Step 3.
Algorithm 5. Column-incremental SVD
Input: SVD triplet (Z_P̂ ∈ R^{n×s}, Δ_P̂Q̂ ∈ R^{s×s}, Z_Q̂ ∈ R^{r×s}) (n ≥ s) and S_a ∈ R^{n×r′};
Output: updated SVD triplet (Z_P̂^new, Δ_P̂Q̂^new, Z_Q̂^new);
1. Y ← Z_P̂^⊤ S_a;
2. P ← S_a − Z_P̂ Y;
3. p ← ||P||_F^2;
4. if p ≠ 0 then
5.   do QR decomposition of P to obtain an orthogonal matrix G ∈ R^{n×r′} and an upper triangular matrix T ∈ R^{r′×r′} such that P = GT and G^⊤ G = I;
6.   B ← [ Δ_P̂Q̂  Y ; O  T ] ∈ R^{(s+r′)×(s+r′)};
7.   do singular value decomposition of B to obtain U_B ∈ R^{(s+r′)×s′}, Δ_B ∈ R^{s′×s′} and V_B ∈ R^{(s+r′)×s′} such that U_B Δ_B V_B^⊤ = B and U_B^⊤ U_B = V_B^⊤ V_B = I;
8.   Z_P̂^new ← [ Z_P̂  G ] U_B ∈ R^{n×s′};
9.   Δ_P̂Q̂^new ← Δ_B ∈ R^{s′×s′};
10.  Z_Q̂^new ← [ Z_Q̂  O ; O  I_{r′} ] V_B ∈ R^{(r+r′)×s′};
11. else
12.  B ← [ Δ_P̂Q̂  Y ] ∈ R^{s×(s+r′)};
13.  do singular value decomposition of B to obtain U_B, Δ_B ∈ R^{s′×s′} and V_B ∈ R^{(s+r′)×s′} such that U_B Δ_B V_B^⊤ = B and U_B^⊤ U_B = V_B^⊤ V_B = I;
14.  Z_P̂^new ← Z_P̂ U_B ∈ R^{n×s′};
15.  Δ_P̂Q̂^new ← Δ_B ∈ R^{s′×s′};
16.  Z_Q̂^new ← [ Z_Q̂  O ; O  I_{r′} ] V_B ∈ R^{(r+r′)×s′};
17. end if
How Much True Structure Has Been Discovered? Validating Explorative Clustering on a Hold-Out Test Set

F. Höppner
University of Applied Sciences Braunschweig/Wolfenbüttel, D-38440 Wolfsburg, Germany
[email protected]
Abstract. Comparing clustering algorithms is much more difficult than comparing classification algorithms, which is due to the unsupervised nature of the task and the lack of a precisely stated objective. We consider explorative cluster analysis as a predictive task (predict regions where data lumps together) and propose a measure to evaluate the performance on a hold-out test set. The performance is discussed for typical situations and results on artificial and real world datasets are presented for partitional, hierarchical, and density-based clustering algorithms. The proposed S-measure successfully senses the individual strengths and weaknesses of each algorithm.
1 Introduction
Cluster analysis is about finding groups in data. The most prominent conception is that clusters are clouds of data objects that should be compact and well separated from each other [1–3] or an agglomeration of such clouds [4]. Alternatively "clusters may be described as connected regions of multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points" [2], which is closer to the density-based view on clustering. Many clustering algorithms have been proposed, such as prototype-based methods, density-based or hierarchical methods. Each algorithm has its individual strengths, weaknesses, biases and assumptions and usually it is not a priori clear which algorithm is best for the data at hand. It is still an open question how to compare the performance of clustering algorithms. Typically, a validity measure [5–11] is used to assess the quality of a result such that the performance can be compared indirectly. Compared to classification, the variety of performance measures is much larger and less established. Many validity measures make use of the individual outcome of the clustering algorithm and thus cannot be used with a competing algorithm from a different paradigm (this applies to [6–8], for instance). Others do not address the quality of the clustering, but the accordance of two results (such as [5]). A good match, however, can be achieved likewise with pairs of good clusters and pairs of poor clusters. To assess the performance, this measure would require the
existence of a reference solution, which is usually not present in unsupervised learning. Yet other measures verify the clusters against local distortions of the data [10] or resampling [11, 12], but they do not allow a comparison of the most stable partitions of two different clustering algorithms. Why does clustering suffer from this unsatisfying situation but classification does not? A classifier can be seen as a function f : X → C ∪ {⊥} providing for a record x from the d-dimensional data space X the predicted class c ∈ C (or ⊥ in case no prediction is made). This common way of using a classifier makes it easier to define validation procedures that can be used with any classifier. With clustering there is not a single common usage – in many applications cluster analysis is used as a means to an end (e.g. in preprocessing tasks, local regression model, etc.), but in others as an end in itself (e.g. descriptive or explorative tasks). Focussing on one kind of application may help us to find a validation procedure more general than existing validity measures. In this paper, we will follow this idea and focus on explorative clustering (pattern discovery). We discuss the typical use in this case, in what respect the known clustering algorithms provide this information and propose a first validity measure that goes into this direction. Results on three different kinds of clustering algorithms (partitional: k-Means [13], density-based: DBScan [14], agglomerative hierarchical clustering (AHC), e.g. [4]) demonstrate the applicability of the measure across different approaches to cluster analysis.
2 Explorative Analysis and Cluster Assignment
Suppose we are interested in finding groups in our database (of customers, for example). Note that we are not interested in artificially dividing the customers into similar groups, but we want to know if the data itself supports a partition into different groups of customers. We then want to understand the result, i.e., in the explorative setting we need a compact, comprehensive description:
1. Where do the clusters lie in the data space and how far do they extend? – Without loss of generality we assume that clusters are represented by a set of geometrical elements or shapes such as hyperballs, -boxes, -planes, -ellipsoids, Voronoi cells, etc. For the sake of simplicity, in this paper we assume that all clusters c ∈ C are composed out of hyperspherical shapes B(x, r) = {y ∈ X | ‖x − y‖ ≤ r}.¹
2. How large is the gap between the clusters, how well is a cluster separated from the remainder? – We indicate the gap by a separation area around the clusters of width 2ε (such that hyperballs of range ε will fit into this area). This range may vary from cluster to cluster, of course.
Thus, a clustering algorithm may deliver a set S ⊆ IN × X × IR+ × IR+ where (c, x, r, ε) ∈ S denotes a hyperball B(x, r) whose elements belong to cluster #c ∈ IN with a separation of at least 2ε (with ε ≤ r, cf. Sect. 3).
Note that we do not assume that all clusters have hyperspherical shapes, we only assume that we can approximate their shape by the union of several hyperspheres.
Table 1. Providing the necessary information for explorative analysis

DBScan [14]: DBScan represents a cluster already by a set of small hyperspheres with some fixed radius ε around the core points of the cluster. This set corresponds exactly to the set of basic shapes. The algorithm guarantees only that neighboring clusters have at least a separation of 2ε, therefore this parameter can be taken unaltered.

k-Means [13]: The original k-Means algorithm is a partitional clustering algorithm that assigns every record to a cluster; it does not consider the possibility of noise or outliers. Here, we heuristically define the extent of each cluster by means of hyperspheres: if data does not fall into their volume, it does not belong to the cluster. Furthermore, if the hyperspheres of two prototypes intersect, we consider them as representing a single cluster, so the number of obtained clusters is not always identical to k. Let C_i contain all data assigned to cluster #i. For each cluster #i with prototype p_i ∈ X we calculate the average distance of all data points assigned to this cluster:

r_i = (1/|C_i|) Σ_{x ∈ C_i} ‖x − p_i‖

Since this is the average distance, we choose the size of the cluster twice as large, i.e., we define the shape of the cluster as a hypersphere with the prototype as its center and the radius r_i^• = (2 · r_i^d)^{1/d}, where d is the dimensionality. The distance between the hyperspheres of cluster #i and #j is given by d_{i,j} = ‖p_i − p_j‖ − r_i^• − r_j^•. In case d_{i,j} < 0 the hyperspheres intersect and we consider them as a single cluster (represented by two (or even more) prototypes). From the prototype locations and the cluster sizes we can directly calculate the separation margin between the clusters. If the resulting ε gets very small, 10% of the cluster radius is used as a lower bound for ε (if ε is very small, the estimated density may easily become very high because the volume of the neighborhood gets very small).

Agglomerative Hierarchical Clustering (AHC), e.g. [4]: We consider the classical single-linkage AHC algorithm here. The resulting dendrogram is used to come up with the final clusters, by cutting off all edges that cross a certain distance level. This cut-off distance d_cut is often determined by visual inspection of the dendrogram. The leaves of the remaining trees represent the elements of the cluster. By including all data in hyperspheres of radius d_cut around each data object in the cluster, the clusters remain disjoint and are represented by simple shapes. Thereby a separation of ε = d_cut is guaranteed.
To avoid ambiguities when clusters are composed out of multiple shapes, we require ∀(c, x, r, ε), (c′, x′, r′, ε′) ∈ S : B(x, r) ∩ B(x′, r′) ≠ ∅ ⇒ c = c′ (that is, overlapping hyperspheres belong to the same cluster). This information gives a concise impression of the reported clusters and also suits an assignment task, where new data has to be associated with the identified clusters (or regarded as noise). Although probably all clustering algorithms claim that they are (more or less) suited for explorative analysis and the assignment task, we can already see at
this point that prominent clustering algorithms do not provide all of the above-mentioned information. For instance, k-Means provides no information about the size or separation of the clusters²; agglomerative hierarchical clustering delivers the points belonging to the clusters, but no extension of the clusters. Nevertheless, such algorithms are used for such tasks – and typically they are heuristically extended to provide the missing properties. But these extensions are usually not part of the original algorithm and often not subject to the evaluation procedure. Our proposition is that for explorative purposes a clustering algorithm must deliver the above-mentioned information, and if a heuristic is involved in obtaining this information then it must also be part of the evaluation. While the validity measure proposed in the next section is not applicable to standard k-Means alone (because separation information is missing), it applies very well to numerous variants of k-Means with different heuristic extensions. For this paper, we canonically extend three common clustering algorithms in a straightforward fashion, as is often done in the literature (cf. Table 1); a small code sketch of such an extension for k-Means is given below. We intentionally selected three clustering algorithms from different paradigms to show the wide applicability of the approach. These algorithms (k-Means, DBScan, AHC) are widely known and we refer to the literature for a detailed description. We restrict ourselves to a brief summary of their outcomes. The k-Means algorithm delivers (an a priori specified number of) k prototypical data objects that represent a whole cluster. All data is associated with its closest prototype. The DBScan algorithm delivers for each cluster a subset of the dataset, the set of core points, which offer some minimal data density. The core points as well as all data within some epsilon-range constitute the cluster. The AHC algorithm returns a binary tree with the data at the leaves (dendrogram). Each inner node represents a union of its descendants at some specific distance, representing the distance between all data in the respective subtrees. By cutting the tree at some distance d, it resolves into several subtrees whose sets of leaves represent the individual clusters.
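The sketch below (our own, purely illustrative; the names and the simple merging heuristic are assumptions) shows how the shape tuples (c, x, r, ε) of Sect. 2 can be derived from a plain k-Means result in the way Table 1 describes.

```python
import numpy as np

def kmeans_to_shapes(X, labels, centers, eps_floor=0.1):
    """Derive shape tuples (cluster id, center, r, eps) from a k-Means
    result, roughly following the k-Means row of Table 1."""
    d = X.shape[1]
    k = len(centers)
    radii = np.zeros(k)
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue
        r_avg = np.mean(np.linalg.norm(members - centers[i], axis=1))
        radii[i] = (2.0 * r_avg ** d) ** (1.0 / d)   # twice the volume
    # prototypes whose hyperspheres intersect represent one cluster
    # (simple pairwise pass, not a full union-find)
    cluster_id = list(range(k))
    for i in range(k):
        for j in range(i + 1, k):
            gap = np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
            if gap < 0:
                cluster_id[j] = cluster_id[i]
    shapes = []
    for i in range(k):
        gaps = [np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
                for j in range(k) if cluster_id[j] != cluster_id[i]]
        eps = min(gaps) / 2.0 if gaps else radii[i]
        eps = min(max(eps, eps_floor * radii[i]), radii[i])  # 10% floor, eps <= r
        shapes.append((cluster_id[i], centers[i], radii[i], eps))
    return shapes
```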
3 Measuring Recovered Structure
Given that we have the desired information, we now consider how to measure how well the dataset-inherent structure has been recovered. In all definitions of a cluster, it must distinguish itself from its neighborhood by an increased data density. To verify the existence of a cluster, we have to approve a change in the data density: we expect the data density within a cluster c (interior density •_c) to be larger than the data density outside the cluster c (exterior density ◦_c). The larger the difference •_c − ◦_c, the more distinctive is the discovered structure. If there is no difference at all or the exterior density is even larger than the interior density, the algorithm has made an error in its prediction
During k-Means iterations, all data is associated with the closest prototype, but this does not necessarily mean that once the final prototype locations are found all new data will also belong to one cluster – in this case there would be no ’void space’ between the clusters.
Fig. 1. Problems that occur with some clustering algorithms
(a false positive). This gives us a very intuitive understanding of the amount of structure correctly discovered by a clustering algorithm: we define the S-measure (structure measure) as the sum of the differences in the data densities as observed on a hold-out test set:

S = Σ_{c∈C} (•_c − ◦_c)    (1)

Since the recognized structure should be substantial, that is, generalize to the population from which the data was sampled, we measure this difference in the data density on a hold-out test set. While the evaluation of a classifier is done per record (correct class prediction), we evaluate a partition per cluster (correctly predicted existence of a cluster). Rather than using the data from the hold-out test set one by one to verify the results, it is used to get estimates of •_c and ◦_c. The data densities within and outside the cluster are derived from multiple estimates. In a first run, we use the test data as seeds for probing points at which we measure the data densities. In a second run, we estimate the data density at these locations and calculate interior and exterior densities. To critically evaluate the cluster c, we are particularly interested in (cf. Fig. 1) ...
... regions of high data density outside the cluster, because this may indicate false positives (Fig. 1a), poor shape recognition (Fig. 1b, 1c) or cluster splitting (Fig. 1e). Therefore we use the data from the hold-out test set as seeds for a set P_c^◦ of exterior probing points. We thereby ignore regions without any test data, but the estimated density would be zero anyway.
... regions of low data density inside the cluster, because this indicates poor shape recognition (Fig. 1b, 1d) or a noise-bridge between clusters (Fig. 1f). So we do not use the test data as seeds but draw random samples x′ ∈ B(x, r) from a cluster-defining hyperball B(x, r) to obtain a set of interior probing points P_c^•.
Finally, for each x ∈ P_c^◦ ∪ P_c^• we estimate the data density ρ_x at x. For a set of probing points P_c^∗ (either P_c^◦ or P_c^•), let ρ(P_c^∗) = {ρ_x | x ∈ P_c^∗}. Then we define for each cluster c:

•_c as the 33rd percentile of ρ(P_c^•)   and   ◦_c as the 66th percentile of ρ(P_c^◦)
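Put into code, the evaluation could look as follows (our own sketch; the probing points are assumed to be given per cluster, and the density at a probing point is estimated by counting hold-out points inside the ε-ball and dividing by its volume, which is only one possible estimator).

```python
import numpy as np
from math import gamma, pi

def ball_volume(eps, d):
    return pi ** (d / 2) / gamma(d / 2 + 1) * eps ** d

def density(point, test_set, eps):
    """Density estimate: hold-out points within the eps-ball around
    `point`, divided by the ball volume."""
    inside = np.sum(np.linalg.norm(test_set - point, axis=1) <= eps)
    return inside / ball_volume(eps, test_set.shape[1])

def s_measure(interior_probes, exterior_probes, test_set, eps_per_cluster):
    """S = sum over clusters of (33rd percentile of interior densities
    minus 66th percentile of exterior densities), Eq. (1).
    interior_probes / exterior_probes: dicts cluster id -> list of points."""
    S = 0.0
    for c in interior_probes:
        eps = eps_per_cluster[c]
        rho_in = [density(x, test_set, eps) for x in interior_probes[c]]
        rho_out = [density(x, test_set, eps) for x in exterior_probes[c]]
        S += np.percentile(rho_in, 33) - np.percentile(rho_out, 66)
    return S
```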
The rationale for using different percentiles (rather than the median) is again the critical evaluation of the reported clusters as discussed above (bias towards low densities within the cluster and towards high densities outside the cluster). With this definition of ρ•_c and ρ°_c the S-measure penalizes false positives because the summand even becomes negative in case there is no difference in the (average) data density within and outside the cluster (Fig. 1a). Due to the variance in the measurements, ρ°_c (66th percentile) will yield higher values than ρ•_c (33rd percentile) and thus ρ•_c − ρ°_c < 0. There is no penalty for false negatives (clusters that were not discovered) because we have no knowledge about such clusters.3 However, since S is not bounded, other clustering algorithms that discover the missed clusters will outperform the former algorithm in terms of recovered structure. Note that we intentionally have removed any factor for the size of the cluster in order not to discriminate against small but substantive clusters (cf. [15]).

We use hyperballs B(·, ε) of the same size for estimating the interior and exterior density of a shape (c, z, r, ε). The rationale behind this decision is that the interior and exterior of a cluster should be analyzed at the same resolution: if there is only a small margin between two clusters, we have to look at the data carefully to distinguish the clusters. In this case we do not expect the clusters themselves to have gaps of similar size, because this would make the current partition questionable. Therefore, the interior density should be measured at the same resolution as the separation area. Similar arguments apply to the case of well-separated clusters.

Implementation and Examples for Probing Point Selection

Fig. 2 illustrates how the probing points are determined in our experiments. For a given x the closest shape element (c, z, r, ε) is identified. If x lies within the hyperball B(z, r), it belongs to the cluster c. We randomly determine an x' within B(z, r − ε), such that the ε-neighborhood around x' is completely contained in B(z, r) (case A). Otherwise, if there is no hyperball containing x, x does not belong to any cluster and we identify the closest separation area. We avoid border effects by shifting x towards z such that N = B(x', ε) lies completely in the separation area (case B). In all other cases, x neither belongs to a cluster nor to a separation area. Once the probing points have been determined, the densities at these spots are estimated during a second run over the test set. To illustrate the probing point selection, Fig. 3 shows the probing points for some exemplary cases. In the top row, three clusters identified by DBScan are shown. One can clearly see how the probing points for the separation area are aligned around the respective cluster. The space between the exterior probing points and the interior probing points within the cluster corresponds to the diameter of the neighborhood.
3 Penalizing false negatives would require their identification by, e.g., a clustering algorithm. But it is exactly the performance of these clustering algorithms that we seek to measure, so using them during the evaluation would be circular reasoning.
    in:  set of shapes S, x from the hold-out test set
    out: returns probing point

    (c, z, r, ε) = argmin_{(c,z,r,ε) ∈ S} | ‖z − x‖ − r |
    if ‖z − x‖ ≤ r                              (case A: x lies within cluster c)
        randomly sample x' ∈ B(z, r − ε)
        return interior probing point x' for cluster c
    endif
    (c, z, r, ε) = argmin_{(c,z,r,ε) ∈ S} | ‖z − x‖ − r − 2ε |
    if ‖z − x‖ ≤ r + 2ε                         (x in separation area)
        let x' = x
        if ‖z − x'‖ ≤ r + ε                     (case B: adjust x')
            x' = x + ((r + ε) / ‖z − x'‖) · (z − x')
        return exterior probing point x' for cluster c
    endif
    return ⊥                                    (case C: x is not used as a seed)
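The pseudocode can be read as the following Python sketch. The adjustment in case B is interpreted here as placing x' at distance r + ε from z, so that the ε-neighborhood lies entirely inside the separation area; this geometric reading, the uniform ball sampling and the NumPy types are assumptions of the sketch, not the paper's exact implementation.

    import numpy as np

    def probing_point(shapes, x, rng=None):
        # shapes: list of (c, z, r, eps); x: a record from the hold-out test set.
        rng = rng if rng is not None else np.random.default_rng()
        x = np.asarray(x, dtype=float)
        # Case A: x lies within the hyperball of the closest shape.
        c, z, r, eps = min(shapes, key=lambda s: abs(np.linalg.norm(s[1] - x) - s[2]))
        if np.linalg.norm(z - x) <= r:
            d = len(z)
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            # uniform sample in B(z, r - eps), so B(x', eps) stays inside B(z, r)
            x_new = z + v * (r - eps) * rng.random() ** (1.0 / d)
            return ("interior", c, x_new)
        # Otherwise look for the closest separation area.
        c, z, r, eps = min(shapes,
                           key=lambda s: abs(np.linalg.norm(s[1] - x) - s[2] - 2 * s[3]))
        dist = np.linalg.norm(z - x)
        if dist <= r + 2 * eps:
            # Case B (as interpreted here): move x' onto the sphere of radius
            # r + eps around z so that B(x', eps) lies in the separation area.
            x_new = z + (x - z) * (r + eps) / dist
            return ("exterior", c, x_new)
        return None  # Case C: x is not used as a seed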
Fig. 2. Determining probing points for ρ• and ρ° (cases A, B and C; cluster core area and cluster separation area)
Fig. 3. Location of probing points for selected clusters found by DBScan (a-c), AHC (d, e) and k-Means (f). Data too far away from the selected cluster is shown unaltered. In case (e) artificial lines have been added to support the visibility of the contour.
One half of this space is covered by the neighborhoods of the cluster core and the separation area, respectively. The first two examples from the second row illustrate the case of AHC, which is quite similar to the DBScan results. The second example shows the case where the single linkage algorithm has resulted in a joint cluster for the two data agglomerations at the bottom.
Since the space between the three data agglomerations is sparsely populated, there are fewer exterior probing points in this case and their alignment is less clearly visible. (But introducing more probing points does not make sense – as there is no data from our hold-out test set in this area, the estimated density will be zero.) Finally, the last example shows the case of k-Means. Two cases (b, d) illustrate the report of false positives, which will be penalized by the S-measure, because the exterior density is equally high or higher than the interior density. Regarding runtime complexity, the complete evaluation consists of the selection of k probing points (which is O(n) with a test set of size n), the density estimation (which is O(n · k)) and the determination of the respective percentiles (which is O(k log k) in a naïve implementation). For our datasets of moderate size we used k ≈ n, but for large datasets it is sufficient to consider a random sample of fixed size s per cluster, that is, k = s · |C|. Thus, the approach is also suited for large data sets.
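The density estimation of the second run can be sketched as a simple O(n · k) count of test records inside each probing point's ε-neighborhood; using the raw count (rather than normalizing by the ball volume) is an assumption made here for brevity.

    import numpy as np

    def densities_at(probing_points, test_data, eps):
        # Estimate the data density at each probing point as the number of
        # hold-out test records within its eps-neighborhood (second run, O(n*k)).
        test_data = np.asarray(test_data, dtype=float)
        densities = []
        for p in probing_points:
            dists = np.linalg.norm(test_data - np.asarray(p, dtype=float), axis=1)
            densities.append(int(np.sum(dists <= eps)))
        return densities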
4 Experimental Evaluation
Various artificial and real datasets have been used to test the S-measure. The tables show the averaged S-values over a 10-fold cross validation. Since the density estimation in the validation phase also requires sufficient data, the training and test sets were of the same size (50% each), that is, the training/test datasets consisted of subsets #1-5 (test)/#6-10 (train), #2-6 (test)/#1,7-10 (train), etc. Note that the S-values in the tables evaluate the clustering as a whole; a poor S-value does not necessarily mean that none of the clusters has been recovered successfully, but may also be caused by penalizing a number of false positives. A more detailed "per cluster" inspection can be carried out by looking at the individual summands. For each algorithm, we tried a range of settings for the main parameter, but there was no exhaustive search for the best setting. As already mentioned in Sect. 1, a comparison with competitive measures is not possible, because to the best of our knowledge none of the known measures can be used for all three algorithms.4
4.1 Typical Situations (Artificially Generated)
The design of the following artificially generated datasets is driven by the known weaknesses of clustering algorithms, such as the reporting of false positives (k-Means), difficulties with slowly changing data densities (DBScan), poor shape recognition (k-Means), sensitivity to outliers (AHC), etc. The datasets provoke such situations, and our aim is not to compare the performance of the clustering algorithms in the first place (we know in advance which will perform best), but to see if the judgment of the S-measure corresponds to our expectations, that is, if it is sensitive enough to recognize the poor performance caused by the weaknesses of the respective algorithms.
4 Some approaches are applicable to all three algorithms, but they measure robustness or sensitivity rather than the overall quality of the clustering (see also the introduction).
         k-Means                    DBScan (eps=0.2)              AHC
    k    |C|    μ        σ      MinPts  |C|    μ       σ      cut    |C|    μ       σ
    2    1      -0.8     22.1   7       1      128.1   5.8    0.10   1-5    -57.2   44.5
    3    1      -2.7     22.1   10      1-2    114.5   26.5   0.13   1      58.3    42.0
    4    2-4    -81.4    60.5   13      1      97.8    28.8   0.16   1      114.3   7.8
    5    3-4    -162.1   46.0   16      1      56.5    23.8   0.19   1      111.0   6.1
    6    4      -182.3   38.7   19      1-3    -26.2   50.3   0.22   1      106.5   5.1

Fig. 4. 1024 records regularly distributed within [−1, 1]^2
No clusters present: Some clustering algorithms (e.g. k-Means) report a cluster in some area where the data density is high, but there is no difference between the data density within and outside the reported cluster (cf. Fig. 1a). Such arbitrarily selected clusters are incidental and the S-measure should penalize the reporting of such clusters (cf. discussion in Sect. 3). The dataset in Fig. 4 consists of a uniform grid without any internal structure. The table shows, for each algorithm, the selection of the main parameter, the number of obtained clusters (minimum/maximum number in the 10-fold cross validation) and the mean and standard deviation of the S-measure. Note that each clustering algorithm gets only half of this dataset (random sample) and therefore it does not look quite as regular to the clustering algorithm as the full dataset shown in Fig. 4. The omitted data introduces a slight perturbation in the data density. This dataset represents an uncomfortable situation for k-Means, since the k prototypes have to be placed somewhere. For small values of k, the prototypes are united into one cluster (cf. Table 1), but still the union of 2-3 spheres poorly approximates a rectangle. On the test set, there are no differences in the average interior and exterior densities and the S-measure correctly indicates the poor performance. For small values of MinPts, DBScan correctly identifies a single cluster that contains (almost) all data (for MinPts=4 and 7 we obtain the true density of 128, i.e., half of the 1024 records within [−1, 1]^2). As MinPts is increased, more and more data is required to become a core point. This is achieved by chance in regions where most of the data from the full set is contained in the training set. But during validation against the test set, the reported cluster shape cannot be confirmed and the S-value decreases. For AHC, if the cut-off distance is chosen too small (0.1), the small perturbations induce up to 5 clusters that cannot be justified by the test set. For larger values all records are included in a single cluster, but AHC does not achieve the correct (maximal) S-measure of 128 because in hierarchical clustering there is no distinction between core and border points of a cluster (as in DBScan). Therefore, the density estimation near the border includes empty areas, which dampens the obtained interior densities.

(Invalid) Assumptions on Shape. Another problem with many clustering algorithms is that assumptions on the shape of the clusters do not hold in a given dataset.
         k-Means                    DBScan (eps=0.12)              AHC
    k    |C|    μ        σ       MinPts  |C|   μ        σ      cut    |C|    μ        σ
    2    1      -158.6   146.0   4       3     1427.9   39.2   0.10   3-5    1314.6   155.9
    3    3      638.4    12.0    7       3     1439.0   45.9   0.13   3-5    1243.1   144.5
    4    3      598.2    58.0    10      3     1430.1   45.9   0.16   3-4    1100.4   92.7
    5    3      638.0    42.7    13      3     1430.1   54.1   0.19   3      1066.9   19.0
    6    3      573.7    63.4    16      3     1423.5   73.8   0.22   3      971.3    14.8

Fig. 5. Results on dataset mixed (500 records per cluster)
As a result, the models fit the true cluster poorly or multiple models are used to approximate a single cluster in the data. If such a case occurs, there is either (a) some part of the separation area with a data density as high as the interior area or (b) some part of the interior area with a data density as low as the separation area. The probing point selection method as well as the definition of ρ•_c and ρ°_c as the 33rd and 66th percentiles were designed to detect such situations (cf. Sect. 3). The dataset in Fig. 4 is closely related to this problem, because there is a single cluster of rectangular shape and the assumption of hyperspherical clusters does not hold for k-Means. We have already seen that the poor performance of k-Means has been detected by the S-measure. Another example is shown in Fig. 5, consisting of three types of clusters: a Gaussian cluster, a box and a ring. Although the size of the ring-shaped cluster is recognized by the k-Means clusters quite well (the data-to-cluster association is correct), it receives small interior densities since most of the neighborhoods of interior probing points are empty. Again, AHC and DBScan detect the correct shapes and receive higher S-values.

Separation and Noise. If clusters of the Gaussian type are close together, it becomes more difficult to sharply distinguish them. The examples in Fig. 6c and 7d consist of three clusters each, but with different separation. The larger the separation area (with a reduced data density) is, the better the cluster can be distinguished from its surrounding. The k-Means algorithm has an overview of the (relative) position of each cluster and can actively adjust the width of
         k-Means                   DBScan (eps=0.12)               AHC
    k    |C|    μ       σ      MinPts  |C|    μ       σ       cut    |C|     μ        σ
    2    1      -76.2   84.6   4       2-5    285.1   67.8    0.10   6-12    -474.2   361.8
    3    3      748.7   51.3   7       3-4    369.1   73.7    0.13   2-5     43.3     123.0
    4    3      541.1   65.8   10      3-4    406.7   76.0    0.16   1-2     46.0     142.7
    5    3      600.1   78.8   13      3-4    296.2   118.4   0.19   1       140.1    8.7
    6    3      567.1   51.4   16      2-3    137.0   155.8   0.22   1       143.3    8.6

Fig. 6. Results on dataset three-3-7
         k-Means                    DBScan (eps=0.12)               AHC
    k    |C|    μ       σ       MinPts  |C|    μ       σ       cut    |C|    μ       σ
    2    1      -48.8   113.7   7       3-4    355.8   67.8    0.10   4-8    -98.6   241.2
    3    3      691.9   21.5    10      3      391.2   45.4    0.13   3-6    210.9   247.9
    4    3      679.9   52.3    13      3      444.3   54.5    0.16   3-4    391.6   75.5
    5    3      572.4   32.0    16      3      457.5   66.7    0.19   3-4    388.8   86.5
    6    3      521.2   44.6    19      3-4    419.9   138.2   0.22   3      391.6   75.5

Fig. 7. Results on dataset three-3-10
the separation area (cf. Table 1). Accordingly, the S-values of k-Means are very good for all cases but k = 2. It is also remarkable that the correct number of clusters is determined in all cases for k = 3, 4, 5, 6 due to the prototype-merging heuristic in Table 1. For k > 3 at least one cluster is composed of two prototypes and this influences the shape of the cluster – it is no longer spherical. The largest S-measure, however, is obtained for k = 3, where each cluster is correctly approximated by a single hypersphere. In contrast to k-Means, DBScan and AHC take a fine-grained look at the clusters and do not provide much information about their true separation. DBScan has difficulties with clusters that level out smoothly, because incidental data agglomerations can influence the shape of the cluster dramatically: As shown in Fig. 3c, although the position of the cluster has been determined correctly, we obtain a flat ellipsoid rather than a spherical shape. This shape was justified by the training set, but was not completely verified by the test set, which is why the difference ρ•_c − ρ°_c and therefore its contribution to the S-measure is rather small. AHC suffers from the single linkage distance: As shown in Fig. 3d, the clusters easily get connected by chance due to single data points. A consistently high data density cannot be verified in the test set, which is why this poor recovery is penalized by the S-measure. The bias of k-Means is best suited for these datasets and this is reflected by the S-values. Again, the expected performance of the algorithm is well recovered by the S-measure.
4.2 Real Data
We show the results on three real datasets, namely the iris data, dimensions 4 and 8 of the cloud dataset, and an extract of the wine dataset (dimensions 2, 7, 10 and 11). The overall best performance for iris (Fig. 8) is achieved by DBScan due to the good shape adaptation. The extremely high variance in the AHC results again indicates the deficiencies of the single linkage method (for such a small dataset as iris, the existence of a single record may be crucial for AHC). DBScan has an overall tendency towards 2 clusters, whereas with k-Means the majority of runs ended with 3 clusters.
IRIS
         k-Means                DBScan (eps=0.5)
    k    |C|    μ      σ       MinPts  |C|    μ      σ
    2    2      11.6   3.3     4       2-3    68.7   13.3
    3    2-3    27.4   10.6    7       2      78.7   15.2
    4    2-3    17.8   13.7    10      2      76.8   10.3
    5    2-3    10.8   29.1    13      1-2    68.4   7.4
    6    2-4    37.2   32.2    16      1-2    60.3   6.3

IRIS, AHC
    cut    |C|    μ       σ
    0.19   1-3    -9.6    366.5
    0.22   2-4    -69.2   215.0
    0.25   1-4    -41.5   21.1

CLOUD (dimensions 4 and 8)
         k-Means                 DBScan (eps=0.16)                AHC
    k    |C|    μ       σ       MinPts  |C|    μ        σ       cut    |C|    μ        σ
    2    2      677.9   66.0    4       2-3    1449.8   868.8   0.13   2-7    1817.5   848.1
    3    3      181.3   42.4    7       1-4    1965.8   635.7   0.16   2-5    1203.6   771.2
    4    3      -93.7   78.3    10      2-4    2320.1   190.0   0.19   1-3    1176.2   654.1
    5    4      614.8   256.9   13      4      2392.2   220.1   0.22   1-2    957.5    516.3
    6    5-6    814.1   202.9   16      3-5    2386.0   215.2   0.25   1      217.9    22.3

WINE (dimensions 2, 7, 10 and 11)
         k-Means              DBScan (eps=0.75)            AHC
    k    |C|    μ      σ      MinPts  |C|    μ      σ      cut    |C|    μ       σ
    2    2      16.0   3.5    4       3-5    -4.3   7.3    0.60   5-8    -19.2   13.1
    3    2      16.1   2.6    7       2      8.8    2.4    0.65   4-6    -7.7    10.4
    4    2      17.5   2.7    10      2      13.0   2.0    0.70   3-5    -4.5    11.5
    5    2      15.4   4.6    13      2      14.4   2.9    0.75   3-5    -10.3   7.6
    6    2      17.9   6.4    16      1-2    14.4   4.1    1.10   1      2.3     0.2

Fig. 8. Results on real datasets (IRIS, WINE, CLOUD)
The 2-dimensional subset of the cloud dataset is comparable to the dataset in Fig. 5 with respect to the variability in the cluster shapes. Similar to earlier results, the k-Means results are inferior to DBScan and AHC, although the variances of DBScan and AHC are at least in some cases extremely high again (compared to k-Means), which is due to their high sensitivity to noise and the occasional induction of small local clusters that cannot be verified in the test set. The 4-dimensional subset of the wine dataset is roughly comparable to the dataset in Fig. 6 in terms of cluster shape and noise level. Similar to earlier results, k-Means performs best here.
5 Conclusions
We have investigated a validity measure that offers some important new properties. Firstly, it allows the direct comparison of results of clustering algorithms from different paradigms. Secondly, focusing on explorative analysis, we have proposed to validate the results of a clustering algorithm by verifying the obtained clusters on a hold-out test set: the larger the difference between the interior and exterior density of the cluster, the more pronounced is the cluster. Thirdly, the measure penalizes the detection of false positives (reported clusters that are actually none).
Experiments have been carried out on clustering algorithms from three different paradigms (partitional, hierarchical, density-based). The experiments on various datasets have shown that the measure rewards the strengths and penalizes the weaknesses of clustering algorithms from different paradigms. Therefore we consider the S-measure a promising new direction. In some experiments the observed variance of the S-values was quite high. There are multiple reasons for this, but the most influential aspect is that for each of the three algorithms there are certain situations to which they respond very sensitively, thereby inducing large variance in the results. This is well known in the literature and the experiments just underline that it remains an important research problem to reduce the sensitivity of the clustering algorithms such that they deliver clusters more robustly.
References

1. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data – An Introduction to Cluster Analysis. Wiley, Chichester (1989)
2. Everitt, B.S.: Cluster Analysis. Wiley, Chichester (1974)
3. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, Chichester (1975)
4. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)
5. Rand, W.M.: Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
6. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
7. Xie, X.L., Beni, G.: A Validity Measure for Fuzzy Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(8), 841–847 (1991)
8. Fisher, D.H.: Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning 2(2), 139–172 (1987)
9. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering Validity Methods: Part I 31(2), 40–45 (2002)
10. Möller, U., Radke, D.: A Cluster Validity Approach based on Nearest Neighbour Resampling. In: Proc. 18th Int. Conf. Pattern Recognition, pp. 892–895 (2006)
11. Levine, E., Domany, E.: Resampling Methods for Unsupervised Estimation of Cluster Validity. Neural Computation 13, 2573–2595 (2001)
12. Borgelt, C., Kruse, R.: Finding the Number of Fuzzy Clusters by Resampling. In: IEEE Int. Conf. on Fuzzy Systems, pp. 48–54 (2006)
13. McQueen, J.B.: Some methods of classification and analysis of multivariate observations. In: Proc. of 5th Berkeley Symp. on Mathematical Statistics and Probability, pp. 281–297 (1967)
14. Ester, M., Kriegel, H.P., Sander, J., Xiaowei, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 2nd ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, Portland, Oregon, pp. 226–331 (1996)
15. Höppner, F.: Local pattern detection and clustering – are there substantive differences? In: Morik, K., Boulicaut, J.-F., Siebes, A. (eds.) Local Pattern Detection. LNCS (LNAI), vol. 3539, pp. 53–70. Springer, Heidelberg (2005)
Efficient Clustering of Web-Derived Data Sets

Luís Sarmento1, Alexander Kehlenbeck2, Eugénio Oliveira1, and Lyle Ungar3

1 Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal
[email protected], [email protected]
2 Google Inc, New York, NY, USA
[email protected]
3 University of Pennsylvania - CS, 504 Levine, 200 S. 33rd St, Philadelphia, PA, USA
[email protected]
Abstract. Many data sets derived from the web are large, high-dimensional, sparse and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, where large classes are incorrectly divided into many smaller clusters, and computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and so works well on web-type data.
1 Introduction

Clustering data sets derived from the web - either documents or information extracted from them - provides several challenges. Web-derived data sets are usually very large, easily reaching several million items to cluster and terabyte sizes. More fundamentally, web-derived data sets have specific data distributions, not usually found in other datasets, that impose special requirements on clustering approaches. First, web-derived datasets usually involve sparse, high-dimensional feature spaces (e.g., words). In such spaces, comparing items is particularly challenging, not only because of problems arising from high dimensionality [1], but also because most vectors in sparse spaces will have similarities close to zero. Also, class distributions of web-derived data are usually highly unbalanced (often Zipfian), with one or two dominant classes and a long tail of smaller classes. This causes a problem for clustering algorithms, which need to be able to deal with such an unbalanced distribution in web-derived data and still correctly cluster items of non-dominant classes. Additionally, methods to cluster such large data sets have to deal with the fact that "all-against-all" comparison of items is impossible. In practice, items can only be compared to cluster summaries (e.g., centroids) or to only a few other items. The most widely used methods for clustering extremely large data sets are streaming clustering methods [2] that compare items against centroids. Streaming clustering has linear computational complexity and (under ideal conditions) modest RAM requirements. However, as we will show later, standard streaming clustering methods are less than ideal for web-derived data because of the difficulty in comparing items in high-dimensional, sparse and noisy spaces.
As a result, they tend to produce sub-optimal solutions where classes are fragmented into many smaller clusters. Additionally, their computational performance is degraded by this excessive class fragmentation. We propose a clustering algorithm that has performance comparable to that of streaming clustering for well-balanced data sets, but that is much more efficient for the sparse, unevenly sized data sets derived from the web. Our method relies on an efficient strategy for comparing items in high-dimensional spaces that ensures that only the minimal sufficient number of comparisons is performed. A partial link graph of connected components of items is built, which takes advantage of the fact that each item in a large cluster only needs to be compared with a relatively small number of other items. Our method is robust to variation in the distribution of items across classes; in particular, it efficiently handles Zipfian distributed data sets, reducing fragmentation of the dominant classes and producing clusters whose distributions are similar to the distribution of true classes.
2 Streaming Clustering of Web Data

For the purpose of explaining the limitations of streaming clustering for web-derived data sets, we will consider a single pass of a simplified streaming clustering algorithm. This simplification emphasizes the problems that streaming clustering algorithms face, while not changing the basic philosophy of the algorithm. (Later we will show that this analysis can be extended to realistic streaming-clustering approaches.) The simplified version of the streaming clustering algorithm we will be using is:

1. shuffle all items to be clustered and prepare them for sequential access;
2. while there are unclustered items, do:
   (a) take the next unclustered item and compare it with all existing cluster centroids;
   (b) if the distance to the closest centroid is less than min_dist, add the item to the closest cluster and update the corresponding centroid;
   (c) otherwise, create a new cluster containing this item only.

For n items to be clustered and if Cf clusters are found, this algorithm performs in O(n · Cf) time, since each item is compared with the centroids of Cf clusters, and in O(Cf) space: we only need to store the description of the centroid of each cluster. The high dimensionality and sparseness of the web-derived data hurt streaming clustering because, when comparing two items with sparse features, there is a non-negligible probability of those items not sharing any common attribute. This is so even when the items being compared belong to the same class. Such false negatives have a very damaging effect on streaming clustering. If a false negative is found while performing comparisons between an item to be clustered and the existing cluster centroids, the streaming clustering algorithm will assume that the item belongs to a yet unseen class. In such cases a new cluster will be created, leading to an artificial increase in the number of clusters generated for each class, with two direct consequences: (i) during streaming, clustered items will have to be compared with additional clusters, which will degrade computational performance in time and space; and (ii) the final clustering result will be composed of multiple clusters for each class, thus providing a fragmented solution.
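For illustration, a minimal sketch of this simplified one-pass procedure follows (Euclidean distance, a running-mean centroid update and the min_dist threshold are illustrative choices of the sketch, not a reference implementation):

    import random
    import numpy as np

    def streaming_clustering(items, min_dist):
        # Simplified 1-pass streaming clustering: each item joins the closest
        # centroid if it is nearer than min_dist, otherwise it opens a new cluster.
        items = [np.asarray(x, dtype=float) for x in items]
        random.shuffle(items)                      # step 1: shuffle
        centroids, sizes, assignment = [], [], []
        for x in items:                            # step 2: one sequential pass
            if centroids:
                dists = [np.linalg.norm(x - c) for c in centroids]
                j = int(np.argmin(dists))
                if dists[j] < min_dist:            # 2(b): merge and update centroid
                    sizes[j] += 1
                    centroids[j] += (x - centroids[j]) / sizes[j]   # running mean
                    assignment.append(j)
                    continue
            centroids.append(x.copy())             # 2(c): open a new cluster
            sizes.append(1)
            assignment.append(len(centroids) - 1)
        return centroids, assignment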
Fig. 1. Markov model for fragmentation in streaming clustering
Whether this degradation is significant or not depends basically on how probable it is to find a false negative when comparing items with existing clusters. Our claim is that on web-generated data this probability is in fact quite large, since the dimensionality of the spaces is very high and vector representations are very sparse. To make matters worse, fragmentation starts right at the beginning of the clustering process because most items will have nothing in common with the early clusters. To make a more rigorous assessment of the impact of false negatives on the performance of streaming clustering, let us consider only the items belonging to one specific arbitrary class, class A. In the beginning no clusters exist for items of class A, so the first item of that class generates a new cluster, Cluster 1. The following elements of class A to be clustered will have a non-zero probability of being false negatives, i.e., of not being correctly matched with the already existing cluster for class A. (We assume for now that there are no false positives, i.e. that they will not be incorrectly clustered with elements of other classes.) In this case a new cluster, Cluster 2, will be generated. The same rationale applies when the following items of class A are compared with the existing clusters for that class. We assume that in any comparison there is a probability p_fn of incorrectly mismatching the item with a cluster. Therefore, one expects new clusters for class A to be generated as more items are processed by streaming clustering. This behavior can be modeled by an infinite Markov chain as depicted in Figure 1. The probability of having created s clusters after performing streaming clustering for n + 1 items is the probability of being in state s (1, 2, 3, ...) of the chain. Assuming independence, as more clusters are generated the probability of creating a new (false) cluster decreases exponentially, because that would require more consecutive false negative comparisons. Despite the regularities of this Markov chain, deriving general expressions for the probability of a given state after n iterations is relatively hard except for trivial cases (see [3]). However, for the purpose of our analysis, we can perform some simplifications and obtain numeric values for comparison. By truncating the size of the chain to a maximum length (s_max) and changing the last state of the chain to become an "absorbing state" that represents all subsequent states, numeric computation of state probabilities becomes straightforward for any value of p_fn. Table 1 shows the most probable state, s_mp, and its corresponding probability, p_mp, after clustering 10,000 and 100,000 items (with s_max = 16) for various values of p_fn. As can be easily seen, even for very low probabilities of false negatives (p_fn ≤ 0.3), the chances of replicating the number of clusters several times are considerable. In a realistic scenario, values of p_fn > 0.5 can easily occur for dominant classes because item diversity in those clusters can be very significant. Therefore, when performing streaming clustering in such conditions, cluster fragmentation of at least one order of magnitude should be expected.
Table 1. Most probable state of the Markov chain, for 10k and 100k items clustered

    p_fn   s_mp (10k)   p_mp (10k)   s_mp (100k)   p_mp (100k)
    0.2    6            0.626        8             0.562
    0.3    8            0.588        10            0.580
    0.4    10           0.510        13            0.469
    0.5    13           0.454        16            0.844
    0.6    16           0.941        16            1.000
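The figures in Table 1 can be reproduced approximately with the truncated chain just described; the sketch below assumes that, in state s, a new (false) cluster is created with probability p_fn^s (all s comparisons fail) and that the last state s_max is absorbing.

    def chain_state_probs(p_fn, n_items, s_max=16):
        # Probability of being in each state (clusters created so far for one
        # class) after n_items items, for the truncated chain with an absorbing
        # last state; state 1 is the initial state (first item opens Cluster 1).
        probs = [0.0] * (s_max + 1)
        probs[1] = 1.0
        for _ in range(n_items):
            nxt = [0.0] * (s_max + 1)
            for s in range(1, s_max):
                p_new = p_fn ** s                  # all s comparisons fail
                nxt[s] += probs[s] * (1.0 - p_new)
                nxt[s + 1] += probs[s] * p_new
            nxt[s_max] += probs[s_max]             # absorbing state
            probs = nxt
        return probs

    # most probable state for p_fn = 0.3 after 10,000 items:
    # p = chain_state_probs(0.3, 10000); print(max(range(1, 17), key=lambda s: p[s]))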
2.1 Impact on Realistic Streaming Clustering

Actual streaming clustering implementations attempt to solve the fragmentation problem in two ways. The first option is to perform a second pass for clustering the fragmented clusters based on their centroids. The problem with this is that the information that could be used for safely connecting two clusters (i.e., the points in between them) has been lost to the centroid descriptions, and these might be too far apart to allow a safe merge, since centroids of other clusters may be closer. This situation can more easily occur for large clusters in high-dimensional and sparse spaces, where sub-clusters of items might be described by almost disjoint sets of features, and thus be actually distant in the hyperspace. Thus, for web-derived data, re-clustering will not necessarily solve the fragmentation problem, although such an approach is often successful on lower-dimensional and homogeneous datasets. A second variation of streaming clustering algorithms keeps a larger number of clusters than the final target, and alternates between adding new items to clusters and considering current clusters for merging. However, if each of the items included in a cluster has a sparse representation, and if such "intermediate" clusters have a high level of intra-cluster similarity (as they are supposed to, in order to avoid adding noisy items), then the centroids will probably also have a sparse feature representation. As more items are clustered, each of these many intermediate clusters will tend to have projections on only a small set of features, i.e. those of the relatively few and very similar items it contains. Therefore, feature overlap between clusters will tend to be low, approximately in the same way item feature overlap is low. Such centroids will thus suffer from the same false negative problems as individual items do, and the number of potential clusters to hold in memory may grow large. In practice, unless one reduces the minimum inter-cluster similarity for performing merge operations (which could lead to noisy clusters), this strategy will not lead to as many cluster merging operations as expected, and many fragmented clusters will persist in the final solution. Again, the fragmentation effect should be more visible for larger clusters in high-dimensional and sparse spaces.
3 Clustering by Finding Connected Components

It is easy to understand that overcoming the problems generated by false negatives involves changing the way comparisons are made: somehow we need to obtain more information about the similarity between items to compensate for the effect of false negatives, but that needs to be done without compromising time and space restrictions. Complete information about item similarity is given by the Link Graph, G, of the items.
Two items are linked in G if their level of pair-wise similarity is larger than a given threshold. The information contained in the Link Graph should allow us to identify the clusters corresponding to the classes. Ideally, items belonging to the same class should exhibit very high levels of similarity and should thus belong to the same connected component of G. On the other hand, items from different classes should almost never have any edges connecting them, implying that they would not be part of the same connected components. In other words, each connected component should be a cluster of items of the same class, and there should be a 1-1 mapping between connected components (i.e. clusters) and classes. Clustering by finding connected components is robust to the problem of false negatives, because each node in G is expected to be linked to several other nodes (i.e. for each item we expect to find similarities with several other items). The effect of false negatives can be modeled by randomly removing edges from G. For a reasonably connected G, random edge removal should not significantly affect the connectivity within the same connected component, since it is highly unlikely that all critical edges get removed simultaneously. The larger the component, the more unlikely it is that random edge removal will fragment that component, because more connectivity options should exist. Thus, for web-derived data sets, where the probability of false negatives is non-negligible, clustering by finding the connected components of the link graph seems to be an especially appropriate option. Naive approaches to building G would attempt an all-against-all comparison strategy. For large data sets that would certainly be infeasible due to time and RAM limitations. However, an all-against-all strategy is not required. If our goal is simply to build the Link Graph for finding the true connected components, then we only need to ensure that we make enough comparisons between items to obtain a sufficiently connected graph, Gmin, which has the same set of connected components as the complete Link Graph G. This means that Gmin only needs to contain enough edges to allow retrieving the same connected components as if a complete all-against-all comparison strategy had been followed. In the most favorable case, Gmin can contain only a single edge per node and still allow retrieving the same connected components as in G (built using an all-against-all comparison strategy). Since efficient and scalable algorithms exist for finding the connected components of a graph ([4], [5]), the only additional requirement for obtaining a scalable clustering algorithm that is robust to the problem of false negatives is a scalable and efficient algorithm for building the link graph. We start by making the following observation regarding web-derived data sets: because the distribution of items among classes is usually highly skewed, for any item that we randomly pick belonging to a dominant class (possibly only one or two) we should be able to rather quickly pick another item that is "similar" enough to allow the creation of an edge in the link graph. This is so even with the finite probability of finding false negatives, although such negatives will force us to test a few more elements.
In any case, for items in the dominant classes one can establish connections to other items with vastly fewer comparisons than used in an all-against-all comparison scheme. We only need enough connections (e.g., one) to ensure enough connectivity in order to later retrieve the original complete connected components. For the less frequent items many more comparisons will be needed to find another “similar enough” item,
since such items are, by definition, rare. But since rare items are rare, the total number of comparisons is still much lower than what is required under a complete all-against-all strategy. We use a simple procedure: for each item, keep comparing it with the other items until k_pos similar items are found, so as to ensure enough connectivity in the Link Graph. More formally, we start by shuffling the items in set S(n) to obtain S_rand(n). Each item in S_rand(n) is given a sequential number i. Then, for all the items starting with i = 0:

1. take the item at position i, i_i;
2. set j = 1;
3. repeat until we find k_pos positive comparisons (edges):
   (a) compare item i_i with item i_{i+j};
   (b) increment j.

One can show (Appendix A) that the average computation cost under this "amortized comparison strategy" is:

$\tilde{O}\!\left( \frac{n \cdot |C| \cdot k_{pos}}{1 - p_{fn}} \right)$    (1)

with n the number of items in the set, |C| the number of different true classes, p_fn the probability of false negatives and k_pos the number of positive comparisons, corresponding to the number of edges we wish to obtain for each item. This cost is vastly lower than what would be required for a blind all-against-all comparison strategy, without significantly reducing the chances of retrieving the same connected components. Notice that the computation cost is rather stable under variation of p_fn when p_fn < 0.5. For p_fn = 0.5 the cost is just double that of the ideal case (p_fn = 0), which compares favorably with the values presented in Table 1. One can also show (Appendix A) that the expected value of the maximum number of items that have to be kept in memory during the comparison strategy, n_RAM, is equal to E(n_RAM) = k_pos / (p_min · (1 − p_fn)), where p_min is the percentage of items of the smallest class. This value depends solely on the item distribution of the smallest class and on the probability of false negatives, p_fn. If only 0.1% of the elements to be clustered belong to the smallest class, k_pos = 1 and p_fn = 0.5, then E(n_RAM) = 2000. It is perfectly possible to hold that many vectors in RAM with standard computers. Imposing a hard limit on this value (e.g. 500 instead of 2000) will mostly affect the connectivity of less represented classes. Another important property of this strategy is that link graphs produced this way do not depend too much on the order in which items are picked up to be compared. One can easily see that, ideally (i.e., given no false negatives), no matter which item is picked up first, if we were able to correctly identify any pair of items of the same class as similar items, then the link graph produced would contain approximately the same connected components, although with different links. In practice, this will not always be the case, because false negatives may break certain critical edges of the graph and thus make the comparison procedure order-dependent. A possible solution for this issue is to increase the number of target positive comparisons k_pos, to create more alternatives in case of false negatives and thus reduce the order dependency.
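A sketch of this amortized comparison strategy is given below; the boolean similarity test (e.g. a thresholded Dice coefficient) and the function names are placeholders, and the hard limit k_max mirrors the bound on the number of items kept in RAM discussed above.

    import random

    def build_link_graph(items, similar, k_pos=1, k_max=2000):
        # Shuffle the items, then compare each item with the items following it
        # until k_pos positive comparisons were found (or k_max comparisons made).
        # 'similar' is any boolean pair-wise test, e.g. a thresholded Dice
        # coefficient on the sparse feature vectors.
        items = list(items)
        random.shuffle(items)
        edges = []
        n = len(items)
        for i in range(n):
            found, j = 0, 1
            while found < k_pos and j <= k_max and i + j < n:
                if similar(items[i], items[i + j]):
                    edges.append((i, i + j))
                    found += 1
                j += 1
        return items, edges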
3.1 Finding Connected Components

Given an undirected graph G with vertices {Vi}_{i=1..N} and edges {Ei}_{i=1..K}, we wish to identify all its connected components; that is, we wish to partition G into disjoint sets of vertices Cj such that there is a path between any two vertices in each Cj, and such that there is no path between any two vertices from different components Cj and Ck. There is a well-known [4] data structure called a disjoint-set forest which naturally solves this problem by maintaining an array R of length N of representatives, which is used to identify the connected component to which each vertex belongs. To find the representative of a vertex Vi, we apply the function

    Find(x) { if (R[x] == x) return x; else return Find(R[x]); }

starting at i. Initially R[i] = i for all i, reflecting the fact that each vertex belongs to its own component. When an edge connecting Vi and Vj is processed, we update R[Find(i)] ← Find(j). This naive implementation offers poor performance, but it can be improved by applying both a rank heuristic, which determines whether to update via R[Find(i)] ← Find(j) or R[Find(j)] ← Find(i) when processing a new edge, and path compression, under which Find(i) sets each R[x] it ever visits to be the final representative of x. With these improvements, the runtime complexity of a single Find() or update operation can be reduced to O(α(N)), where α is the inverse of the (extremely fast-growing) Ackermann function A(n, n) [4]. Since A(4, 4) has on the order of 2^(10^19729) digits, the amortized runtime per Find() or update operation is effectively a small constant.
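A compact Python version of the disjoint-set forest with the rank heuristic and path compression (a standard textbook implementation, not the authors' code) might look as follows:

    class DisjointSet:
        # Disjoint-set forest with union by rank and path compression.
        def __init__(self, n):
            self.parent = list(range(n))           # R[i] = i initially
            self.rank = [0] * n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x

        def union(self, i, j):
            ri, rj = self.find(i), self.find(j)
            if ri == rj:
                return
            if self.rank[ri] < self.rank[rj]:      # rank heuristic
                ri, rj = rj, ri
            self.parent[rj] = ri
            if self.rank[ri] == self.rank[rj]:
                self.rank[ri] += 1

    def connected_components(n, edges):
        # Group the n vertices into components after processing all edges.
        ds = DisjointSet(n)
        for i, j in edges:
            ds.union(i, j)
        components = {}
        for v in range(n):
            components.setdefault(ds.find(v), []).append(v)
        return list(components.values())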
4 Experimental Setup

We compared the (simplified) streaming clustering (SC) algorithm with our connected component clustering (CCC) approach on artificially generated data sets. Data sets were generated with properties comparable to web-derived data, namely: (i) Zipfian distribution of class sizes, with one or two dominant classes; (ii) the number of features associated with each class increases sub-linearly with class size; (iii) the number of non-negative features in each item is Zipfian distributed, and larger for larger classes (items have at least three non-negative features); and (iv) the feature distribution inside each class is lightly Zipfian (exponent 0.5), meaning that there is a subset of features that occur more frequently, often enough to make them absolutely discriminant of the class. Each class has its own set of exclusive features. Therefore, in the absence of noise, items of different classes will never share any feature and thus will always have 0 similarity. Overlap between items of different classes can be achieved by adding noisy features, shared by all classes. A given proportion of noise features can be randomly added to each item. To ensure a realistic scenario, we generated a test set with 10,000 items with a Zipfian-like item distribution over 10 classes. Noise features were added so that clustering would have to deal with a medium level of noise. Each item had an additional 30% noise features added, taken from a noise class with 690 dimensions. Noise features have a moderately decaying Zipfian distribution (exponent 1.0).
Table 2. Properties of the test set S30

    Class   Items   dim   avg(#ft)   avg(#ft_noise)   P_no
    1       6432    657   54.14      15.95            0.53
    2       1662    556   48.25      14.14            0.56
    3       721     493   44.13      12.88            0.568
    4       397     448   39.83      11.60            0.589
    5       249     413   34.04      9.84             0.57
    6       187     392   34.70      10.06            0.59
    7       133     366   35.03      10.18            0.58
    8       87      334   29.64      8.56             0.58
    9       77      325   26.71      7.61             0.61
    10      55      300   24.6       7.05             0.61
Table 2 shows some statistics regarding this test set, S30. We show the average number of features per item, avg(#ft), and the average number of noise features per item, avg(#ft_noise). P_no is the probability of not having any overlap between two items randomly picked from a given class (this should be a lower bound for P_fn).

4.1 Measures of Clustering Performance

Given a set of |T| test clusters T to be evaluated, and a gold standard, C, containing the true mapping from the items to the |C| classes, we wish to evaluate how well the clusters in T, t_1, t_2, ..., t_{|T|}, represent the classes in C, c_1, c_2, ..., c_{|C|}. Ideally, all the items from any given test cluster, t_x, should belong to only one class. Such a cluster t_x would then be considered "pure" because it only contains items of a unique class as defined by the gold standard. On the other hand, if items from t_x are found to belong to several gold standard classes, then the clustering algorithm was unable to correctly separate the classes. To quantify how the elements in test cluster t_x are spread over the true classes, we measure the entropy of the distribution of the elements of t_x over all the true classes, c_y. Let i_xy be the number of items from test cluster t_x that belong to class c_y and let |t_x| be the total number of elements of cluster t_x (which can belong to any of the |C| true classes). The cluster entropy of the test cluster t_x over all |C| true classes is:
$e_t(t_x) = - \sum_{y=0}^{|C|} \frac{i_{xy}}{|t_x|} \cdot \ln\!\left( \frac{i_{xy}}{|t_x|} \right)$    (2)
For all test clusters under evaluation we can compute E_t, the weighted average of the entropy of each individual test cluster, e_t(t_x):

$E_t = \frac{\sum_{x=0}^{|T|} |t_x| \cdot e_t(t_x)}{\sum_{x=0}^{|T|} |t_x|}$    (3)
In the most extreme case, all test clusters would have a single element and be "pure". This, however, would mean that no clustering had been done, so we need to simultaneously measure how elements from the true classes are spread throughout the test clusters. Again, we would like to have all items from a given true class in the fewest test clusters possible, ideally only one. Let |c_y| be the number of items in class c_y.
Then, for each true class, c_y, we can compute the class entropy, i.e. the entropy of the distribution of items of that class over all test clusters, by:

$e_c(c_y) = - \sum_{x=0}^{|T|} \frac{i_{xy}}{|c_y|} \cdot \ln\!\left( \frac{i_{xy}}{|c_y|} \right)$    (4)
A global clustering performance figure can be computed as a weighted average over all classes of each individual class entropy:

$E_c = \frac{\sum_{y=0}^{|C|} |c_y| \cdot e_c(c_y)}{\sum_{y=0}^{|C|} |c_y|}$    (5)
Ideally, both E_t and E_c should be as close to zero as possible, meaning that test clusters are "pure" and that they completely represent the true classes. In the case of a perfect clustering (a 1-to-1 mapping between clusters and classes), both E_t and E_c will be 0.
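Both measures follow directly from the contingency counts i_xy; the sketch below implements Equations (2)-(5) with natural logarithms, for two parallel label lists (an assumed input format).

    import math
    from collections import Counter

    def evaluation_entropies(cluster_ids, class_ids):
        # cluster_ids / class_ids: parallel lists of test-cluster and gold-class
        # labels, one pair per item. Returns (E_t, E_c).
        counts = Counter(zip(cluster_ids, class_ids))     # i_xy
        cluster_sizes = Counter(cluster_ids)              # |t_x|
        class_sizes = Counter(class_ids)                  # |c_y|

        def weighted_entropy(sizes, index):
            num, den = 0.0, 0.0
            for key, size in sizes.items():
                e = 0.0
                for pair, i_xy in counts.items():
                    if pair[index] == key and i_xy > 0:
                        p = i_xy / size
                        e -= p * math.log(p)
                num += size * e
                den += size
            return num / den

        e_t = weighted_entropy(cluster_sizes, index=0)    # Eqs. (2)-(3)
        e_c = weighted_entropy(class_sizes, index=1)      # Eqs. (4)-(5)
        return e_t, e_c

    # e.g. evaluation_entropies([0, 0, 1, 1], ['a', 'a', 'a', 'b'])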
5 Results

We compared the performance of our connected components clustering (CCC) algorithm with two other algorithms: simplified 1-pass stream clustering (1p-SC) and 2-pass streaming clustering (2p-SC). The simplified 1-pass streaming clustering was described in Section 2 and was included in the comparison for reference purposes only. The 2-pass streaming clustering consists of performing a re-clustering of the clusters obtained in the first pass, using information about the centroids of the clusters obtained. The re-clustering is done using the exact same stream-clustering procedure, merging clusters using their centroid information. The 2-pass SC algorithm is thus a closer implementation of the standard streaming clustering algorithm. Each of the algorithms has parameters to be set. For the CCC algorithm we have three parameters that control how the "amortized comparison strategy" is carried out: (i) the minimum item similarity, s_min_cc; (ii) the target number of positive comparisons for each item, k_pos; and (iii) the maximum sequence of comparisons that can be performed for any item, k_max (which is equivalent to the maximum number of items we keep simultaneously in RAM). The k_pos and k_max parameters were kept constant in all experiments: k_pos = 1, k_max = 2000 (see Section 3). The 1-pass SC algorithm has only one parameter, s_min_p1, which is the minimum distance between an item and a cluster centroid to merge it into that cluster. The 2-pass SC algorithm has one additional parameter in relation to the 1-pass SC: s_min_p2 controls the minimum distance between the centroids for the corresponding clusters to be merged together in the second pass. The vector similarity metric used in all algorithms was the Dice metric. Since all algorithms depend on the order of the items being processed, items were shuffled before being clustered. This process (shuffling and clustering) was repeated 5 times for each configuration. All results shown next report the average over the 5 experiments. Figure 2 shows the E_t ("cluster impurity") vs. E_c ("class dispersion") curves obtained for the three algorithms, using the test set S30. Results were obtained by changing s_min_cc, s_min_p1 and s_min_p2 from relatively high values that ensured almost pure yet fragmented clusters (E_t ≈ 0 but E_c >> 0) to lower values that led to the generation of fewer but much noisier clusters (E_c < 1 but E_t >> 0).
Fig. 2. Ec (y-axis) vs. Et (x-axis) for four clustering methods. CCC gives better results than most streaming clustering configurations, and is comparable to a carefully tuned streaming method.
We compared the results of the CCC algorithm with results obtained from the 1-pass SC (1p-SC) and two different configurations of the two-pass stream-clustering algorithm: 2p-SC(A) and 2p-SC(B). Configuration 2p-SC(A) was obtained by changing s_min_p2 while keeping the value of s_min_p1 constant at a level that ensured that the partial results from the first pass would have high purity (yet very high fragmentation). For configuration 2p-SC(B), we followed a different strategy for setting the parameters: we kept s_min_p2 constant at a medium level, and slowly decreased s_min_p1 to reduce the fragmentation of the partial clusters. Configuration 2p-SC(B) was found to be the best-performing combination among all (several dozen) configurations tried for the two-pass clustering algorithm. We manually verified that, for this test set, values of E_t larger than 0.3 indicate that the clusters produced are mixing items from different classes, so Figure 2 only shows results for E_t < 0.4. We made further comparisons between our CCC algorithm and the best-performing configuration of the 2p-SC algorithm. Table 3 shows the results of this comparison when aiming at a target value of E_t = 0.15. Relevant criteria for comparing clustering quality are the E_t and E_c values, the number of clusters generated (# clusters) and the number of singleton clusters (# singletons) produced. For comparing computational performance we present the number of comparisons made (# comparisons) and the overall execution time of each algorithm. For 2p-SC we show statistics regarding both the intermediate results (i.e., after pass 1) and the final results (after pass 2), so as to emphasize their relative contributions. Table 4 shows a typical example of the cluster / true class distribution of the top 10 clusters for the results obtained. (Compare with Table 2.) The existence of two or more clusters for Class 1 (and sometimes also for Class 2) was a common result for the 2p-SC algorithm.
6 Analysis of Results

The results plotted in Figure 2 show that the connected components clustering (CCC) algorithm we propose gives clustering qualities very close to those of the best-performing 2-pass streaming clustering approach (2p-SC).
Table 3. Comparison between 2p-SC and CCC for target cluster purity E_t = 0.15

                     2p-SC (pass 1)   2p-SC (final)   CCC
    E_t              0.08             0.15            0.15
    E_c              7.64             1.1             1.53
    # clusters       755.4            184             647.6
    # singletons     66.4             66.4            478.2
    # comparisons    4.2M             74k             2.2M
    t (secs.)        142              4               42
Additionally, the CCC algorithm consistently required approximately only half the number of comparisons to produce results comparable to the 2p-SC, as the first pass of streaming clustering tends to generate heavy fragmentation (and hence E_c > 6). This is especially the case for the relevant part of the E_t / E_c curve (E_t ≤ 0.3). Thus, we can obtain a significant improvement in computational performance in the regime we most care about. The results in Table 3 suggest that in practice, CCC may have better results than 2p-SC. The E_c (fragmentation) values that the CCC algorithm obtains are worsened by the extremely large tail of singleton or very small clusters that are produced. (These are outliers and items at the end of the buffer that ended up not having the chance to be compared to many others.) So, if one were to ignore these smaller clusters in both cases (since filtering is often required in practice), the new corresponding E_c values would become closer. The question of filtering is, in fact, very important and helps to show another advantage of the CCC for clustering data when processing Zipfian distributed classes in sparse vector spaces. As can be seen from Table 4, 2p-SC failed to generate the single very large cluster for items in Class 1. Instead it generated two medium-size clusters. This type of behavior, which occurred frequently in our experiments for large classes (e.g., 1, 2 and 3), is an expected consequence of the greedy nature of the streaming clustering algorithm. During streaming clustering, if two clusters of the same class happen to have been started by two distant items (imagine, for example, the case of a class defined by a "bone-like" hull), greedy aggregation of new items might not help the two corresponding centroids to become closer, and can even make them become more distant (i.e. closer to the two ends of the bone). In high-dimensional and sparse spaces, where classes are very large and can have very irregular shapes, such local minima can easily occur. Thus, if we were to keep only a few of the top clusters produced by 2p-SC (e.g., the top 5), there would be a high probability of ending up only with fragmented clusters corresponding to the one or two (dominant) classes, and thus lose the other medium-sized, but still important, clusters. The CCC algorithm we propose, in contrast, is much more robust to this type of problem. CCC tends to transfer the distribution of the true classes to the clusters, at least for the larger classes, where the chances of finding a link between connected components of the same class are higher. Only smaller classes will be affected by fragmentation. Thus, filtering will mostly exclude only clusters from these smaller classes, keeping the top clusters that should directly match the corresponding top classes. Excluded items might be processed separately later, and since they will be only a small fraction of the initial set of items, more expensive clustering methods can be applied.
Table 4. Typical cluster / true class distribution for target cluster purity E_t = 0.15

               CCC                     2p-SC
    Cluster    True Class [#Items]     True Class [#Items]
    1          1 [6113]                1 [3302]
    2          2 [1405]                1 [3087]
    3          3 [582]                 2 [1573]
    4          4 [321]                 3 [636]
    5          5 [170]                 4 [323]
    6          6 [134]                 5 [192]
    7          7 [96]                  6 [150]
    8          9 [40]                  7 [100]
    9          4 [38]                  8 [68]
    10         8 [37]                  9 [58]
    11         1 [32]                  10 [36]
    12         10 [30]                 2 [18]
7 Related Work

Streaming clustering [2,6] is one of the most famous classes of algorithms capable of processing very large data sets. Given a stream of items S, classic streaming clustering alternates between linearly scanning the data and adding each observation to the nearest center, and, when the number of clusters formed becomes too large, clustering the resulting clusters. Alternatively, the data can be partitioned, each partition clustered in a single pass, and then the resulting clusters can themselves be clustered. BIRCH is another classic method for clustering large data sets. BIRCH performs a linear scan of the data and builds a balanced tree where each node keeps summaries of the clusters that best describe the points seen so far. New items to be clustered are moved down the tree until they reach a leaf, taking into account the distance between their features and the node summaries. Leaves can be branched when they are over-crowded (have too many items), leading to sharper summaries. BIRCH then applies hierarchical agglomerative clustering over the leaf summaries, treating them as individual data points. The overall complexity is dominated by the tree insertion performed in the first stage. A different approach to reducing computational complexity is presented in [7]. In a first stage the data is divided into overlapping sets called canopies, using a very inexpensive distance metric. This can be done, for example, using an inverted index of features. Items under the same inverted index entry (i.e. that share the same feature) fall into the same canopy. In a second stage, an exact - and more expensive - distance metric is used only to compare elements that have been placed in the same canopy. These last three methods process the data in two passes, unlike our method which uses only a single pass. None of the other methods deals explicitly with the problem of false negatives, which is crucial for web-derived data. The first two methods also suffer a non-negligible risk of reaching sub-optimal solutions due to their greedy nature. Another line of work aims at finding efficient solutions to the problems arising from high dimensionality and sparsity, especially those concerned with measuring similarities between items in such spaces [1]. CLIQUE [8] is a density-based subspace clustering algorithm that circumvents problems related to high dimensionality by first clustering on a 1-dimensional axis only and then iteratively adding more dimensions. In [9], the authors use an approximation to a nearest-neighbor function for a very high-dimensional feature space to recommend news articles, based on user similarity. Instead of directly comparing users, a Locality Sensitive Hashing [10] scheme named Min-Hashing (Min-wise Independent Permutation Hashing) is used.
These last three methods process data in two passes, unlike our method, which uses only a single pass. None of the other methods deals explicitly with the problem of false negatives, which is crucial in web-derived data. The first two methods also suffer a non-negligible risk of reaching sub-optimal solutions due to their greedy nature. Another line of work aims at finding efficient solutions to the problems arising from high dimensionality and sparsity, especially those concerned with measuring similarities between items in such spaces [1]. CLIQUE [8] is a density-based subspace clustering algorithm that circumvents problems related to high dimensionality by first clustering on a single dimension only and then iteratively adding more dimensions. In [9], the authors use an approximation to a nearest-neighbor function for a very high-dimensional feature space to recommend news articles, based on user similarity. Instead of directly comparing users, a Locality Sensitive Hashing [10] scheme named Min-Hashing (Min-wise Independent Permutation Hashing) is used. For each item ij (i.e. user) in the input set S, the hash function H(ij) returns the index of the first non-null feature of the corresponding feature vector (each feature corresponding to a click from the user on a given news item). If random permutations of the feature positions are applied to S, then it is easy to show ([11], [10]) that the probability of two items hashing to the same value, H(ij) = H(ik), is equal to their Jaccard coefficient J(ij, ik). Min-hashing can thus be seen as a probabilistic clustering algorithm that clusters together two items with a probability equal to their Jaccard coefficient. The hash keys for p different permutations can be concatenated so that two items will converge on the same key with probability J(ij, ik)^p, leading to high-precision, yet small, clusters. Repeating this process for a new set of p permutations will generate different high-precision clusters, giving increased recall. For any item ij it is possible to obtain the list of its approximate nearest neighbors by consulting the set of clusters to which ij was hashed. Since the clusters produced by min-hashing are very small, it will produce extremely fragmented results when directly used for clustering large data sets. It could, however, potentially be used as an alternative technique for building the link graph, because it provides a set of nearest neighbors for each item. However, there is no assurance that the link graph thus created would contain the complete connected components. Clusters extracted from that graph could thus be very fragmented.
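The min-hashing scheme just described can be illustrated with a short textbook-style sketch: explicit random permutations over a small feature space, not the scalable implementation used in [9]. Items are assumed to be sets of feature indices, and all parameter names are ours.

```python
import random
from collections import defaultdict

def minhash_key(features, permutations):
    """Concatenate p min-hash values; two items share a key with probability ~ J^p."""
    return tuple(min(perm[f] for f in features) for perm in permutations)

def minhash_clusters(items, num_features, p=2, repetitions=3, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in range(repetitions):                 # each repetition boosts recall
        perms = []
        for _ in range(p):
            order = list(range(num_features))
            rng.shuffle(order)                    # random permutation of feature positions
            perms.append(order)
        for item_id, features in items.items():
            buckets[(r, minhash_key(features, perms))].append(item_id)
    # buckets with more than one member are the small, high-precision clusters
    return [b for b in buckets.values() if len(b) > 1]

items = {0: {1, 2, 3}, 1: {1, 2, 4}, 2: {7, 8}}
print(minhash_clusters(items, num_features=10))
```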
8 Conclusion and Future Work
We have seen that the Zipfian distribution of features and of feature classes for problems such as web-document clustering can lead to cluster fragmentation when using methods such as streaming clustering, as individual items often fail to share any features with the cluster centroid. (Streaming clustering using medoids, as is often done in the theory literature, would be much worse, as most items would fail to intersect with the medoid.) Connected component clustering does a better job of addressing this problem, as it keeps searching for items close to each target item being clustered until they are found. This is not as expensive as it sounds, since it is easy to find connected items for the many items that are in large classes. We showed that a reasonably connected link graph can be obtained using an item comparison procedure with cost amortized to O(n · C). We showed that the performance of our algorithm is comparable to the best-performing configurations of a streaming clustering approach, while consistently reducing the number of comparisons by half. Another important characteristic of our algorithm is that it is very robust to fragmentation and can thus transfer the distribution of true classes to the resulting clusters. Basically, this means that the top largest clusters will represent the top largest classes, which is fundamental when filtering is required. The above work has described the clustering as if it were done on a single processor. In practice, web-scale clustering requires parallel approaches. Both stages of our algorithm (the amortized comparison procedure and the procedure for finding the connected components of the link graph) are especially suited to being implemented in the Map-Reduce paradigm [12]. Future work will focus on a parallel implementation of our algorithm using the Map-Reduce platform and studying its scalability and performance.
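The paper gives no pseudo-code for the second stage mentioned above (extracting connected components of the link graph); the following sketch shows one standard way to do it on a single machine with union-find. It is our illustration only, not the authors' code.

```python
def connected_component_clusters(num_items, links):
    """Each cluster is a connected component of the link graph (union-find with path compression)."""
    parent = list(range(num_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for a, b in links:                       # each link merges two components
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for i in range(num_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

print(connected_component_clusters(5, [(0, 1), (1, 2), (3, 4)]))   # [[0, 1, 2], [3, 4]]
```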
Acknowledgments
This work was developed while Luís Sarmento was an engineering intern and Lyle Ungar was a visiting researcher at Google offices in NYC. The authors would like to thank the Google team for all the help and support. Also, special thanks to Paula Milheiro (University of Porto, Portugal) for valuable discussions regarding Markov chains.
References 1. Aggarwal, C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2000) 2. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering 15(3), 515– 528 (2003) 3. Samuel-Cahn, E., Zamir, S.: Algebraic characterization of infinite markov chains where movement to the right is limited to one step. Journal of Applied Probability 14, 740–747 (1977) 4. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press and McGraw-Hill Book Company (1990) 5. Hopcroft, J., Tarjan, R.: Algorithm 447: efficient algorithms for graph manipulation. Commun. ACM 16(6), 372–378 (1973) 6. Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: STOC 2003: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pp. 30–39. ACM, New York (2003) 7. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: KDD 2000: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169–178. ACM, New York (2000) 8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998) 9. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW 2007: Proceedings of the 16th international conference on World Wide Web, pp. 271–280. ACM, New York (2007) 10. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proc. of 30th STOC, pp. 604–613 (1998) 11. Broder, A.Z.: On the resemblance and containment of documents. In: SEQS: Sequences 1991 (1998) 12. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: OSDI 2004: Sixth Symposium on Operating System Design and Implementation (2004)
A Demonstrations
Consider the set I containing |I| items that belong to C classes c1, c2, c3, . . . , cC. Let pji be the probability of an item (or element) ej randomly picked from I belonging to class ci: P(ej ∈ ci) = pji, with 1 ≤ i ≤ C. Now consider the problem of sequentially comparing items in I (previously shuffled) in order to find items similar to the initial (target) item. If we randomly pick one item ej from I, we wish to estimate the number of additional items that we need to pick (without repetition) from I before we find another item that belongs to the same class. For a sufficiently large set of items the probabilities P(ej ∈ ci) do not change significantly when we pick elements out of I without replacement, and we can consider two subsequent draws to be independent. We can thus make P(ej ∈ ci) = pi and approximate this procedure by a Bernoulli process. Therefore, for a given element of class ci, the number of comparisons ki needed for finding a similar item follows a geometric distribution with parameter pi. The expected value for ki is E(ki) = 1/pi. For C classes, the average number of comparisons is:

E(k) = Σ_{c=1..|C|} pc · E(kc) = Σ_{c=1..|C|} pc · (1/pc) = |C|.    (6)

For sufficiently large |I|, the number of classes will remain constant during almost the entire sampling process. Thus, the total number of comparisons for the |I| items is Ncomp = |I| · |C|. If we extend the previous item comparison procedure to find kpos similar items to the target item, we can model the process by a negative binomial distribution (or Pascal distribution) with parameters pi and kpos:

Bneg(ki, kpos) = C(ki − 1, kpos − 1) · pi^kpos · (1 − pi)^(ki − kpos),    (7)

where C(·, ·) denotes the binomial coefficient. In this case, the average number of comparisons made, given by the corresponding expected value, is EBneg(ki, kpos) = kpos / pi. The longest series of comparisons will be made for the class with the lowest pi, i.e. the smallest class. However, this leads to an average number of comparisons, when considering all |C| classes, of:

Ecomp(k) = Σ_{c=1..|C|} pc · EBneg(kc, kpos) = kpos · |C|.    (8)

For all |I| items we should thus have Ncomp = |I| · |C| · kpos. If we now consider that there is a probability pfn of having a false negative when comparing two items, and that pfn is constant and independent of classes, then pi should be replaced by pi · (1 − pfn), i.e. the probability of a random pick finding another item in class ci has to be multiplied by the probability of not having a false negative. Then all the above equations change by a constant factor, giving:

Ncomp = (|I| · |C| · kpos) / (1 − pfn).    (9)

Likewise, the expected value for the longest series of comparisons is obtained by performing the same substitution in the expression for EBneg(ki, kpos) and making pi = pmin:

Els = kpos / (pmin · (1 − pfn)).    (10)
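The key claim of equation (6), that the average number of comparisons equals |C| regardless of how skewed the class distribution is, can be sanity-checked numerically. The following short simulation is ours (not part of the paper) and uses an illustrative Zipf-like class distribution under the same independence approximation made above.

```python
import random

def avg_comparisons(class_probs, trials=20000, seed=1):
    """Empirically estimate E(k): number of draws until an item of the target's class appears."""
    rng = random.Random(seed)
    classes = list(range(len(class_probs)))
    total = 0
    for _ in range(trials):
        target = rng.choices(classes, weights=class_probs)[0]   # pick a target item's class
        k = 1
        while rng.choices(classes, weights=class_probs)[0] != target:
            k += 1                                              # keep drawing until a match
        total += k
    return total / trials

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]   # a Zipf-like distribution over |C| = 5 classes
print(avg_comparisons(probs))                 # should be close to |C| = 5
```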
A Probabilistic Approach for Constrained Clustering with Topological Map

Khalid Benabdeslem and Jihene Snoussi

University of Lyon, F69622-Lyon, France
University of Lyon1 - LIESP (EA 4125), Villeurbanne
{kbenabde,jihene.snoussi}@bat710.univ-lyon1.fr
Abstract. This paper describes a new topological map dedicated to clustering under probabilistic constraints. In general, traditional clustering is used in an unsupervised manner. However, in some cases, background information about the problem domain is available or imposed in the form of constraints in addition to data instances. In this context, we modify the popular GTM algorithm to take these "soft" constraints into account during the construction of the topology. We present experiments on known benchmark data sets with artificially generated constraints, comparing the approach with both GTM and other constrained clustering methods. Keywords: Constrained Clustering, Soft constraints, GTM.
1
Introduction
Clustering is known as an important task in the knowledge discovery and data mining process. It allows one to find, on the one hand, the relationships between data set observations and, on the other hand, the relationships between these observations and their characteristics (variables). It aims to group observations into classes or categories, such that all objects in the same group are similar to each other, while dissimilar from objects in other groups [14]. However, traditional clustering algorithms only access the variables which describe each data item and do not deal with any other kind of given information. Nevertheless, taking a priori knowledge into account in such algorithms, if it exists, is an important problem and a real challenge in current clustering research. It concerns a recent area in learning and data mining research: constrained clustering [1]. Furthermore, the last eight years have seen extensive work on incorporating instance-level constraints into clustering methods. The first work in this area proposed a modified version of COBWEB [13] that strictly enforced pairwise constraints [21]. It was followed by an enhanced version of the widely used k-means algorithm [18] that could also accommodate constraints, called COP-kmeans [22]. Moreover, in [10], an exploration of the use of instance and cluster-level constraints was performed with agglomerative hierarchical clustering. In [12] we have proposed a new graph-based constrained clustering algorithm called COP-b-coloring, where we have shown improvements in quality and computational
complexity of clustering by using constraints in a graph b-coloring clustering algorithm. Theoretically, it was proven that clustering with constraints raises an intractable feasibility problem [7][9]: simply finding any clustering that satisfies all constraints is hard, via a reduction from graph coloring. Recently, we have developed a new version of the Self-Organizing Map algorithm (SOM) [15] based on controlling which neurons respect or violate some given constraints on patterns during the construction of the topological map [3]. In all of the cited methods, the constraints are available and presented in a deterministic manner. We talk here about "hard" constraints, meaning that a constraint either exists (1) or not (0). But sometimes these constraints can be given as probabilities. Thus, it is important to work in a probabilistic framework to deal with this kind of "soft" constraint. In this context, some works have been proposed to accommodate noise or uncertainty in the constraints [16][17][20]. Other methods seek to satisfy as many constraints as possible, but not necessarily all of them [2][9]. Methods such as PKM and MPKM permit the specification of an individual weight for each constraint, addressing the issue of variable per-constraint confidences [5]. The main contribution of this paper is to adopt soft constraints in self-organizing-map-based clustering. Each constraint becomes a real value between 0 and 1. The value of the constraint reflects the certainty of the prior knowledge that a pair of objects comes from the same cluster. Our main clustering algorithm is based on a generative model, where constraints are integrated and identified in the neurons of the corresponding topographic neural network. In this sense, we extend the work by Bishop et al. [4], whose method (GTM: Generative Topographic Mapping) offers a popular probabilistic paradigm of SOM, into a probabilistic topological constrained clustering approach that we call PrTM.
2
Generative Topographic Mapping
The Generative Topographic Mapping (GTM) was proposed by Bishop, Svensén and Williams in [4]. It is presented as a probabilistic re-formulation of SOM with a certain number of advantages. GTM has already been used in a variety of applications. It represents a transformation process of observations between two layers: a first layer corresponding to the input space, which represents the data, and a second layer corresponding to the output space, which represents the set of obtained clusters (Fig. 1). GTM is a non-linear latent (hidden) variable model which represents the probability density of high-dimensional data (the data space) in terms of a small set of latent variables (the latent space), based on a constrained mixture of Gaussians whose parameters can be optimized using the EM (Expectation-Maximization) algorithm [11]. The latent space is situated between the input space and the output one, and it enables the non-linear transformation between them. By considering a regular grid of nodes in latent space (corresponding to the neurons in SOM), non-linear functions are used between data space and latent space for representing the one in the other (Fig. 1). Some visualization-based models regard this as the projection of the data space into a 2-dimensional space.
Fig. 1. The non-linear transformations between data space and latent space; here D = 3, L = 2
Nevertheless, the GTM model defines the visualization of the data by the configuration of the latent space in the data space. Therefore, we carry out the opposite projection by using Bayes' theorem, obtaining a posterior distribution in latent space. The GTM model is defined by:
- a data space X of dimension D;
- a latent space U of dimension L. In this space, a grid of nodes uk, labelled by k = 1..K (corresponding to the neurons in SOM), is defined;
- a set of M non-linear basis functions φ = {φj} which provide the non-linear mapping into data space.
This allows each node uk (neuron in SOM) from U to be associated with an observation yk (referent vector in SOM) in data space, obtained by a transformation function y = y(u; W) (Fig. 2). This function maps the latent variable space into an L-dimensional non-Euclidean manifold S embedded within the data space. y(u; W) can be given by a generalized linear regression model of the form:

yk = W φ(uk),    (1)

where the elements of φ(uk) consist of M fixed basis functions φj and W is a D × M matrix. For each node u in the grid of the latent space, its probability is defined by:

p(u) = (1/K) Σ_{k=1..K} δ(u − uk).    (2)
Each observation xi can then be mapped to a corresponding observation y(xi; W) in data space, which forms the centre of a Gaussian density function having variance β−1. The conditional probability of x given uk is:

p(x/uk, W, β) = (β/2π)^{D/2} exp(−(β/2) ‖yk − x‖²).    (3)
Fig. 2. Mapping of each node ui to a corresponding observation y(ui ; W ) in data space
The distribution in x-space, for a given W, is then obtained by integration over the u-distribution:

p(x/W, β) = ∫ p(x/u, W, β) p(u) du = (1/K) Σ_{k=1..K} (β/2π)^{D/2} exp(−(β/2) ‖yk − x‖²).    (4)
The parameters to be optimized are W and β. For that, the model uses the EM algorithm, alternating the E-step and the M-step. This algorithm finds maximum likelihood estimates of the parameters of probabilistic models depending on hidden variables. We denote the data set {xn}, where n = 1, . . . , N. The log likelihood function is given by:

L(W, β) = Σ_{n=1..N} ln p(xn/W, β).    (5)
where p(x/W, β) is given by (4). We can maximize this log likelihood function by finding expressions for its derivatives and using these in a standard non-linear optimization algorithm such as conjugate gradients. Alternatively, we can exploit the latent-variable structure of the model and use the expectation-maximization (EM) algorithm. In the E-step, we use the current values of the parameters W and β to evaluate the posterior probability, or responsibility, which each component i takes for every data instance xn, which, using Bayes' theorem, is given by:

Rni = p(i/xn) = p(xn/i) / Σ_j p(xn/j),    (6)

in which the prior probabilities P(i) = 1/K have cancelled. Using (3) we can rewrite this in the form:

Rni = exp(−(β/2) ‖mi − xn‖²) / Σ_j exp(−(β/2) ‖mj − xn‖²),    (7)

where mi = W φ(ui) denotes the i-th mixture centre.
Then in the M-step we use the responsibilities to re-estimate the weight matrix W by solving the following system of linear equations:

(φ^T G φ) W_new^T = φ^T R X,    (8)
which follows from maximization of the expected complete-data log likelihood. In (8), φ is a K × M matrix with elements φij = φj(ui), X is an N × D matrix with elements xnk, R is a K × N matrix with elements Rni, and G is a K × K diagonal matrix with elements Gii = Σ_n Rni. The inverse variance parameter is also re-estimated in the M-step using:

1/βnew = (1/(N D)) Σ_{n=1..N} Σ_{i=1..K} Rni ‖Wnew φ(ui) − xn‖².    (9)
A detailed derivation of the EM algorithm for GTM can be found in [4].
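For readers who want to experiment, the E- and M-steps of equations (7)-(9) can be written compactly with numpy. The sketch below is our simplification, not the reference implementation of [4]: the basis activations φ and the initialization are treated as given placeholders, W is stored transposed (M × D) so that φW yields the K mixture centres, and a small ridge term is added purely for numerical stability.

```python
import numpy as np

def gtm_em(X, Phi, W, beta, n_iter=20):
    """X: (N, D) data; Phi: (K, M) with Phi[k, j] = phi_j(u_k);
    W: (M, D) weights (transposed w.r.t. the text); beta: inverse noise variance."""
    N, D = X.shape
    for _ in range(n_iter):
        # E-step: responsibilities R[k, n], equation (7)
        centres = Phi @ W                                             # (K, D) mixture centres
        d2 = ((centres[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # (K, N) squared distances
        logR = -0.5 * beta * d2
        logR -= logR.max(axis=0, keepdims=True)                       # numerical stability
        R = np.exp(logR)
        R /= R.sum(axis=0, keepdims=True)
        # M-step: solve (Phi^T G Phi) W = Phi^T R X, equation (8)
        G = np.diag(R.sum(axis=1))
        A = Phi.T @ G @ Phi + 1e-6 * np.eye(Phi.shape[1])             # small ridge term (ours)
        W = np.linalg.solve(A, Phi.T @ R @ X)
        # update the inverse variance with the new centres, equation (9)
        centres = Phi @ W
        d2 = ((centres[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        beta = N * D / (R * d2).sum()
    return W, beta, R

# toy usage with random placeholders for the basis activations and initial weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # N = 100 observations in D = 3
Phi = rng.uniform(size=(25, 6))        # K = 25 latent nodes, M = 6 basis functions
W0 = rng.normal(size=(6, 3))
W, beta, R = gtm_em(X, Phi, W0, beta=1.0)
```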
3
Topographic Mapping and Soft Constraints
In this section we present our probabilistic approach, called PrTM, which is an adaptation of GTM to soft constraints. First, we define the various types of constraints which can be applied to data. Then, we present an automatic process for generating soft constraints from a training data set. Lastly, we show how to modify the GTM model in order to adapt it to this type of constraint. 3.1
Constraints
Constraints provide guidance about the desired partition and make it possible for clustering algorithms to increase their performance [8]. Let X = {x1, ..., xn} denote the given set of observations which must be partitioned into K clusters, denoted by u1, ..., uK. For any pair of observations xi, xj in X, we denote the distance between them by d(xi, xj). The different types of constraints that we can meet are:
(a) Must-Link constraint (ML): involving xi and xj, it specifies that they must be placed in the same cluster.
(b) Cannot-Link constraint (CL): involving xi and xj, it specifies that they must be placed in different clusters.
(c) δ-constraint: this constraint specifies a value δ > 0. In any solution satisfying this constraint, the distance between any pair of observations which are in two different clusters must be at least δ.
(d) ε-constraint: this constraint specifies a value ε > 0 and the feasibility requirement is the following: for any cluster ui containing two or more observations and for any observation xp ∈ ui, there must be another observation xq ∈ ui such that d(xp, xq) ≤ ε.
(e) Soft constraints: they can reflect the uncertainty associated with a priori knowledge about pairs of observations that should or should not belong to the same cluster. They are directly related to ML and CL constraints:
(e1) Probabilistic or soft Must-Link constraint (PML): involving xi and xj, it specifies a probability that these two observations should be in the same cluster.
(e2) Probabilistic or soft Cannot-Link constraint (PCL): involving xi and xj, it specifies a probability that these two observations should be in different clusters.
3.2
Generating Soft Constraints
Constraints can be generated from background knowledge about the data set (directly specified by users) or from a subset of data with known labels. Based on the strong positive empirical results that have been reported, the opinion of the community is that constraints help improve clustering performance with respect to accuracy, as measured on the set of extrinsic labels used to generate the constraints [8]. In this sense, we show how to produce soft constraints PML and PCL based on the distance between observations from a labelled data set. First, a constraint is generated from a randomly selected pair of observations as follows: if they have the same label, we generate an ML constraint, otherwise we generate a CL constraint. The constraints can be expanded by taking into account the transitive closure:
- ML(xi, xj) ∧ ML(xj, xk) =⇒ ML(xi, xk)
- ML(xi, xj) ∧ CL(xj, xk) =⇒ CL(xi, xk).
Then, in the same way, we impose constraints over the probabilities to be generated. In other words, for a soft constraint PML between xi and xj, we must satisfy:

1/2 ≤ PML(xi, xj) ≤ 1,    (10)

and for PCL, we must verify:

0 < PCL(xi, xj) ≤ 1/2.    (11)

Since the calculation of probabilities is mainly based on the distance between observations, another condition must be satisfied: as the distance between two observations increases, the probability of their being in the same cluster decreases. In other words, for two pairs of observations (xi, xj) and (xk, xl), both with PML constraints:
d(xi, xj) ≤ d(xk, xl) ⇒ PML(xi, xj) ≥ PML(xk, xl).
To verify (10) when generating PML (respectively (11) for PCL), we must define appropriate probabilities. In fact, a probability can be written in the form a/b, where a, b ∈ R and a < b. So, we must find a relationship between a and b that satisfies (10) for PML. It is always true that

1 − a/b < 1.    (12)

Then, we specify for PML a probability higher than 1/2, to bring together two observations even if they are distant:

1 − a/b > 1/2.    (13)

With Δ > 0, if we put b = 2a + Δ, we satisfy (13). Thus, if we replace a by d(xi, xj), we can satisfy:

a = d(xi, xj) < 2d(xi, xj) < 2d(xi, xj) + Δ = b.    (14)

We can follow the same approach for PCL by substituting (13) with

1/2 − a/b > 0.    (15)

Finally, the proposed formulas for the probabilities are:
- For a PML constraint:

PML(xi, xj) = 1 − d(xi, xj) / (2d(xi, xj) + Δ).    (16)

- For a PCL constraint:

PCL(xi, xj) = 1/2 − d(xi, xj) / (2d(xi, xj) + Δ).    (17)
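In code, equations (16) and (17) amount to two one-liners. The sketch below is ours; delta stands for the user-chosen positive constant Δ, and the distance d is assumed to be precomputed.

```python
def p_must_link(d, delta=1.0):
    """Equation (16): soft must-link probability, always in (1/2, 1]."""
    return 1.0 - d / (2.0 * d + delta)

def p_cannot_link(d, delta=1.0):
    """Equation (17): soft cannot-link probability, always in (0, 1/2]."""
    return 0.5 - d / (2.0 * d + delta)

print(p_must_link(0.1), p_must_link(10.0))      # ~0.92 and ~0.52
print(p_cannot_link(0.1), p_cannot_link(10.0))  # ~0.42 and ~0.02
```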
3.3
Integrating Soft Constraints in GTM
This step consists in modifying the GTM algorithm to adapt it to the soft constraints. In this model, we calculate the posterior probability of an observation xi belonging to a given neuron, knowing the parameters of the model W and β, from equation (3), which we summarize as: R(xi) = p(xi/W, β).
(18)
The idea is to incorporate the soft constraints during the learning of the map. This is done in the assignment step of GTM (when calculating the posterior probabilities over the map). We verify whether an observation is related to another one by a constraint or not; if so, its posterior probability over the neurons is influenced by the probability of the constraint (PML or PCL). Otherwise, its probability of belonging to the various neurons is computed in the same way as in GTM. Indeed, notice that, according to formula (18), the posterior probability of an observation belonging to a neuron is conditioned by the parameters of the model, W and β. This means that GTM can be seen as a constrained mixture of distributions. Thus, we propose to add another constraint when background information is presented as a soft constraint:
If we have a PML constraint between xi and xj, then we calculate

p(xi / W, β, xj),    (19)

and the probability is given by

R(xi) × PML(xi, xj) × R(xj).    (20)

For a PCL constraint between xi and xj, we calculate:

R(xi) × (1 − PCL(xi, xj)) × R(xj).    (21)
Subsequently, PrTM proceeds as follows:
- We initialize W and β and calculate the initial probabilities (E-step of EM) as done in GTM.
- In the M-step, we re-estimate the values of W and β for calculating the new probabilities. Our main modification consists in testing, for each observation xi, whether it is related to another observation xj by a constraint. If so, its posterior probability of belonging to each neuron is modified by taking into account the probability of the constraint according to (20) or (21); the probability of xj is also modified according to the same formulas. Otherwise, the probabilities are calculated as proposed by GTM, i.e. by (18).
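A minimal sketch of this modified assignment step, under our reading of equations (20) and (21): the GTM responsibilities of two constrained observations are multiplied by the constraint probability (or its complement) and by each other's responsibilities, and then each column is renormalised so that it remains a distribution, a detail the text leaves implicit. Function and variable names are ours.

```python
import numpy as np

def constrained_responsibilities(R, ml, cl):
    """R: (K, N) GTM responsibilities, columns sum to 1.
    ml, cl: dicts {(i, j): probability} for soft must-link / cannot-link pairs."""
    Rc = R.copy()
    for (i, j), p in ml.items():              # must-link: equation (20)
        Rc[:, i] = R[:, i] * p * R[:, j]
        Rc[:, j] = R[:, j] * p * R[:, i]
    for (i, j), p in cl.items():              # cannot-link: equation (21), uses 1 - p
        Rc[:, i] = R[:, i] * (1.0 - p) * R[:, j]
        Rc[:, j] = R[:, j] * (1.0 - p) * R[:, i]
    Rc /= Rc.sum(axis=0, keepdims=True)       # keep each column a probability distribution
    return Rc
```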
4
Results
In this section, we present some experimental results for validating our approach. We have chosen 6 labelled data sets from the UCI repository [6]. Their labels are used for generating constraints and for evaluation purposes, but are not visible to the clustering algorithm. They are deliberately chosen (cf. Table 1) for comparing our approach PrTM with different methods: an unconstrained probabilistic clustering method, GTM [4]; three deterministic constrained clustering algorithms (CrTM [3], COP-b-coloring [12], COP-Kmeans [22]); and two probabilistic constrained clustering methods (PKM [2], MPKM [2]). 4.1
Evaluation of PrTM
For the evaluation of the efficiency of the PrTM algorithm, we propose to use the Rand index [19]. This index measures the correspondence between two partitions Π1 and Π2 of a data set X. In our case, Π1 is the correct partition produced by the labels of the predefined classes and Π2 is the partition obtained from the PrTM algorithm. Each partition is regarded as a set of N(N − 1)/2 pairwise decisions. For each pair of observations (xi, xj), Πi assigns them to the same class or to two different classes. Assuming a is the number of decisions where xi belongs to the same class as xj in both Π1 and Π2, and b is the number of decisions where xi and xj do not belong to the same class in both Π1 and Π2, we obtain (a + b) correct decisions and the overall accuracy between Π1 and Π2 is:

Acc = Rand(Π1, Π2) = (a + b) / (N(N − 1)/2).    (22)
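Equation (22) translates directly into code. The following is a straightforward illustration of ours, counting pairwise agreements between the two partitions; it is not the evaluation script used by the authors.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of the N(N-1)/2 pairwise decisions on which the two partitions agree."""
    n = len(labels_true)
    agree = 0
    for i, j in combinations(range(n), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += (same_true == same_pred)
    return agree / (n * (n - 1) / 2)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: identical partitions up to relabelling
```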
Table 1. Characteristics of the data sets used

Data sets     N     D    #labels
Glass         214   9    6
Rings         1000  3    2
Tic-Tac-Toe   958   9    2
Heart         303   13   2
Ionosphere    351   34   2
Iris          150   4    3
We also show that knowledge brought by constraints may even improve the performance of clustering algorithms on patterns which are not constrained. Thus, besides the overall accuracy, we compute the accuracy on a held-out test set, which is a subset of the data set composed of instances that are not directly or transitively affected by the constraints. This represents a real learning performance measure, since such a held-out improvement reveals whether the algorithm managed to learn the constraints and generalize this type of knowledge so as to influence the classification of unconstrained patterns. Note that, at first, the Rand index is used both for showing the improvement over GTM clustering and for comparing PrTM with deterministic constrained clustering methods. For that, we select the maximum of the posterior probabilities for each observation when assigning it to a class (neuron) in Π2. For the "Glass" data set, without any constraint the overall accuracy (Acc) obtained with GTM is 80%. This measure increases with the incorporation of 150 random soft constraints, reaching 97%. Held-out accuracy (HO) also improves, achieving 95% (i.e. yielding an improvement of 18.75% over the baseline). For the "Rings" data set, GTM gives an overall accuracy of Acc = 75%. PrTM achieves (Acc, HO) = (89%, 88%) after 500 soft constraints (Fig. 3). We can see from these first results that integrating soft constraints allows significant improvements to the GTM model. Comparing our approach PrTM with other deterministic constrained clustering methods also shows a significant improvement in accuracy. For the "Tic-Tac-Toe" data set, COP-COBWEB achieves a low accuracy even after incorporating 500 constraints (as reported in [22]). COP-Kmeans achieves (Acc, HO) = (92%, 56%) with the same number of constraints. With PrTM, we obtain (Acc, HO) = (94%, 89%) with 700 soft constraints (Fig. 4). The best results are obtained by CrTM and COP-b-Coloring with (Acc, HO) = (96%, 91%) and (Acc, HO) = (95%, 82%) respectively (with 500 constraints). For the "Heart" data set, after incorporating 500 random constraints, COP-b-Coloring achieves (Acc, HO) = (89%, 66%), while PrTM gives (Acc, HO) = (91%, 88%) with only 300 soft constraints. The same result is also obtained by CrTM (Fig. 5).1
1 This figure shows only the Acc results; the HO results for CrTM and COP-b-Coloring can be found in [3] and [12].
Fig. 3. Evaluation of PrTM over the "Glass", "Rings", "Heart" and "Tic-Tac-Toe" data sets
In addition, PrTM is also compared with two soft constrained clustering methods (PKM: Probabilistic constrained K-Means, and MPKM: Metric and Probabilistic constrained K-Means). Table 2 compares the results for each algorithm in terms of its unconstrained and constrained performance, when provided with 25 randomly selected constraints. The evaluation of these algorithms is done on 3 UCI data sets: "Glass", "Ionosphere" and "Iris". We have added our PrTM results to those shown in [8]. The best result for each algorithm/data set combination is in bold. On the one hand, PrTM shows that integrating soft constraints in the GTM model provides a clear improvement in clustering accuracy. On the other hand, the results obtained by PrTM are similar to, and sometimes better than, those of other constrained clustering methods. In addition, PrTM produces an important smoothness in the representation of each observation over all neurons in the map under the background information represented by soft constraints.

Table 2. Average performance (Rand index) of 3 constrained clustering algorithms with 25 randomly selected constraints

Data sets     PKM (Unconst / Const)   MPKM (Unconst / Const)   PrTM (Unconst / Const)
Glass         43.4% / 68.8%           39.5% / 67.8%            80.0% / 90.0%
Ionosphere    58.8% / 58.9%           58.9% / 58.9%            66.0% / 82.3%
Iris          84.3% / 88.3%           88.0% / 91.8%            97.17% / 100%
Fig. 4. Evaluation results over the "Tic-Tac-Toe" data set. PrTM vs CrTM, COP-b-Coloring and COP-Kmeans.
4.2
Data Visualization
For comparing the GTM visualization with the PrTM one, we project a selected subset of the "Rings" data set (Fig. 6). This figure represents the posterior mean 'o' and mode '*' of this subset in the latent space. We can see in this figure that integrating soft constraints in GTM improves the data visualization. In fact, we can clearly see that the "ring" formed by the PrTM neurons is well projected over the "ring" formed by the data, whereas the one formed by the GTM neurons is worse. 4.3
Topological Maps of PrTM
In this section, we present the difference between the topological maps obtained by GTM and those obtained by PrTM. Each map represents the assignment of an observation (its posterior probability) in the latent space. For that, two observations (the 1st and the 50th) are randomly chosen from the "Glass" data set. These observations have the same label, so a PML constraint is generated between them.
Fig. 5. Evaluation results over the "Heart" data set. PrTM vs CrTM and COP-b-Coloring.
Fig. 6. Projection of a subset of the "Rings" data set in the latent space: on the left by GTM and on the right by PrTM with 200 constraints.
GTM produces the maps illustrated in Fig. 7. Note that these observations are relatively distant in the data space, which explains the difference between the two maps (GTM1 and GTM50). After integrating 200 generated soft constraints, two new maps are obtained (Fig. 7): (PrTM1, PrTM50), which are more similar than those obtained by GTM because of the PML constraint imposed between the two observations. We can also remark that their posterior probabilities have increased in some neurons. Another test is done with two other observations (the 1st and the 200th) related by a PCL constraint. These observations are relatively distant in the data space, so GTM produces the highest probabilities in distant neurons of the map. With the incorporation of the soft constraint, PrTM increases these probabilities and provides a clear difference between the two observations in the map (Fig. 8).
Fig. 7. Posterior distribution of PML-related observations (1st, 50th) by GTM (left) and PrTM (right)
Fig. 8. Posterior distribution of PCL-related observations (1st, 200th) by GTM (left) and PrTM (right)
5
Conclusion
In this work, we have presented a new approach for integrating soft constraints in probabilistic self-organizing-based clustering. For that, we have studied the properties of the GTM model, which represents a probabilistic paradigm of the SOM model. By adding some modifications to this algorithm, we could adapt it to soft constraints. Different experiments are given and have shown important results comparing our approach to GTM and to other constrained clustering methods. Many interesting issues were raised by this work and some of them are currently being developed: (1) how to appropriately weight the information contained in the data and in the constraints; (2) the number of nodes K in the latent space is assumed to be given; since we are using a probabilistic model, the idea is to use traditional criteria (AIC, BIC, MDL) to estimate K in the presence of constraints; and (3) extending this work to an incremental framework.
References 1. Basu, S., Davidson, I., Wagstaff, W.: Constrained clustering: Advances in algorithms, theory and applications. Chapman and Hall/CRC (2008) 2. Basu, S., Bilenko, M., Mooney, R.-J.: A probabilistic framework for semi-supervised clustering. In: Proceeding of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, pp. 59–68 (2004) 3. Bellal, F., Benabdeslem, K., Aussem, A.: SOM based clustering with instance level constrains. In: European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 313–318 (2008) 4. Bishop, C.M., Svens´en, M., Williams, C.-K.-I.: GTM: the Generative Topographic Mapping. Neural Computation 10(1), 215–234 (1998) 5. Bilenko, M., Basu, S., Mooney, R.-J.: Integrating constraints and metric learning in semi-supervised clustering. In: Proceeding of the twenty first international conference on machine learning, pp. 11–18 (2004)
6. Blake, C., Merz, C.: UCI repository of machine learning databases. Technical Report, University of California (1998) 7. Davidson, I., Ravi, S.-S.: The complexity of non-hierarchical clustering with instance and cluster level constraints. Data mining and knowledge discovery 14(25), 61 (2007) 8. Davidson, I., Wagstaff, K., Basu, S.: Measuring Constraint-Set Utility for Partitional Clustering Algorithms. In: F¨ urnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS, vol. 4213, pp. 115–126. Springer, Heidelberg (2006) 9. Davidson, I., Ravi, S.-S.: Clustering with constraints: feasibility issues and the kmeans amgorithm. In: Proceeding of the 2005 SIAM international conference on data mining, Newport beach, CA, pp. 138–149 (2005) 10. Davidson, I., Ravi, S.-S.: Agglomerative hierarchical clustering with constraints: theorical and empirical results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS, vol. 3721, pp. 59–70. Springer, Heidelberg (2005) 11. Dempster, A.-P., Laird, N.-M., Rubin, D.-B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal statistical society, B 39(1), 1–38 (1977) 12. Elghazel, H., Benabdelslem, K., Dussauchoy, A.: Constrained graph b-coloring based clustering approach. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 262–271. Springer, Heidelberg (2007) 13. Fisher, D.: Knowledge acquisition via incremental conceptual clustering. Machine learning 2, 139–172 (1987) 14. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988) 15. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1994) 16. Law, M., Topchy, A., Jain, A.-K.: Clustering with Soft and Group Constraints. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 662–670. Springer, Heidelberg (2004) 17. Law, M., Topchy, A., Jain, A.-K.: Model-based Clustering With Probabilistic Constraints. In: Proceedings of SIAM Data Mining, Newport Beach, CA, USA, pp. 641–645 (2005) 18. MacQueen, J.-B.: Some methods for classification and analysis of multivariate observations. In: Proceeding of the fifth symposium on Math, statistics ans probability, Berkley, CA, vol. 1, pp. 281–297 (1967) 19. Rand, W.-M.: Objective criteria for the evaluation of clustering method. Journal of the American Statistical Association 66, 846–850 (1971) 20. Shental, N., Bar-Hillel, A., Hertz, T., Weinshall, D.: Computing Gaussian mixture models with EM using equivalent constraints. In: Advances in Neural information processing systems, vol. 16 (2004) 21. Wagstaff, K., Cardie, C.: Clustering with instance level constraints. In: Proceeding of the seventeenth international conference on machine learning, pp. 1103–1110 (2000) 22. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: Proceedings of eighteenth international conference on machine learning, pp. 577–584 (2001)
Relational Frequent Patterns Mining for Novelty Detection from Data Streams

Michelangelo Ceci, Annalisa Appice, Corrado Loglisci, Costantina Caruso, Fabio Fumarola, Carmine Valente, and Donato Malerba

Dipartimento di Informatica, Università degli Studi di Bari, via Orabona, 4 - 70126 Bari - Italy
{ceci, appice, loglisci, caruso, ffumarola, malerba}@di.uniba.it, [email protected]
Abstract. We face the problem of novelty detection from stream data, that is, the identification of new or unknown situations in an ordered sequence of objects which arrive on-line, at consecutive time points. We extend previous solutions by considering the case of objects modeled by multiple database relations. Frequent relational patterns are efficiently extracted at each time point, and a time window is used to filter out novelty patterns. An application of the proposed algorithm to the problem of detecting anomalies in network traffic is described, and quantitative and qualitative results obtained by analyzing a real stream of data collected from firewall logs are reported.
1
Introduction
A data stream is an ordered sequence of data elements which arrive on-line, with no control on their order of arrival, such that once an element has been seen or processed, it cannot be easily retrieved or seen again unless it is explicitly stored in the memory [3]. Data streams are common to a variety of applications in the realm of telecommunications, networking, and real-time monitoring. The huge amount of data generated by these applications demands the development of specific data mining techniques which can effectively and efficiently discover the hidden, useful knowledge embedded within data streams. Several data stream mining algorithms have already been proposed in the literature, mainly for clustering, classification, association analysis and time series analysis [9]. Some works focus on the problem of novelty detection, i.e., identifying new or unknown situations which were never experienced before. In particular, Spinosa et al. [15] propose an incremental learning method to cluster data elements as they arrive, and identify novelties with new clusters formed over time. Ma and Perkins [11] propose to learn a regression function which reflects the normal behavior of a system and define novelties as those data elements which significantly differ from the prediction made by the regression function. Keogh et al. [10] take a different perspective on the problem and propose a method which discovers patterns whose frequency deviates from the expected value. A review of novelty detection methods is reported in [13].
Although all cited works present interesting results, they can only process data elements such that each of them is described by a feature vector. When data elements are complex objects represented by several database relations, these novelty detection algorithms cannot be directly applied, and some kind of data transformation has to be performed which may result in information loss. This observation motivates this work, whose main contribution is that of investigating the novelty detection problem in a (multi-)relational setting [8]. In particular, we propose and evaluate a novelty detection method which processes ordered sequences of objects collected at consecutive time points and described by multiple database relations. The method first discovers relational patterns [2] which are frequent at a single time point and then considers a time window to establish whether a pattern characterizes novelties or not. The proposed algorithm has been evaluated on data extracted from network connection logs. Indeed, malfunctions and malicious connections can be considered as a form of anomaly in network traffic, and their automatic detection is of great help in the daily work of network administrators. The direct representation of all packets of a connection demands a relational representation which expresses properties of both connections and packets, as well as relationships between connections and packets and relationships between packets. This relational representation was actually proposed in a previous work [5] which aimed to detect anomalies by comparing the connections entering a network firewall on one day with the connections entering the same firewall on another day (not necessarily consecutive). The comparison is based on relational emerging patterns [2] which capture differences between objects (the connections) belonging to different classes (the days) [6]. The main limitation of that previous work is the lack of a temporal dimension in the analysis, which prevents the investigation of the evolution of pattern support over time. Therefore, an additional contribution of this paper is an improved method for anomaly detection from network connection logs. The paper is organized as follows. Some definitions relevant for the formalization of the novelty detection problem are introduced in the next section, while a method that solves the problem is described in Section 3. Section 4 introduces the dataset and reports both a quantitative and a qualitative analysis of the results obtained with the proposed method. Lastly, some conclusions are drawn.
2
Problem Definition
In the relational data mining setting, data describing complex objects are scattered over multiple tables of a relational database D. Let S be the schema of D. We assume that S includes the definition of a table TR , named target table, which stores properties (or attributes) of a set R of reference (or target ) objects. These are the main subject of analysis and there is a unit of analysis for each reference object. The support of discovered patterns is computed as the number of reference objects which satisfy the conditions expressed in the pattern. For instance, in the application to novelty detection from network connection
logs, the reference objects are the connections, since novelty patterns refer to connections. We also assume S includes a number of additional (non-target) tables TTi, such that each TTi stores attributes of a set Ri of task-relevant objects. These contribute to define the units of analysis and are somehow related to the reference objects, but they are not the main subject of analysis. In the application to network traffic analysis, packets play the role of task-relevant objects and each unit of analysis includes all packets of a connection. The "structure" of units of analysis, that is, the relationships between reference and task-relevant objects, is expressed in the schema S by foreign key constraints (FK). Foreign keys make it possible to navigate the data schema and retrieve all the task-relevant objects in D which are related to a reference object.
Definition 1 (Unit of Analysis). A unit of analysis D(o) consists of the reference object o ∈ TR and all task-relevant objects in D that are related to o according to foreign key constraints.
In this work, units of analysis are associated with time points. More precisely, if τ is a sequence of consecutive and discrete time points and ⪯ is a total order relation defined on τ, we associate each unit of analysis D(oi) with a time point ti ∈ τ. Therefore, the input data is a series of time-stamped units of analysis, DS = {⟨D(o1), t1⟩, ⟨D(o2), t2⟩, . . . , ⟨D(on), tn⟩}, where ti ⪯ ti+1. It is important to observe that several units of analysis can be associated with the same time point. This allows us to compute the support of a relational pattern at a specific time point. In order to formalize the concept of relational pattern, we define three types of predicates, namely key, structural and property predicates.
Definition 2 (Key Predicate). The "key predicate" associated with the target table TR in S is a unary predicate p(t) such that p denotes the table TR and the term t is a variable that represents the primary key of TR.
Definition 3 (Property Predicate). A property predicate is a binary predicate p(t, s) associated with the attribute ATT of the table Ti. The name p denotes the attribute ATT, the term t is a variable representing the primary key of Ti and s is a constant which represents a value belonging to the range of ATT in Ti.
Definition 4 (Structural Predicate). A structural predicate is a binary predicate p(t, s) associated with a pair of tables Tj and Ti, with Tj and Ti related by a foreign key FK in S. The name p denotes FK, while the term t (s) is a variable that represents the primary key of Tj (Ti).
A relational pattern is defined as follows:
Definition 5 (Relational Pattern). A relational pattern P over the schema S is a conjunction of predicates:
p0(t10), p1(t11, t21), p2(t12, t22), . . . , pm(t1m, t2m), where p0(t10) is the key predicate associated with the table TR and pi(t1i, t2i), i = 1, . . . , m, is either a structural predicate or a property predicate over S. In this work we also use the set notation of relational patterns, i.e., the conjunction p0(t10), p1(t11, t21), p2(t12, t22), . . . , pm(t1m, t2m) is represented as the set {p0(t10), p1(t11, t21), p2(t12, t22), . . . , pm(t1m, t2m)}. The two representations are slightly different (neither sequential ordering nor multiple occurrences of atoms are relevant in the set notation), but in this work these differences are not meaningful. The support of a relational pattern P can be computed at a specific time point t as follows:

suppt(P) = |{D(o) | ⟨D(o), t⟩ ∈ DS, ∃θ : Pθ ⊆ D(o)}| / |{D(o) | ⟨D(o), t⟩ ∈ DS}|,    (1)

where θ is a substitution of variables into constants and Pθ denotes the application of the substitution θ to the pattern P. Therefore, we define a relational pattern P as frequent with respect to a minimum support threshold minSupp if a time point t ∈ τ exists such that suppt(P) ≥ minSupp. The notion of frequent relational pattern allows us to define a novelty pattern.
Definition 6 (Novelty Pattern). Let
– W(i, w) = ti, ti+1, . . . , ti+w be a time window, i.e., a subsequence of w consecutive time points in τ (i + w ≤ |τ|);
– P be a relational pattern that is frequent in at least one time point ti in τ according to a user-defined threshold minSupp, i.e. ∃i ∈ τ, suppti(P) ≥ minSupp;
– ΘP : [0, 1] → Ψ be a discretization function which associates a support value of P in the interval [0, 1] with a discrete value ψ ∈ Ψ.
Then, P is a novelty pattern for the time window W(i, w) if and only if:

Θ(suppti(P)) = . . . = Θ(suppti+w−1(P)) ≠ Θ(suppti+w(P)).    (2)
Intuitively, a pattern P characterizes novelty in a time window W(i, w) if it has approximately the same support for all time points in W(i, w), except for the last one. Therefore, novelty detection depends on two user-defined parameters: the minimum support (minSupp) and the size (w) of the time window. The novelty detection problem can be formalized as follows.
Given:
– a sequence of consecutive and discrete time points τ;
– a series of time-stamped units of analysis DS = {⟨D(o1), t1⟩, ⟨D(o2), t2⟩, . . . , ⟨D(on), tn⟩}, ti ∈ τ, i = 1, 2, . . . , n, derived from a database D with a target table TR and m non-target tables TTi;
– a minimum support threshold minSupp;
– a time window size w;
Find the sets NPW(i,w) of novelty patterns associated with the time windows W(i, w), i = 1, 2, . . . , |τ| − w. An algorithmic solution to this problem is presented in the next section.
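Definition 6 translates into a simple per-window test once the support values of a pattern are available. The sketch below is ours: it assumes the equal-width discretization later adopted for ΘP with |Ψ| = 5 bins, and, for simplicity, checks the frequency condition only on the supports inside the window.

```python
def discretize(support, bins=5):
    """Equal-width discretization of a support value in [0, 1] (the function Theta)."""
    return min(int(support * bins), bins - 1)

def is_novelty(supports, w, min_supp=0.1, bins=5):
    """supports: supp_t(P) at the time points t_i ... t_{i+w} of a window W(i, w).
    The pattern must be frequent somewhere, keep the same discretized support on the
    first w points, and change it on the last one (Definition 6)."""
    assert len(supports) == w + 1
    if max(supports) < min_supp:
        return False
    levels = [discretize(s, bins) for s in supports]
    return len(set(levels[:-1])) == 1 and levels[-1] != levels[0]

print(is_novelty([0.42, 0.45, 0.44, 0.12], w=3))   # True: the support drops at the last point
```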
3
Novelty Pattern Discovery
The proposed solution consists of two phases. In the first phase, relational patterns are mined, while in the second phase they are filtered in order to keep only those which represent a novelty according to Definition 6. The relational pattern discovery is performed by exploring level-by-level the lattice of relational patterns ordered according to a generality relation (⪯) between patterns. Formally, given two patterns P1 and P2, P1 ⪯ P2 denotes that P1 (P2) is more general (more specific) than P2 (P1). Hence, the search proceeds from the most general pattern and iteratively alternates the candidate generation and candidate evaluation phases as in the levelwise method [12]. Candidate novelty patterns are searched in the space of linked relational patterns, which is structured according to the θ-subsumption generality order [14].
Definition 7 (Key Linked Predicate). Let P = p0(t10), p1(t11, t21), . . . , pm(t1m, t2m) be a relational pattern over the database schema S. For each i = 1, . . . , m, the (structural or property) predicate pi(t1i, t2i) is key linked in P if
– pi(t1i, t2i) is a predicate with t10 = t1i or t10 = t2i, or
– there exists a structural predicate pj(t1j, t2j) in P such that pj(t1j, t2j) is key linked in P and t1i = t1j ∨ t2i = t1j ∨ t1i = t2j ∨ t2i = t2j.
Definition 8 (Linked Relational Pattern). Let S be a database schema. Then P = p0(t10), p1(t11, t21), . . . , pm(t1m, t2m) is a linked relational pattern if ∀i = 1 . . . m, pi(t1i, t2i) is a predicate which is key linked in P and two structural predicates do not insist on the same foreign key.
Definition 9 (θ-subsumption). Let P1 and P2 be two linked relational patterns on a data schema S. P1 θ-subsumes P2 if and only if a substitution θ exists such that P2θ ⊆ P1.
Having introduced θ-subsumption, the generality order between linked relational patterns can be formally defined.
Definition 10 (Generality Order Under θ-subsumption). Let P1 and P2 be two linked relational patterns. P1 is more general than P2 under θ-subsumption, denoted as P1 ⪯θ P2, if and only if P2 θ-subsumes P1.
Example 1. Let us consider the linked relational patterns:
P1: connection(C).
P2: connection(C), packet(C,P).
P3: connection(C), service(C,'http').
P4: connection(C), packet(C,P), starting_time(P,8).
P5: connection(C), packet(C,P), next(I,P,Q).
P6: connection(C), packet(C,P), next(I,P,Q), distance(I,35).
Then it can be proved that the patterns are ordered as follows: P1 ⪯θ P2, P1 ⪯θ P3, P1 ⪯θ P4, P1 ⪯θ P5, P1 ⪯θ P6, P2 ⪯θ P4, P2 ⪯θ P5, P2 ⪯θ P6, P5 ⪯θ P6.
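To make Definitions 9 and 10 concrete, the following brute-force sketch (ours; real systems use far more efficient matching) tests whether one pattern is more general than another under θ-subsumption. Patterns are represented as sets of atoms, and capitalised string arguments are treated as variables; the representation and names are our own assumptions.

```python
def is_more_general(P, Q):
    """True iff P is more general than Q (Definition 10), i.e. a substitution theta
    exists with P*theta a subset of Q. Atoms are tuples (predicate, arg, ...)."""
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match(atoms, theta):
        if not atoms:
            return True
        head, rest = atoms[0], atoms[1:]
        for cand in Q:                                  # try to map head onto some atom of Q
            if cand[0] != head[0] or len(cand) != len(head):
                continue
            new_theta, ok = dict(theta), True
            for a, c in zip(head[1:], cand[1:]):
                if is_var(a):
                    if new_theta.setdefault(a, c) != c:  # conflicting binding
                        ok = False
                        break
                elif a != c:                             # constants must match exactly
                    ok = False
                    break
            if ok and match(rest, new_theta):
                return True
        return False

    return match(sorted(P, key=lambda atom: (atom[0], len(atom))), {})

P2 = {("connection", "C"), ("packet", "C", "P")}
P4 = {("connection", "C"), ("packet", "C", "P"), ("starting_time", "P", 8)}
print(is_more_general(P2, P4), is_more_general(P4, P2))   # True False, as in Example 1
```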
θ-subsumption defines a quasi-ordering, since it satisfies the reflexivity and transitivity properties but not the anti-symmetry property. The quasi-ordered set of patterns in Example 1 is structured as follows: P1 is the most general pattern and is directly refined by P2 and P3; P2 is directly refined by P4 and P5; and P5 is directly refined by P6.
It can be searched according to a downward refinement operator which computes the set of refinements for a completely linked relational pattern. Definition 11 (Refinement Operator Under θ-subsumption). Let G, θ be the space of linked relational patterns ordered according to θ . A (downward) refinement operator under θ-subsumption is a function ρ : G → G such that ρ(P ) ⊆ {Q ∈ G|P θ Q}. In particular, the downward refinement operator ρ used in this work is defined as follows. Definition 12 (Downward Refinement Operator). Let P be a linked relational pattern. Then ρ (P ) = {P ∪ {p(t1 , t2 )}|p(t1 , t2 ) is a structural or property predicate key linked in P ∪ {p(t1 , t2 )}}. We observe that in order to return a set of linked relational patterns, the predicate p(t1 , t2 ) added to a pattern P by ρ should not insist on the same foreign key of another structural predicate in P . It can be proved that ρ is a refinement operator under θ-subsumption, i.e., P θ Q for all Q ∈ ρ (P ). The refinement operator ρ allows for a levelwise exploration of the quasiordered set of linked relational patterns. Indeed, the implemented algorithm starts from a set ℘ containing only the most general pattern, i.e. the pattern that contains only the key predicate, and then updates ℘ by repeatedly applying ρ to all patterns in ℘. For each candidate pattern P , the support suppti (P ) is computed at each discrete time point ti . In generating each level of the quasi-ordered set, the candidate pattern search space is represented as a set of enumeration trees (SE-trees)[17]. The idea is to impose an ordering on atoms such that all patterns in the search space are enumerated. Practically, a node g of a SE-tree is represented as a group comprising: the head (h(g)), i.e. the pattern enumerated at g, and the tail (t(g)) that is the ordered set consisting of all atoms which can be potentially appended to g by ρ in order to form a pattern enumerated by some sub-node of g. A child gc of g is formed by taking an atom q ∈ t(g) and appending it to h(g). Therefore, t(gc ) contains all atoms in t(g) that follows q (see Figure 1). In the case q is a structural predicate (i.e., a new relation is introduced in the pattern), t(gc ) contains both
Fig. 1. The enumeration tree over the atoms A = {a, b, c} to search the atomsets a, b, c, ab, ac, bc, abc
atoms in t(g) that follow q and new atoms directly linkable to q according to ρ′ not yet included in t(g). Given this child expansion policy, without any pruning of nodes or patterns, the SE-tree enumerates all possible patterns and prevents the generation and evaluation of candidates equivalent under θ-subsumption to some other candidate. As a pruning criterion, the monotonicity property of the generality order ⪯θ with respect to the support value (i.e., a superset of an infrequent pattern cannot be frequent) [1] can be exploited to avoid the generation of infrequent relational patterns. Let P′ be a refinement of a pattern P. If P is an infrequent pattern (∀ti ∈ τ, suppti(P) < minsup), then P′ has a support that is always lower than the user-defined threshold (minsup) for each ti ∈ τ. According to the definition of novelty pattern, P′ cannot be "novel". This means that it is possible to avoid the refinement of patterns which are infrequent. An additional pruning criterion stops the search when a maximum number of literals (MaxNumLiterals) has been added to a novelty pattern, where MaxNumLiterals is a user-defined parameter. Once patterns are extracted, they are further processed in order to identify novelty patterns according to Definition 6. In this work, the function ΘP is the classical equal-width discretization function [7].
4
Experiments
The method to discover (relational) novelty patterns has been applied to anomaly detection on network connection logs which are recorded on consecutive days (each day represents a discrete time point). In this context, a unit of analysis is described in terms of accepted ingoing connections (reference objects), packets (task-relevant objects) and the relations "connections/packets" and "packets/packets". The reason for considering only ingoing connections is that we are ultimately interested in discovering possible attacks on network services, which are assumed to come from outside. In the experiments reported in this section, parameters are set as follows: Ψ includes only five values (i.e., ΘP discretizes the support into five bins), minsup = 0.1 and MaxNumLiterals = 5.
4.1 Dataset Description
Experiments concern 28 successive days of firewall logs of our University Department, from June 1st to June 28th, 2004 [4]. Each log is mapped into a relational database (Oracle 10g). A connection is described by:
– the identifier (integer);
– the protocol (nominal), which has only two values (udp and tcp);
– the starting time (integer), that is, the starting time of the connection;
– the destination (nominal), that is, the IP of department public servers;
– the service (nominal), that is, the requested service (http, ftp, smtp and many other ports);
– the number of packets (integer), that is, the number of packets transferred within the connection;
– the average packet time distance (integer), that is, the average distance between packets within the connection;
– the length (integer), that is, the time length of the connection;
– the nation code (nominal), that is, the nation the source IP belongs to;
– the nation time zone (integer), that is, the time zone description of the source IP.
The source IP is represented by four groups of three digits and each group is stored in a separate attribute (nominal).
Each packet is described by the identifier (integer) and the starting time (number) of the packet within the connection. The interaction between consecutive packets is described by their time distance. Numeric attributes are discretized through an unsupervised equal-width discretization that partitions the range of values into a fixed number (i.e., 10) of bins. The relation “connections/packets” indicates that a packet belongs to a connection, while the relation “packets/packets” represents the temporal distance between two packets within the same connection. The considered database collects 380,733 distinct connections, 651,037 packets, 270,304 relations “packets/packets” and 651,037 relations “connections/packets”.
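For reference, the unsupervised equal-width discretization used here can be sketched as follows (a generic illustration; the helper itself is not taken from the paper):

def equal_width_bins(values, n_bins=10):
    """Map each numeric value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # guard against a constant attribute
    # values equal to the maximum fall into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# e.g. the "number of packets" attribute would be replaced by bin indices 0..9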
4.2 Analysis of Results
Quantitative results are reported in Table 1, where the number of novelty patterns for different time windows is shown. As expected, the number of discovered patterns decreases as the window size increases (w = 3, . . . , 6), since the patterns found in a time window also belong to the set of patterns extracted for smaller time windows. Interestingly, the number of patterns extracted for each time window is rather large. This is due to the high number of similar extracted patterns. In fact, in most cases, the system extracts patterns that are related to each other according to the θ-subsumption generality order (one is the specialization of the other). However, the number of discovered novelty patterns decreases significantly for w = 6, where the average number of patterns extracted for each time point is less than 60. This makes it possible to manually analyze the patterns.
Table 1. Number of discovered relational novelty patterns. Results are obtained with different W(i, w); i = 1, . . . , 28, while w = 3, . . . , 6 (a dash marks time points for which no window of size w is available)

Time point    w=3     w=4     w=5     w=6
     1          -       -       -       -
     2          -       -       -       -
     3         52       -       -       -
     4        333      26       -       -
     5        108      78      12       -
     6         38       5       5       5
     7        472     281      13       4
     8          7       3       3       1
     9        145       2       0       0
    10        147     114      59      55
    11         84      20      36       4
    12        315     226     170     160
    13        202     134     110     108
    14        164      22      13      13
    15        148      81      31      21
    16         99      10       1       0
    17         56      26      24      24
    18        481     371     234     144
    19        200     198     198     157
    20        369     357     352     352
    21        381      49      45      40
    22        310     234     100      96
    23        107      63      63      59
    24        114      32      12      12
    25        447     351      39      29
    26         79      27      25      19
    27        142      34      30      30
    28        224     142      34      30
Total No. of novelty patterns      5224    2886    1609    1363
Average No. of novelty patterns  200.92  115.44   67.04   59.26
A more interesting analysis can be performed by considering a graphical representation of the same results (see Figure 2), where it is possible to notice the smoothing of the peaks in the novelty patterns/time point histogram as the window size increases. In particular, while for w = 3 the cardinality of NP_W(i,w) presents a high variance over the different time points, this is somewhat mitigated by increasing values of w. This would help the user to identify and analyze critical days, when attacks may have occurred. Figure 2 shows that there are several critical time points (days) when w = 3 and fewer when w = 6. In particular, the days where the number of extracted novelty patterns is greater than 200 are:
– 4, 7, 12, 13, 18, 19, 20, 21, 22, 25, 28 when w = 3;
– 7, 12, 18, 20, 22, 25 when w = 4;
– 18, 20 when w = 5; and
– 20 when w = 6.
According to a manual analysis performed by the network administrator, it turned out that on June 20th, 2004 (a Sunday) there were attacks which masked the requested service (or port). In particular, there were 1455 connections (twice the number of http connections) characterized by an “unknown” service. In contrast, there was no connection with “unknown” service on the previous day. A qualitative evaluation confirms this analysis. In fact, the following novelty pattern is extracted by the algorithm:

P1: connection(C), packet(C, P), service(C, “unknown”).

since its support on June 20th is in the interval [0.428; 0.535] while in the previous days its support is in the interval [0.0; 0.107] (this is a novelty pattern for W(20, 3), W(20, 4), W(20, 5), W(20, 6)). P1 states that a connection C with at least one packet P and with unknown service could be considered as an anomaly. Another example of an extracted novelty pattern is the following:

P2: connection(C), packet(C, P), destination(C, “XXX.XXX.XXX.127”).

P2 is characterized by a support value of 0.119 on June 18th, 2004, while its support is in the interval [5.89 · 10^−4; 0.024] in the previous days (this is a novelty pattern for W(18, 6) and, thus, for W(18, 3), W(18, 4), W(18, 5)). P2 states that a connection C with at least one packet P and with destination IP address “XXX.XXX.XXX.127” (the complete IP address is not specified for privacy reasons) could be considered as an anomaly. The following pattern is obtained by specializing P2:

P3: connection(C), packet(C, P), destination(C, “XXX.XXX.XXX.127”), nationcode(C, “IT”).

P3 is characterized by a support value of 0.115 on June 18th, 2004, while its support is in the interval [2.48 · 10^−5; 0.023] in the previous days (this is a novelty pattern for W(18, 6)). An example of a pattern which takes into account the relational nature of the data is the following:

P4: connection(C), packet(C, P), packet_time(P, “[34559; 43199]”), packet_to_packet(P, Q).

P4 is characterized by a support value of 0.091 on June 20th, 2004, while its support is in the interval [0.003; 0.066] in the previous days (this is a novelty pattern for W(20, 6)). This pattern states that a connection C with at least two packets P and Q, where P is sent after a relatively long time with respect to the start of the connection (between 34,559 and 43,199 ms), could be considered as an anomaly.

Fig. 2. Distribution of discovered relational novelty patterns. Results are obtained with different W(i, w); i = 1, . . . , 28 and w = 3, . . . , 6
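Definition 6 is not reproduced in this excerpt, but the examples above suggest the shape of the final test: a pattern is reported for W(i, w) when its discretized support at time point i differs from the support levels observed at the previous time points of the window. The sketch below implements that reading; it is only a plausible approximation of the actual criterion.

def is_novelty(pattern_supports, i, w, discretize):
    """pattern_supports: dict time_point -> support of the pattern.
    discretize: the Theta_P function mapping a support value to one of the
    Psi levels (five equal-width bins in the experiments)."""
    previous = range(i - w + 1, i)                 # earlier time points of W(i, w)
    past_levels = {discretize(pattern_supports[t]) for t in previous}
    return discretize(pattern_supports[i]) not in past_levels

# Example: a pattern with supports around 0.0-0.107 on days 18-19 and around
# 0.43-0.54 on day 20 would be flagged for W(20, 3) under five bins over [0, 1].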
5 Conclusions
In this paper, we face the problem of discovering novelties from data streams and we propose an algorithm whose peculiarity is that it works on data represented in the form of complex objects possibly stored in several tables of a relational database. The algorithm uses a time window in order to establish whether a pattern expresses a novelty or not. Discovered novelty patterns are expressed in a first-order logic formalism. The algorithm is applied to real network traffic data in order to solve a problem of anomaly detection and thus support the control activity of a network administrator. Both quantitative (i.e., the number of extracted novelty patterns) and qualitative (i.e., the novelty patterns themselves) results proved the effectiveness of the proposed approach in detecting possible malicious attacks. By increasing the size of the time window, the number of discovered novelty patterns decreases and, thus, it is possible to simplify the manual analysis of extracted patterns by the expert (the network administrator). As future work, we intend to cluster similar patterns according to syntactic or semantic distance measures [16] in order to further simplify the analysis of extracted novelty patterns by the expert, who can then focus his/her attention on only a few groups. Moreover, we plan to develop an incremental novelty pattern discovery algorithm in order to face scalability issues.
Acknowledgments

This work is supported by the Strategic Project PS121: “Telecommunication Facilities and Wireless Sensor Networks in Emergency Management”.
References 1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) International Conference on Management of Data, pp. 207–216 (1993) 2. Appice, A., Ceci, M., Malgieri, C., Malerba, D.: Discovering relational emerging patterns. In: Basili, R., Pazienza, M.T. (eds.) AI*IA 2007. LNCS (LNAI), vol. 4733, pp. 206–217. Springer, Heidelberg (2007) 3. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: PODS 2002: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 1–16. ACM, New York (2002) 4. Caruso, C., Malerba, D., Papagni, D.: Learning the daily model of network traffic. In: Hacid, M.-S., Murray, N.V., Ras, Z.W., Tsumoto, S. (eds.) ISMIS 2005. LNCS, vol. 3488, pp. 131–141. Springer, Heidelberg (2005)
5. Ceci, M., Appice, A., Caruso, C., Malerba, D.: Discovering emerging patterns for anomaly detection in network connection data. In: An, A., Matwin, S., Ras, Z.W., Slezak, D. (eds.) ISMIS 2008. LNCS, vol. 4994, pp. 179–188. Springer, Heidelberg (2008) 6. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences. In: International Conference on Knowledge Discovery and Data Mining, pp. 43–52. ACM Press, New York (1999) 7. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Prieditis, A., Russell, S. (eds.) Proceedings of the Twelfth International Conference on Machine Learning, pp. 194–202 (1995) 8. Dˇzeroski, S., Lavraˇc, N.: Relational Data Mining. Springer, Heidelberg (2001) 9. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. SIGMOD Rec. 34(2), 18–26 (2005) 10. Keogh, E., Lonardi, S., Chiu, B.Y.-C.: Finding surprising patterns in a time series database in linear time and space. In: KDD 2002: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 550–556. ACM, New York (2002) 11. Ma, J., Perkins, S.: Online novelty detection on temporal sequences. In: KDD 2003: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 613–618. ACM, New York (2003) 12. Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1(3), 241–258 (1997) 13. Markou, M., Singh, S.: Novelty detection: a review—part 1: statistical approaches. Signal Process. 83(12), 2481–2497 (2003) 14. Plotkin, G.D.: A note on inductive generalization. Machine Intelligence 5, 153–163 (1970) 15. Spinosa, E.J., de Carvalho, A.P.d.L.F., Gama, J.: Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In: SAC 2008: Proceedings of the 2008 ACM symposium on Applied computing, pp. 976–980. ACM, New York (2008) 16. Tsumoto, S., Hirano, S.: Visualization of similarities and dissimilarities in rules using multidimensional scaling. In: Hacid, M.-S., Murray, N.V., Ra´s, Z.W., Tsumoto, S. (eds.) ISMIS 2005. LNCS, vol. 3488, pp. 38–46. Springer, Heidelberg (2005) 17. Zhang, X., Dong, G., Kotagiri, R.: Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets. In: Knowledge Discovery and Data Mining, pp. 310–314 (2000)
A Comparative Study of Outlier Detection Algorithms Charlie Isaksson and Margaret H. Dunham Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, USA {charliei, mhd}@lyle.smu.edu
Abstract. Data Mining is the process of extracting interesting information from large sets of data. Outliers are defined as events that occur very infrequently. Detecting outliers before they escalate with potentially catastrophic consequences is very important for various real-life applications such as fraud detection, network robustness analysis, and intrusion detection. This paper presents a comprehensive analysis of three outlier detection methods, Extensible Markov Model (EMM), Local Outlier Factor (LOF) and LSC-Mine, covering both time complexity and outlier detection accuracy. The experiments conducted with the Ozone Level Detection, IR video trajectories, and 1999 and 2000 DARPA DDoS datasets demonstrate that EMM outperforms both LOF and LSC-Mine in both time and outlier detection accuracy. Keywords: anomaly, detection, data mining, outlier.
1 Introduction

Recently, outlier detection has gained an enormous amount of attention and become one of the most important problems in many industrial and financial applications. Supervised and unsupervised learning techniques are the two fundamental approaches to the problem of outlier detection. Supervised learning approaches build models of normal data and detect deviations from the normal model in observed data. The advantage of these types of outlier detection algorithms is that they can detect new types of activity as deviations from normal usage. In contrast, unsupervised outlier detection techniques identify outliers without using any prior knowledge of the data. It is essential for outlier detection techniques to detect sudden or unexpected changes in existing behavior as soon as possible. Assume, for example, the following three scenarios:
1. A network alarm is raised indicating a possible attack. The associated network traffic differs from the normal network traffic. The security analyst discovers that the enormous traffic is not produced from the Internet, but from its Local Area Network (LAN). This scenario is characterized as the zombie effect in a Distributed Denial of Service (DDoS) attack [1], where the LAN is utilized in the DDoS attack to deny the services of a targeted network. It also means that the LAN had been compromised long before the discovery of the DDoS attack.
2. Computer systems in a LAN provide services that correspond to certain types of behavior. If a new service is started without the system administrator's permission, then it is extremely important to set an alarm and discover suspicious activities as soon as possible in order to avoid disaster.
3. Video surveillance [2] is frequently encountered in commercial, residential or military buildings. Finding outliers in the video data involves mining massive, automatically collected surveillance video databases to retrieve the shots containing independently moving targets. The environment where it operates is often very noisy.
Today it is not news that the ozone layer is getting thinner and thinner [29]. This is harmful to human health, and affects other important parts of our daily life, such as farming, tourism, etc. Therefore, an accurate ozone alert forecasting system would facilitate the issuance of warnings to the public at an early stage, before the ozone reaches a dangerous level.
One recent approach to outlier detection, Local Outlier Factor (LOF) [5], is based on the density of data close to an object. This algorithm has proven to perform well, but suffers from some performance issues. In this paper we compare the performance of LOF and one of its extensions, LSC-Mine [6], to the use of our previously proposed modeling tool, Extensible Markov Model (EMM) [3]. This comparative study examines these three outlier algorithms and reports their time and detection performance. Extensible Markov Model (EMM) is a spatiotemporal modeling technique that interleaves a clustering algorithm with a first-order Markov Chain (MC) [4], where at any point in time EMM can provide a high-level summary of the data stream. Local Outlier Factor (LOF) [5] is an unsupervised density-based algorithm that assigns to each object a degree of being an outlier. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. LSC-Mine [6] was constructed to overcome the disadvantages of the LOF technique proposed earlier. In the next section an overview of outlier detection techniques is given. Section 3 examines the results of the experiments performed. We conclude the paper in Section 4.
2 Outlier Detection Techniques 2.1 Related Work Different approaches and methodologies has been introduced to address the outlier/anomaly detection problem; they vary between statistical approaches, supervised and unsupervised learning techniques, neural networks and machine learning techniques. We can not provide a complete survey here, but refer the interested reader to available surveys [34], [35], [36]. We briefly mention some representative techniques. Grubbs’ method (extreme studentized deviate) [7] introduced a one dimensional statistical method in which all parameters are derived from the data, it requires no users’ parameters. It calculates the mean and standard deviation from all attribute values, and then it calculates Z value as the difference between the mean value for the attribute and the query value divided by the standard deviation for the attribute, then the Z value for the query is compared with a 1% or
5% significance level. An optimized k-NN was introduced by Ramaswamy et al. [8]; it gives a list of potential outliers and their ranking. In this approach the entire distance matrix needed to be calculated for all the points, but the authors introduced a partitioning technique to speed the k-NN algorithm up. Another outlier/anomaly detection set of approaches are the Neural Networks ones. They are non-parametric and model based in general. They require both training and testing to determine the threshold and be able to classify the data. Most of them also suffer when the data has high dimensionality. Nairac et al. [9] and Bishop [10] identify novelties in time-series data for fault diagnosis in vibration signatures of aircraft engines and Bishop monitors processes such as oil pipeline flows. They both use a supervised neural network (multi-layer perception), which is a feed forward network with a single hidden layer, where hidden layer is generally used to make a bottleneck, forcing the network to make a simple model of the system generating the data with the ability to generalise to previously unseen patterns, which according to the authors provides improvement in the performance over the linear technique. Japkowicz et al. [11] use an autoassociative neural network which is also a feedforward perceptron-based network which uses supervised learning. Caudell and Newman [12] introduced a detection technique for time series monitoring based on the Adaptive Resonance Theory (ART) (Carpenter and Grossberg) [13] incremental unsupervised neural network. An approach that works well with high dimensional data is using decision trees as in John [14] and Skalak and Rissland [15] where they use a C4.5 decision tree to detect outliers in categorical data to identify unexpected entries in databases. They pre-select cases using the taxonomy from a case-based retrieval algorithm to prune outliers, and then use these cases to train the decision tree. Lane and Brodley [16], [17] introduced an approach that uses similarity-based matching for monitoring activities. 2.2 Extensible Markov Model Extensible Markov Model (EMM) [3] takes the advantage of distance-based clustering for spatial data as well as that of the Markov chain for temporality. And as proved in our previous work [33] EMM achieves an efficient modeling by mapping groups of closely located real world events to states of a Markov chain. EMM is thus an extension to the Markov chain. EMM uses clustering to obtain representative granules in the continuous data space. Also by providing a dynamically adjustable structure, EMM is applicable to data stream processing when the number of states is unknown in advance and provides a heuristic modeling method for data that hold approximation of the Markov property. The nodes in the graph are clusters of real world states where each real world state is a vector of sensor values. The EMM defines a set of formalized procedures such that at any time t, EMM consists of a Markov Chain (MC) and algorithms to modify it, where algorithms include: 1.
EMMCluster defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. This is a clustering algorithm which determines whether the new data point or event should be added to an existing cluster (MC state) or whether a new cluster (MC state) should be created. A distance threshold th is used in clustering.
2. EMMIncrement updates (as well as adds, deletes, and merges) the MC at time t + 1 given the MC at time t and the output of EMMCluster at time t + 1.
3. EMMapplications are algorithms which use the EMM to solve various problems. To date we have examined EMM for prediction (EMMPredict) [3] and anomaly (rare, outlier event) detection (EMMRare) [18].
Throughout this paper, EMM is viewed as a directed graph with nodes and links. Link and transition are used interchangeably to refer to a directed arc; node, state, and cluster are used interchangeably to refer to a vertex in the EMM. The first two phases are used to model the data. The third phase is used to perform applications based on the synopsis created in the modeling process. The synopsis includes information on cluster features [19] and transitions between states. The cluster feature defined in [19] includes at least a count of occurrence, CNi (count on the node), and either a medoid or centroid for that cluster, LSi. To summarize, the elements of the synopsis of an EMM are listed in Table 1. The frequency-based anomaly detection [18], one of the several applications of EMM, is the one compared with the LOF and LSC-Mine algorithms. The idea for outlier detection comes from the fact that the learning aspect of EMM dynamically creates a Markov chain and captures past behavior stored in the synopsis. No input into the model identifies normal or abnormal behavior – instead this is learned based on the statistics of occurrence of transitions and states within the generated Markov chain. By learning what is normal, the model can predict what is not. The basic idea is to define a set of rules related to the cardinalities of clusters and transitions to judge outliers. An outlier is detected if an input event (or data point), Et, is determined not to belong to any existing cluster (state in EMM), if the cardinality of the associated cluster (CNn) is small, or if the transition (CLij) from the current state, i, to the new state, j, is small. When any of the predefined rules is met, a Boolean alarm is set to indicate the capture of an outlier.

Table 1. Notations of EMM Elements

Ni      The ith EMM node, labeled by CNi and LSi
CNi     Count of occurrences of data points found in the cluster (EMM node or EMM state) Ni
LSi     A vector representing the representative data point of the cluster, usually the centroid or medoid of the cluster
Lij     The directed link from Ni to Nj, labeled by CLij
CLij    Count of occurrences of the directed link from Ni to Nj
m       Number of EMM states
n       Number of attributes in the vector representing a data point, or dimensions of the data space
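As an illustration of how the synopsis elements of Table 1 are used, the following Python sketch builds an EMM-like model incrementally and raises the frequency-based alarm described above. The Euclidean matching, the threshold th and the rarity thresholds are illustrative choices, not the authors' Java implementation.

import math

class SimpleEMM:
    """Illustrative EMM-style synopsis: clusters (states) plus transition counts."""

    def __init__(self, th, min_state_count=2, min_link_count=2):
        self.th = th                      # clustering threshold used by EMMCluster
        self.centroids = []               # LS_i of each state N_i
        self.state_counts = []            # CN_i
        self.link_counts = {}             # CL_ij, keyed by (i, j)
        self.current = None               # index of the current state
        self.min_state_count = min_state_count
        self.min_link_count = min_link_count

    def _nearest(self, x):
        best, best_d = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)           # Euclidean distance to the centroid
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def process(self, x):
        """EMMCluster + EMMIncrement + EMMRare for one event; returns True on alarm."""
        alarm = False
        i, d = self._nearest(x)
        if i is None or d > self.th:      # no existing state is close enough
            self.centroids.append(list(x))
            self.state_counts.append(0)
            i = len(self.centroids) - 1
            alarm = True                  # rule 1: event starts a brand-new state
        self.state_counts[i] += 1
        if self.current is not None:
            key = (self.current, i)
            self.link_counts[key] = self.link_counts.get(key, 0) + 1
            # rule 2 / rule 3: rarely visited state or rarely taken transition
            if (self.state_counts[i] < self.min_state_count
                    or self.link_counts[key] < self.min_link_count):
                alarm = True
        self.current = i
        return alarm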
2.3 Density Based Local Outliers (LOF Approach)

Local Outlier Factor (LOF) [5] identifies outliers in a multidimensional dataset. This method does not only identify whether a point is an outlier or not, it also determines its degree of outlier-ness. Defining outliers locally means that only a certain neighborhood surrounding the object is taken into consideration. Identifying local outliers is important since the dataset can host data of versatile characteristics that may contain complex data structures for which global outlier identification is not sufficient. Therefore, in these cases it is more meaningful to decide on the possibility of an object being an outlier on a local scope.

Fig. 1. Advantages of the LOF approach. Modified from [5]

We illustrate the advantages of the LOF approach by observing the clusters C1 and C2 in Figure 1 [5]: cluster C1 contains a much larger number of examples q than cluster C2; however, the density of cluster C2 is significantly higher than that of C1, due to the close distance between each example q and its nearest neighbor. Using simple nearest neighbor approaches, P1 may be detected as an outlier while P2 may not be. LOF is able to capture both outliers (P1 and P2) because it considers the density around the points. The LOF algorithm is shown below [5].

LOF Algorithm
Input: Data objects, MinPts (k)
Outputs: Ranked list of n objects with highest LOF
(1) Determine the k-distance of P. In this step, for every point P in the dataset, calculate the distance between point P and every other point in the dataset. Those distances are then sorted. The next step is to find the least MinPts (k) (minimum number of objects within the neighborhood of point P) distinct distances, and finally to select the maximum value of those least MinPts (k) distinct distances.
(2) Find the k-distance neighborhood of P. The main purpose of this step is to find the nearest neighbors of each object. This is done by selecting the points whose distances from point P are less than or equal to the k-distance of P.
(3) Compute the reachability distance of P. This step ensures that objects in the neighborhood are of the same type. For a point P, after identifying the neighborhood points, the distance between P and each neighbor point Pj is compared with the k-distance of Pj, and the maximum of the two is selected as the reachability distance of P with respect to Pj, denoted reach-distk(P, Pj).
(4) Compute the local reachability density of P. This is computed as the reciprocal of the average reachability distance from the k-nearest neighbors of P.
(5) Find the local outlier factor of P. It is the ratio that determines the outlier-ness of an object with respect to its neighbors: the smaller the local reachability density of P compared to that of its neighbors, the higher P's local outlier factor will be.
End // LOF
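A compact NumPy rendering of steps (1)-(5) is given below. It follows the standard LOF definitions (with the simplification that the k-distance neighborhood is taken to be exactly the k nearest neighbors) and is meant only to make the computation concrete; it is not the implementation evaluated in Section 3.

import numpy as np
from scipy.spatial.distance import cdist

def lof_scores(X, k=20):
    """Local Outlier Factor for each row of X (larger = more outlying)."""
    D = cdist(X, X)                                  # pairwise distances
    n = len(X)
    order = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbours, self excluded
    k_dist = D[np.arange(n), order[:, -1]]           # step 1: k-distance of each point
    # step 2: neighbourhood = the k nearest neighbours (ties ignored in this sketch)
    # step 3: reachability distances reach_k(p, o) = max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[order], D[np.arange(n)[:, None], order])
    # step 4: local reachability density = 1 / mean reachability distance
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)
    # step 5: LOF(p) = mean of lrd(o) / lrd(p) over p's neighbours
    return (lrd[order] / lrd[:, None]).mean(axis=1)

# points whose LOF is well above 1 (e.g. the top-n scores) are reported as outliers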
2.4 Density Based Local Outliers (LSC-Mine Approach)

LSC-Mine [6] was constructed to overcome the disadvantages of the LOF technique [5] by avoiding the computation of reachability distances and local reachability densities. The authors introduced three new quantities that are used to determine outliers:

Local Sparsity Ratio: the ratio of the cardinality of the k-distance neighborhood of P to the sum of all the actual distances in that neighborhood.
Pruning Factor: the ratio of the sum of the absolute neighborhood distances to the overall sum of the actual neighborhood distances.
Local Sparsity Coefficient: the average ratio of the local sparsity ratio of P to that of its k-nearest neighbors.

The Local Sparsity Ratio and the Pruning Factor are computed for each point. According to the Pruning Factor, a set is constructed that contains the points with a high probability of being outliers. The Local Sparsity Coefficient is then computed only for this set. The objects with the highest Local Sparsity Coefficients are ranked as outliers. A high Local Sparsity Coefficient implies that the neighborhood around the point is not dense, and thus there is a high potential that it is an outlier. A low value of the Local Sparsity Coefficient indicates that the neighborhood around the point is dense, and consequently there is a low potential of the point being an outlier. The LSC-Mine algorithm is shown below [6].

LSC-Mine Algorithm
Input: Data objects, integer k
Outputs: Ranked list of n objects with highest LSC
Other variables: k-distance neighborhood of each object, local sparsity ratio of each object, candidate set, pruning factor
(1) Determine the k-distance of each P
(2) Find the k-distance neighborhood of each P
(3) Determine the local sparsity ratio of each P
(4) Calculate the pruning factor of each P
(5) Obtain the candidate set
(6) Compute LSC using the candidate set
(7) Rank outliers as those with the highest local sparsity coefficients
End // LSC-Mine
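In the same spirit, the local sparsity ratio and local sparsity coefficient can be sketched as follows. The pruning step based on the pruning factor is omitted for brevity, so this version simply scores every point, and the coefficient is oriented so that sparse neighborhoods yield high values, as the interpretation above requires; it is our reading of the description, not the original code.

import numpy as np
from scipy.spatial.distance import cdist

def lsc_scores(X, k=20):
    """Local Sparsity Coefficient of each row of X (larger = more outlying)."""
    D = cdist(X, X)
    n = len(X)
    order = np.argsort(D, axis=1)[:, 1:k + 1]            # k nearest neighbours
    neigh_dist = D[np.arange(n)[:, None], order]          # distances inside the neighbourhood
    # local sparsity ratio: |neighbourhood| / sum of actual distances in it
    lsr = k / (neigh_dist.sum(axis=1) + 1e-12)
    # coefficient: average of the neighbours' sparsity ratios relative to p's own
    return (lsr[order] / lsr[:, None]).mean(axis=1)

# as with LOF, the points with the highest coefficients are reported as outliers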
3 Experimental Results

The objective of our performance experiments is to compare both the time and the accuracy of the three algorithms. Our experiments were performed on several
synthetic and real life datasets. In all our experiments, we have assumed that we have information about the outliers in the dataset thus facilitating the accuracy calculations. EMM, LOF and LSC-Mine are implemented in JAVA. We used R [20] with the package dprep [21] which includes the LOF implementation to verify the correctness of our LOF implementation. From the package dprep we also utilized the KNN imputation [22] for handling missing values. We used Weka [23] to perform dimensionality reduction. The experiments where performed on an IBM Lenovo ThinkPad T61 with intel Centrino Duo 2.4 GHz, the OS is Linux openSuse 11.0 (x86_64). In the following subsections we evaluate both time and outliers’ detection accuracy. 3.1 Time Analysis Our time analysis was performed on synthetic datasets generated from a synthetic dataset generator with standard Gaussian distribution. This allowed us to better control the number of data points and the dimensions of the data. As we increment the input data size and the dimensions we study the behavior of the algorithms. We selected the following data size and dimensions: data-size Data [100, 200… 5000] and the data-dimensions Dim [4, 5, and 10]. And for LOF and LSC-Mine we used the typical values for MinPts [10, 20, and 24] and for EMM we selected Threshold [0.8, 0.99]. We note that EMM threshold of 1 is the greatest threshold distance using the Euclidean distance. 3.1.1 LOF We analyze LOF time complexity by independently studying the two main steps required to compute the LOF factor for each instance of the dataset. The first step is to find the k-distance neighborhood which has a runtime complexity of O (n*time for a K-NN query). This step can be parallelized[25] that would provide an upper bound of Nvt1/O+ (kub – klb)((N(N/P *v + 2Nk)tcomp + Nktcomm)), where P is the number of processes , N is the number of instances, v is the number of variables, and klb and kub are the lower and upper bounds. If a sequential scan is used this leads to O (n2) so this step depends on the implementation of the algorithm. Secondly, finding the maximum outlier factors of all observations in the dataset, this can be done in linear time. Figures 2 show the experimental elapsed running time for LOF, where the results were the same regardless of dimensions, MinPts or threshold. We see that it grows exponentially as the data size and dimensions increases. However, the number of neighbors does not affect the running time. 3.1.2 LSC-Mine LSC-Mine also needs to find the k-distance neighborhood which has a runtime complexity of O (n*time for a K-NN query). However LSC-Mine keeps a pruning factor that can eliminate almost half of the non-possible outlier points. So, for finding the maximum local sparsity coefficient of all observations in the dataset, this can be done in linear time. Thus the complexity of LSC-Mine is still O (n2). Figures 2 show the exponential growth rate of the run time. We note that as the data size and dimensions increases the time increases exponentially.
Fig. 2. Run time for LOF, LSC-Mine, and EMM with MinPts = 20 and EMM Threshold = 0.99 (elapsed time in ms versus number of data points, for data dimension 10)
3.1.3 EMM
Unlike LOF and LSC-Mine, EMM is a linear algorithm. The two main steps required to build EMM can be done in O(nx), where x is the number of clusters. While x has a sublinear growth rate, it can be large. We would therefore expect the behavior of EMM to be much better than that of either LOF or LSC-Mine as the size of the dataset grows. This can be seen in Figure 2. Our experiments validated this regardless of algorithm parameters and dataset.

3.2 Experiments on Real Life Data and Synthetic Datasets
We have applied our experiments to three datasets (two real life and one synthetic) to study the accuracy of LOF, LSC-Mine, and EMM. These datasets are Ozone Level, IR video, and the 1999 and 2000 DARPA DDoS data. We examine two different metrics for performance evaluation: Detection Rate (also referred to as true positive rate, recall or hit rate in the literature) and False Alarm Rate (or false positive rate) [18]. Detection Rate refers to the ratio between the number of correctly detected risks and the total number of actual risks. False Alarm Rate is the ratio between the number of safe cases that are misinterpreted as risks and the total number of alarms. The two metrics are defined in (1) and (2):
Detection rate = TP / (TP + TN)    (1)

False alarm rate = FP / (FP + TP)    (2)
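Expressed directly in code, the two metrics of (1) and (2) are (with a guard against empty denominators, which is our addition):

def detection_rate(tp, tn):
    # definition (1), using the paper's TP/TN counts
    return tp / (tp + tn) if tp + tn else 0.0

def false_alarm_rate(fp, tp):
    # definition (2)
    return fp / (fp + tp) if fp + tp else 0.0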
We also used Principal Component Analysis (PCA), which is one of the most widely accepted linear methods for extracting relevant information from high dimensional datasets. It reduces the complexity of a dataset to a lower dimension to reveal the hidden, simplified structure that often underlies it. PCA projects the data into lower dimensions along the relevant direction vectors, so that the variance of the original data along the principal components is preserved as much as possible [26]. We use KNN imputation [22] for handling missing values.
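The paper performs imputation with the R package dprep and dimensionality reduction with Weka; an equivalent preprocessing pipeline can be sketched with scikit-learn as follows, where the numbers of neighbors and of retained components are illustrative values only.

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA

def preprocess(X, n_neighbors=5, n_components=3):
    """KNN imputation of missing values followed by PCA projection."""
    X = KNNImputer(n_neighbors=n_neighbors).fit_transform(X)   # fill NaNs from neighbours
    return PCA(n_components=n_components).fit_transform(X)     # keep leading components

# X_reduced = preprocess(X_raw)   # X_raw: samples x attributes, with np.nan for gaps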
3.2.1 Ozone Level Detection
This real-life streaming dataset was created by [27] and is available from the Machine Learning database repository at the University of California, Irvine [28]; it allows ozone days to be detected. The dataset has 2536 instances and 73 attributes. The dataset is formatted as follows: the first column contains a date that can be ignored, followed by 71 continuous attributes and 1 nominal class label, with 0 indicating a non-ozone day and 1 an ozone day. There are 73 ozone days; there are also plenty of missing attribute values, which were handled using the KNN imputation technique. Analyzing EMM's ability to detect ozone days using the Jaccard similarity measure without PCA resulted in a detection rate and a false positive rate of zero for typically used thresholds. However, after reducing the data dimension, as observed in Table 2, we obtained a 100% detection rate while not producing any false alarms using a threshold of 0.99. As can be seen in Table 2, the accuracy decreases as the threshold does.
Table 2. EMM detection and false positive rates: Ozone Level Detection with and without Principal Component Analysis (PCA) and KNN imputation

                         With PCA                                       Without PCA
Threshold   TN  FP  TP  Detection Rate  False Positive Rate   TN  FP  TP  Detection Rate  False Positive Rate
0.99         0   0  73      100%               0%             73   0   0        0%               0%
0.95        10   0  63       86%               0%             73   0   0        0%               0%
0.9         20   0  53       73%               0%             73   0   0        0%               0%
0.85        50   0  23       32%               0%             73   0   0        0%               0%
Table 3 shows the ability of LOF to detect ozone days with and without using PCA. It can be observed that LOF was not able to successfully detect the ozone days for typically used MinPts values {5, 10, 15, 20}. It suffered from a large false alarm rate due to the non-stationary data.
Table 3. LOF detection and false positive rates: Ozone Level Detection with and without Principal Component Analysis (PCA) and KNN imputation

                      With PCA                                       Without PCA
MinPts   TN  FP  TP  Detection Rate  False Positive Rate   TN  FP  TP  Detection Rate  False Positive Rate
20       73  73   0        0%              100%            69  69   4        5%              95%
15       73  73   0        0%              100%            69  69   4        5%              95%
10       73  73   0        0%              100%            69  69   4        5%              95%
Results for LSC-Mine are shown in Table 4, with and without using PCA. Notice that it also suffers from a high false alarm rate and was not able to successfully detect the ozone days. However, LOF shows slightly better detection rates than LSC-Mine.
Table 4. LSC-Mine detection and false positive rates: Ozone Level Detection with and without Principal Component Analysis (PCA) and KNN imputation

                      With PCA                                       Without PCA
MinPts   TN  FP  TP  Detection Rate  False Positive Rate   TN  FP  TP  Detection Rate  False Positive Rate
20       72  72   1        1%               99%            73  73   0        0%             100%
15       72  72   1        1%               99%            73  73   0        0%             100%
10       72  72   1        1%               99%            73  73   0        0%             100%
The outliers in this dataset are hard to detect. EMM performed better than either of the other two methods.
3.2.2 IR Video Trajectories
This streaming real-life dataset was created by [29], where the trajectories are extracted from IR surveillance videos using motion detection and tracking algorithms. Each trajectory is represented by five equidistant points in [x, y, time] space (two spatial coordinates on the frame and the time instant); the dimensionality of this feature vector was further reduced to three using principal component analysis [26]. The dataset corresponds to 239 video motion trajectories, where only 2 trajectories (224 and 236) are visually identified as unusual behavior (a person walking right and then back left, and a person walking very slowly). The dataset can be downloaded from [30]. There are no missing attribute values. Analyzing Tables 5, 6, and 7, we observe that EMM, LOF, and LSC-Mine were all able to successfully identify the two outliers without producing any false alarms. However, we notice that LSC-Mine needs to use MinPts = 10 to correctly classify the outliers.

Table 5. EMM detection and false positive rates using PCA (IR video trajectories)
Threshold   TN  FP  TP  Detection Rate  False Positive Rate
0.99         0   0   2      100%               0%
0.95         0   0   2      100%               0%
0.9          0   0   2      100%               0%
0.85         0   0   2      100%               0%
0.7          0   0   2      100%               0%
Table 6. LOF detection and false positive rates using PCA
MinPts   TN  FP  TP  Detection Rate  False Positive Rate
20        0   0   2      100%               0%
15        0   0   2      100%               0%
10        0   0   2      100%               0%
5         0   0   2      100%               0%
3         0   0   2      100%               0%
Table 7. LSC-Mine detection and false positive rates using PCA
MinPts   TN  FP  TP  Detection Rate  False Positive Rate
20        1   1   1       50%              50%
15        1   1   1       50%              50%
10        0   0   2      100%               0%
5         1   1   1       50%              50%
3         1   1   1       50%              50%
Extreme outliers are easy to detect: all three algorithms work well with low values of the input parameter.

3.2.3 1999 and 2000 DARPA DDoS Dataset
In 1999 and 2000, the MIT Lincoln Laboratory conducted a comparative evaluation of intrusion detection systems (IDSs) developed under DARPA funding. This effort was to examine Internet traffic in air force bases. The traffic was generated in a simulation network. The idea was to generate a set of realistic attacks, embed them in normal data, and evaluate the false alarm and detection rates of systems with these data, in order to drive performance improvement of existing IDSs [31]. We use the synthetic DARPA dataset as a test case for LOF, LSC-Mine, and EMM. In order to extract information from the tcpdump datasets of DARPA, the TcpTrace utility software [32] was used. This preprocessing procedure was applied to TCP connection records, ignoring ICMP and UDP packets. The new feature list obtained from the raw tcpdump data using the TcpTrace software is presented in [33]. The preprocessed dataset is structured into nine different features, where each feature presents a statistical count of network traffic within a fixed time interval. Preprocessed network traffic statistics are gathered every 10 seconds for investigation. The DARPA 1999 dataset, which is free of attacks for two weeks (1st week and 3rd week), is used as training data, and the DARPA 2000 dataset, which contains DDoS attacks, is used as test data. We obtained 20270 rows from the first week and 21174 rows from the third week to create the normal dataset, and this dataset is used for modeling. The DARPA 2000 dataset which contains attacks has 1048 rows. Analyzing Tables 8, 9, and 10, we observe that EMM, LOF and LSC-Mine successfully identified the DDoS attack while not producing any false alarms. Notice that LSC-Mine required MinPts = 828 to correctly classify the DDoS attack.

Table 8. EMM detection and false positive rates (DARPA 1999 and 2000)
Threshold   TN  FP  TP  Detection Rate  False Positive Rate
0.9          0  15   1        6%              94%
0.8          0   4   1       20%              80%
0.7          0   1   1       50%              50%
0.6          0   0   1      100%               0%
Table 9. LOF detection and false positive rates
MinPts   TN  FP  TP  Detection Rate  False Positive Rate
20        0   0   1      100%               0%
15        0   0   1      100%               0%
10        0   0   1      100%               0%
5         0   0   1      100%               0%
3         0   0   1      100%               0%
Table 10. LSC-Mine detection and false positive rates
MinPts   TN  FP  TP  Detection Rate  False Positive Rate
825       0   0   1      100%               0%
20        1   1   0        0%             100%
15        1   1   0        0%             100%
10        1   1   0        0%             100%
5         1   1   0        0%             100%
3         1   1   0        0%             100%
The outliers in this dataset are difficult to detect, not as obvious as those in the IR video trajectories dataset. LOF and EMM performed comparatively better than LSC-Mine, as they succeed with low values of MinPts and Threshold, respectively. However, EMM and LSC-Mine were very sensitive to their respective parameters.
4 Conclusions

Our performance studies have shown that EMM outperforms both the LOF and LSC-Mine algorithms. Its time complexity is lower, while its accuracy on the tested datasets was impeccable. It is clear from our evaluation that EMM, LOF, and LSC-Mine have different strengths and weaknesses. The major drawback of the LOF algorithm lies in computing the reachability distances, which involves computing the distances of all objects within q's neighborhood, each compared with the k-distance of that neighborhood. This is quite expensive and results in a quadratic algorithm. LSC-Mine was constructed to avoid the computation of reachability distances and local reachability densities by pruning the data objects which cannot possibly be outliers. However, both algorithms need to compute the LOF and LSC values for every object before the outliers are detected. This is not desirable since outliers constitute only a small fraction of the entire dataset. Their advantage is that they do not require any prior knowledge about the data distribution. EMM, on the other hand, is a linear algorithm that performed with 100% accuracy on the three datasets we examined. EMM can also be used for online processing, as can be seen in the Ozone dataset, where EMM successfully detected all ozone-level days. A drawback of EMM is in finding the optimal threshold value for obtaining the best detection rate. There are several directions for future research. These include the development of an adaptable EMM that dynamically adjusts the threshold. We also need further analysis to investigate its potential for fast detection of anomalies in stream data.
References 1. Elliott, J.: Distributed Denial of Service Attacks and the Zombie Ant Effect. IT Professional 2(2), 55–57 (2000) 2. Zhang, Z.(Mark).: Mining Surveillance Video for Independent Motion Detection. In: Second IEEE International Conference on Data Mining (ICDM 2002), p. 741 (2002)
3. Huang, J., Meng, Y., Dunham, M.H.: Extensible Markov Model. In: Proceedings IEEE ICDM Conference, November 2004, pp. 371–374 (2004) 4. Narayan, U., Bhat, Miller, G.K.: Elements of Applied Stochastic Processes, 3rd edn. John Wiley & sons, Chichester (2002) 5. Breunig, M., Kriegel, H.-P., Ng, R.T., Sander, J.: LOF: Identifying Density-Based Local Outliers. In: Proc. of ACM SIGMOD 2000 Int. Conf. On Management of Data (SIGMOD 2000), Dallas, TX, pp. 93–104 (2000) 6. Agyemang, M., Ezeife, C.I.: Lsc-mine: Algorithm for mining local outliers. In: Proceedings of the 15th Information Resource Management Association (IRMA) International Conference, New Orleans, May 2004, pp. 5–8 (2004) 7. Grubbs, F.E.: Procedures for Detecting Outlying Observations in Samples. Technometrics 11, 1–21 (1969) 8. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient Algorithms for Mining Outliers from Large Datasets. In: Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, TX, pp. 427–438 (2000) 9. Nairac, A., Townsend, N., Carr, R., King, S., Cowley, P., Tarassenko, L.: A System for the Analysis of Jet System Vibration Data. Integrated ComputerAided Engineering 6(1), 53– 65 (1999) 10. Bishop, C.M.: Novelty detection & Neural Network validation. In: Proceedings of the IEE Conference on Vision, Image and Signal Processing, pp. 217–222 (1994) 11. Japkowicz, N., Myers, C., Gluck, M.A.: A Novelty Detection Approach to Classification. In: Proceedings of the 14th International Conference on Artificial Intelligence (IJCAI 1995), pp. 518–523 (1995) 12. Caudell, T.P., Newman, D.S.: An Adaptive Resonance Architecture to Define Normality and Detect Novelties in Time Series and Databases. In: IEEE World Congress on Neural Networks, Portland, Oregon, pp. 166–176 (1993) 13. Carpenter, G., Grossberg, S.: A Massively Parallel Architecture for a Self Organizing Neural Pattern Recognition Machine. Computer Vision, Graphics, and Image Processing 37, 54–115 (1987) 14. John, G.H.: Robust Decision Trees: Removing Outliers from Databases. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 174– 179. AAAI Press, Menlo Park (1995) 15. Skalak, D.B., Rissland, E.L.: Inductive Learning in a Mixed Paradigm Setting. In: Proceedings of the Eighth National Conference on Artificial Intelligence, Boston, MA, pp. 840–847 (1990) 16. Lane, T., Brodley, C.E.: Applications of Machine Learning to Anomaly Detection. In: Adey, R.A., Rzevski, G., Teti, T. (eds.) Applications of Artificial Intelligence in Engineering X11, pp. 113–114. Comput. Mech. Publications, Southampton (1997a) 17. Lane, T., Brodley, C.E.: Sequence matching and learning in anomaly detection for computer security. In: AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, pp. 43–49. AAAI Press, Menlo Park (1997b) 18. Meng, Y., Dunham, M.H., Marchetti, F.M., Huang, J.: Rare Event Detection in A Spatiotemporal Environment. In: Proc. 2nd IEEE Int’l Conf. Granular Computing (GrC 2006), Atlanta, GA, May 10-12 (2006) 19. Zhange, T., Ramakrishnan, R., Livny, M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: Proc. ACM SIGMOD Conference, pp. 103–114 (1996) 20. R.: The R Project for Statistical Computing (2008), http://www.r-project.org/ 21. Edgar Acuna, and members of the CASTLE group at UPR-Mayaguez, Puerto Rico.: dprep: Data preprocessing and visualization functions for classification (2008), http://cran.r-project.org/web/packages/dprep/index.html
22. Acuna, E., Rodriguez, C.: The treatment of missing values and its effect in the clas-sifier accuracy. In: Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W. (eds.) Classification, Clustering and Data Mining Applications, pp. 639–648. Springer, Heidelberg (2004) 23. Weka3.: Data Mining Software in Java (2008), http://www.cs.waikato.ac.nz/ml/weka/ 24. Lazarevic, A., Kumar, V.: Feature Bagging for Outlier Detection. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL (August 2005) 25. Lozano, E., Acuña, E.: Parallel Algorithms for Distance-Based and Density-Based Outliers. In: ICDM (2005) 26. Raychaudhuri, S., Stuart, J.M., Altman, R.B.: Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Pacific Symposium on Biocomputing (2000) 27. Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowledge and Information Systems 14(3) (2008) 28. Blake, C., Mertz, C.: UCI repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998), http://www.ics.uci.edu/mlearn/MLRepository.html 29. Latecki, L.J., Miezianko, R., Megalooikonomou, V., Pokrajac, D.: Using Spatiotemporal Blocks to Reduce the Uncertainty in Detecting and Tracking Moving Objects in Video. International Journal of Intelligent Systems Technologies and Applications 1(3-4), 376– 392 (2006) 30. Lazarević, A.: IR video trajectories (2008), http://www.cs.umn.edu/~aleks/inclof 31. MIT Lincoln Laboratory.: DARPA Intrusion Detection Evaluation (2008), http://www.ll.mit.edu/mission/communications/ist/corpora/ ideval/index.html 32. Tcptrace.: tcptrace is a tool for analysis of TCP dump files (2008), http://jarok.cs.ohiou.edu/software/tcptrace/ 33. Isaksson, C., Meng, Y., Dunham, M.H.: Risk Leveling of Network Traffic Anomalies. Int’l Journal of Computer Science and Network Security (IJCSNS) 6(6) (2006) 34. Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. To Appear in ACM Computing Surveys (2009) 35. Agyemang, M., Barker, K., Alhajj, R.: A Comprehensive Survey of Numeric and Symbolic Outlier Mining Techniques. Intelligent Data Analysis 10(6), 521–538 (2006) 36. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artificial Intelligence Review 22(2004), 85–126
Outlier Detection with Explanation Facility Manuel Mejía-Lavalle and Atlántida Sánchez Vivar Instituto de Investigaciones Eléctricas, Reforma 113, 62490 Cuernavaca, Morelos, México [email protected], [email protected]
Abstract. We propose a simple and efficient method to detect exceptional data, which includes a novel end user explanation facility. After various designs, the best was based on an unsupervised learning schema, which uses an adaptation of the artificial neural network paradigm ART for the cluster task. In our method, the cluster that contains the smaller number of instances is considered as outlier data. The method provides an explanation to the end user about why this cluster is exceptional with regard to the data universe. The proposed method has been tested and compared successfully not only with well-known academic data, but also with a real and very large financial database that contains attributes with numerical and categorical values.
1 Introduction

Outlier detection has become a fast-growing topic in Data Mining [1]. This is because there is an important variety of applications where it is required to know whether there is exceptional data immersed in a very large database. Typical applications are related to financial fraud, but we can also cite other applications of great interest in industry, medicine, astronomy, communications, and more [1]. Although multiple techniques have been proposed and published to attack the problem [2], in most cases the algorithmic complexity is high, and consequently processing times are high: this is a serious drawback if we consider that typical applications are given in databases with thousands or millions of records. Additionally, in many of the methods proposed in the specialized literature, the end user receives the outlier information in a "black box" fashion [1], [2], [3]. To solve this, we propose a simple and efficient method, with low algorithmic complexity, which not only detects outlier data, but also provides the end user with a reasoned explanation of why this data is anomalous with respect to the data universe. This explanation facility has a certain similarity with ideas that emerged from the Expert Systems area [4]. In our case, the proposed explanation facility permits the end user to reach a better understanding of the problem and, consequently, it helps to take better, well-supported and informed actions. Before obtaining the method proposed here, we tried various designs; some of them were taken directly from the specialized literature and some others were adapted. In the end, the most efficient, simple and adequate method for our necessities turned out to be one based on an unsupervised learning schema, with some
similarities to the Case Based Reasoning methodology. To obtain clusters and prototypes, we made diverse novel adaptations to the Adaptive Resonance Theory (ART) artificial neural network paradigm [5]. We tested and compared our method with well-known academic data and we obtained very good results. In addition, we tested the method with a real and very large Mexican financial database, with more than 30 million transactions with numerical and categorical attributes. With this real database, we observe the method’s scalability and good performance, not only to detect the outlier data, but also to provide an articulated explanation related with the reasons applied by our method in considering those transactions as exceptional. We think that the method that we propose can be of interest and of immediate application in a great variety of domains where it is needed to know, in an informed way, if exceptional transactions exist. To develop these ideas, our article is organized in the following way. In Section 2, a brief summary of the state of the art is presented. In Section 3 the proposed method is described and illustrated. In Section 4, experiments and obtained results are shown, and finally in Section 5 conclusions and future works are addressed.
2 Related Work In this Section, we describe state-of-the-art published works for outlier detection, fraud analysis and cluster methods where we can observe that they do not include an automated explanation facility. At most, they include information about the centroids that were formed and a post-run human result interpretation. In [6] the author describes and explains six steps in the cluster analysis process: variable selection, data management, selection of clustering methods, deriving cluster solutions, validating results and interpreting results of three market segments based on entertainment preferences. The purpose of the article is to encourage increased usage of this valuable means of revealing market niches by researchers in the field of hospitality. Kirkos [7] explores the effectiveness of Data Mining classification techniques in detecting firms that issue fraudulent financial statements (FFS) and deals with the identification of factors associated to FFS. This study investigates the usefulness of Decision Trees, Neural Networks and Bayesian Belief Networks in the identification of fraudulent financial statements. Ferreira also explores this area [8]. Chen and Lin [9] proposed to address the credit card fraud problem using a binary support vector system (BSVS). The proposed BSVS is based on the support vector machines (SVM). A genetic algorithm (GA) is employed to select support vectors. Experimental results show that the proposed BSVS is effective especially for predicting a high true negative rate. In [10] Pandit describes the design and implementation of NetProbe, a system for solving problems of spot anomalies and auction fraud. NetProbe models auction users and transactions as a Markov Random Field tuned to detect the suspicious patterns that fraudsters create, and employs a Belief Propagation mechanism to detect likely fraudsters. The experiments show that NetProbe is both efficient and effective for fraud detection.
Srivastava, Kundu, Sural and Majumdar [11] model the sequence of operations in credit card transaction processing using a Hidden Markov Model (HMM) and show how it can be used for the detection of frauds. An HMM is initially trained with the normal behavior of a cardholder. If an incoming credit card transaction is not accepted by the trained HMM with sufficiently high probability, it is considered to be fraudulent. The results show the effectiveness of this approach. In [12], authors describe several methods for data pre-processing that they applied to transform a large, dynamic and relational dataset describing nearly the entirety of the U.S. securities industry and they show how these methods made the dataset suitable for learning statistical relational models. In addition, they developed an innovative technique to infer professional associations by exploiting dynamic employment histories. They show how these pre-processing techniques are combined to provide the necessary foundation for learning high-performing statistical models of fraudulent activity. They empathized that detecting fraud from the highly overlapped and imbalanced fraud dataset is a challenging task. Padmaja, et.al. [13], propose a new approach called extreme outlier elimination and hybrid sampling technique, k Reverse Nearest Neighbors (kRNNs) concept used as a data cleaning method for eliminating extreme outliers in minority regions. Hybrid sampling technique, a combination of SMOTE to over-sample the minority data (fraud samples) and random under-sampling to under-sample the majority data (nonfraud samples) is used for improving the fraud detection accuracy. This method was evaluated in terms of True Positive rate and True Negative rate on the insurance fraud dataset. They conducted the experiments with classifiers namely C4.5, Naïve Bayes, k-NN and Radial Basis Function networks and compared the performance of this approach against simple hybrid sampling techniques. Obtained results shown that extreme outlier elimination from minority class, produce high predictions for both fraud and non-fraud classes. In [14] a novel algorithm for learning mixture models from multivariate data is described. The proposed algorithm takes advantage of TRUST-TECH (Transformation Under Stability-re Taining Equilibria CHaracterization) to compute neighborhoods local maxima on the likelihood surface using stability regions. This method coalesces the advantages of the traditional EM with that of the dynamic and geometric characteristics of the stability regions of the corresponding nonlinear dynamical system of the loglikelihood function. Two phases, namely, the EM phase and the stability region phase, are repeated alternatively in the parameter space to achieve local maxima with improved likelihood values. The EM phase obtains the local maximum of the likelihood function and the stability region phase helps to escape out of the local maximum by moving toward the neighboring stability regions. The algorithm has been tested on both synthetic and real data sets and the improvements in the performance compared to other approaches are demonstrated. In the next Section, we will propose and describe an algorithm that is capable to explain its cluster and outlier data results.
3 Proposed Method To present the proposed method, we first explain the basic algorithm that we chose for the outlier detection task. Next, we detail the similarity metrics that
we apply to numerical and categorical attributes. Then, we introduce the way we obtain data clusters (prototypes). Next, we describe how we conceptualize the explanation facility and, finally, we summarize the complete algorithm. 3.1 Outlier Detection Algorithm The proposed algorithm was adapted from the artificial neural network paradigm known as Adaptive Resonance Theory, or simply ART [5]. Although more sophisticated algorithms exist to obtain clusters, like k-means or Expectation Maximization [1], we chose ART due to its simplicity and acceptable algorithmic complexity. As will be seen in Section 4, our ART variation has a faster processing time than other well-established clustering methods. ART accepts instances (records or transactions) and, depending on the degree of similarity between two instances, either unites them, forming one prototype (a cluster), or separates them, forming two prototypes (two clusters). The original ART algorithm is described in [5]; the user defines a threshold parameter, which we name U, that controls the desired similarity/dissimilarity among instances. The ART paradigm has a certain similarity to the Case-Based Reasoning methodology; nevertheless, ART was originally designed to handle only binary data, for image classification, using an unsupervised learning scheme. For our needs, the ART paradigm is useful for the clustering task: the cluster that contains the smallest number of instances is the one considered as the exception group, or outlier data. 3.2 Metrics To apply the ART paradigm to continuous and categorical attributes, we defined the following similarity metrics, applicable when one instance is compared against another. Continuous-numerical attributes: To apply the proposed metric, attributes with numerical values must first be normalized between 0 and 1. Once this is done, the distance dN between two numerical values of the same attribute A is simply their absolute difference: dN = abs(value1 – value2)
(1)
Categorical attributes: In this case, a value of 1 is assigned if the categorical values are different, and 0 if they are equal (this approach is often used in the Data Mining area, e.g. [15]): dC = 0 if value1 = value2, 1 otherwise
(2)
Total similarity metric: The total similarity metric dT is the average of the attribute distances. That is, it is calculated as the sum of the distances d of each attribute (attributes can
be numerical-continuous dN, or categorical dC) divided by the total number of attributes A: dT = ∑ d / A
(3)
Thus, if the distance dT is 1 or close to 1, the compared instances are different; if the value is 0 or close to 0, the two compared instances are equal or very similar. Equation (3) can easily be modified to give more weight to one attribute (or several) than to the others: if we know beforehand that certain attributes are more important than others, each attribute distance can be multiplied by a weight factor F assigned by a human expert: dT = ∑ (d · F) / A
(3`)
The weight factors F of all the attributes should sum to A. If we do not want to use the weighted scheme, F takes a value of 1 for each attribute. 3.3 Prototype Construction The idea that worked best was to apply incremental prototype construction, following a weighting scheme based on the number of instances accumulated in each prototype. For relatively small databases (fewer than 100 instances or transactions), the weighting scheme is not relevant; it becomes important for databases with thousands of instances. For our purposes, each prototype is a cluster, and the cluster with the fewest accumulated instances is the exceptional cluster or, properly, the outlier data. In addition to the calculation of the distance, or total similarity dT, between an instance and a prototype, a user-defined threshold parameter U determines whether an instance is clustered with some existing prototype or forms a new independent prototype, according to: If U < dT create a new prototype, else combine the instance with the prototype
(4)
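As an illustration only (the original system was implemented in Java; the function names and the dictionary-based instance representation below are ours), the following Python sketch shows how Equations (1)–(4) can fit together: the per-attribute distances, the averaged total similarity dT, and the threshold test that decides whether an instance joins an existing prototype or starts a new one.

```python
# Minimal sketch of Equations (1)-(4). Instances are dicts mapping attribute
# names to values; numeric values are assumed already normalized to [0, 1].

def attribute_distance(v1, v2):
    """Eq. (1) for numeric values, Eq. (2) for categorical ones."""
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        return abs(v1 - v2)          # dN
    return 0.0 if v1 == v2 else 1.0  # dC

def total_similarity(inst, proto, weights=None):
    """Eq. (3) / (3'): (optionally weighted) average attribute distance."""
    attrs = list(inst.keys())
    weights = weights or {a: 1.0 for a in attrs}   # unweighted case: F = 1
    return sum(attribute_distance(inst[a], proto[a]) * weights[a]
               for a in attrs) / len(attrs)

def assign(instance, prototypes, U):
    """Eq. (4): return the nearest prototype, or None if a new one is needed."""
    if not prototypes:
        return None                  # the caller creates the first prototype
    nearest = min(prototypes, key=lambda p: total_similarity(instance, p))
    return None if total_similarity(instance, nearest) > U else nearest
```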
To combine an instance with a prototype we apply the following two weighted schemes. Continuous-numerical attributes: aNP = faTP / (faTP + 1) · aNP + (1 – faTP / (faTP + 1)) · aNI
(5)
where faTP is the total number of instances accumulated in the prototype P for attribute a; aNP is the numerical value of the prototype for attribute a; and aNI is the numerical value of attribute a of the instance I that is going to be added to the prototype. With Equation (5) we weight the contribution of a new instance I to the existing prototype: the more instances already accumulated in the prototype, the smaller the weight assigned to the numerical value of the instance being added. For example, if a numerical attribute of the instance has a value of 0.1, and the prototype has only one accumulated instance whose value for that attribute is 0.9, then the new aNP value is: aNP = 1 / (1 + 1) · 0.9 + (1 – 1 / (1 + 1)) · 0.1 = 0.5, but if the prototype has already accumulated 99 instances, then we have:
aNP = 99 / (99 + 1) · 0.9 + (1 – 99 / (99 + 1)) · 0.1 = 0.892. Categorical attributes: In this case, we update the occurrence frequency of the categorical attribute value that is going to be added to the prototype, using: faVC = faVC + 1
(6)
where faVC is the number of instances accumulated in the prototype with a certain value V of a categorical attribute C. To calculate the distance between a categorical attribute of a prototype and that of a new instance, we employ a weighted variant of the well-known scheme, described in Equation (2), that assigns 1 if the categorical values are different and 0 if they are equal. Our variant is a weighted scheme that considers the occurrence frequency of the categorical value: dBCP = (∑V≠B faVC) / (faC + 1)
(7)
where faC is the total number of instances accumulated for a certain categorical attribute, and dBCP is the distance of the prototype's categorical attribute with respect to the categorical attribute with value B of the new instance. With this, we weight the distance of an instance to the prototype: the more instances with a certain value B of a categorical attribute are accumulated in the prototype, the smaller the distance to an instance with that same categorical value. For example, if the prototype had accumulated only categorical values of type B, the distance would be 0, because the sum of instances with values different from B (∑ faVC) is 0. On the other hand, if there were no value B at all and the prototype had already accumulated 99 instances, the distance would be 99 / 100 = 0.99. The metrics described in Section 3.2 and in the present section are essential to obtain the explanation facility, which is presented next. 3.4 Explanation Facility Once all the database instances have been processed and all the prototypes have been created, the prototype with the fewest accumulated instances is selected, because we consider that it represents the outlier data. This relatively small prototype is compared against the other constructed prototypes and, applying the distance equations (1), (7) and (3) or (3`), we obtain automated explanations, which form the explanation facility. An example of the explanation facility would be:
Cluster X is different from cluster Y in a p% due to:
1. Attribute g (because they differ in a qg%)
2. Attribute h (because they differ in a qh%)
...
N. Attribute n (because they differ in a qn%)
where p% is the value of dT calculated with equations (3) or (3`) and expressed as a percentage; qg%, qh% and qn% are calculated with equations (1) or (7), depending on whether the attribute is numerical or categorical, and are also expressed as percentages. Each attribute is listed in order of importance, that is, beginning with the highest percentage and ending with the lowest one. Additionally, we can show only those attributes whose percentages are over a certain threshold, for example, only the attributes greater than 70%. Following the same idea, we can show the pair of clusters that were the most distant, and the attribute that obtained the greatest distance overall: Clusters X and Y were the most different, with a p%. The most distant attribute was g, with a qg%. With this explanation, the end user can improve his understanding of the problem. For example, he can see whether an attribute related to payments is abnormal, or whether an attribute related to the hour of the day at which the transaction was done is unusual. Furthermore, he can observe whether an attribute related to the number of sales is abnormally rising. 3.5 Proposed Method ART-E According to the above, the proposed method, which we call ART-E (ART with Explanation), can be summarized as shown in Fig. 1.
Given a dataset with R instances and A attributes, P = 0, and a user-defined threshold U:
Normalize between 0 and 1 all the numerical attributes.
Until all the instances have been processed, do:
a) Take an instance at random.
b) Compare it against the P existing prototypes applying equation (3).
c) If the smallest obtained dT is greater than parameter U, create a new prototype P = P + 1; otherwise, combine the instance with the most similar prototype, using equation (5) for numerical attributes and equation (6) for attributes with nominal values.
d) Increment the number n of instances accumulated in the prototype, doing nP = nP + 1.
Show results applying the explanation facility described in Section 3.4.
End of ART-E.
Fig. 1. ART-E Algorithm
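The following Python sketch (ours, not the authors' Java implementation; the class name Prototype and the function art_e are hypothetical) mirrors the loop of Fig. 1, including the prototype updates of Equations (5)–(7). It assumes numerical attributes have already been normalized to [0, 1] and that all instances share the same attribute schema.

```python
import random

class Prototype:
    def __init__(self, instance):
        self.n = 1                                   # accumulated instances (nP)
        self.numeric = {a: v for a, v in instance.items()
                        if isinstance(v, (int, float))}
        self.freq = {a: {v: 1} for a, v in instance.items()
                     if not isinstance(v, (int, float))}  # faVC counters

    def distance(self, instance):
        """Total similarity dT: Eq. (1) for numeric, Eq. (7) for categorical."""
        d = 0.0
        for a, v in instance.items():
            if a in self.numeric:
                d += abs(self.numeric[a] - v)                    # Eq. (1)
            else:
                fa_c = sum(self.freq[a].values())
                others = fa_c - self.freq[a].get(v, 0)
                d += others / (fa_c + 1.0)                       # Eq. (7)
        return d / len(instance)                                 # Eq. (3)

    def absorb(self, instance):
        """Combine the instance with the prototype (Eqs. (5) and (6))."""
        for a, v in instance.items():
            if a in self.numeric:
                w = self.n / (self.n + 1.0)
                self.numeric[a] = w * self.numeric[a] + (1 - w) * v
            else:
                self.freq[a][v] = self.freq[a].get(v, 0) + 1
        self.n += 1

def art_e(instances, U):
    """Cluster normalized instances; the smallest cluster is the outlier group."""
    prototypes = []
    for inst in random.sample(instances, len(instances)):  # random order
        if not prototypes:
            prototypes.append(Prototype(inst))
            continue
        nearest = min(prototypes, key=lambda p: p.distance(inst))
        if nearest.distance(inst) > U:
            prototypes.append(Prototype(inst))   # create a new prototype
        else:
            nearest.absorb(inst)                 # combine with the nearest one
    return sorted(prototypes, key=lambda p: p.n)  # first element: outlier cluster
```

The explanation facility of Section 3.4 would then compare the first (smallest) prototype returned by art_e against the others, attribute by attribute, and report the attributes with the largest distances.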
4 Experiments We conducted several experiments with academic and real datasets to empirically evaluate whether the proposed algorithm ART-E can do better in outlier detection than other well-known clustering algorithms, in terms of processing time, explanation facility and optimal instance clustering. We chose academic datasets for our experiments because they are well known in the Data Mining area. 4.1 Experimentation Details The objective of the experimentation is to observe the behavior of ART-E with respect to clustering quality, response time and the explanation facility. First, we tested our proposed method with two well-known academic datasets obtained from the UCI Machine Learning repository [16]: the Iris dataset, with 150 instances, 3 classes and 4 features or attributes, and the Wine dataset, with 178 samples or instances, 3 classes and 13 attributes. For these academic datasets, the class labels were deleted, thus treating them as unsupervised learning problems. Additionally, we experimented with a very large real Mexican financial database, with more than 30 million transactions with numerical and categorical attributes. This database contains confidential information on financial transactions related to the major Mexican electric utility, Comisión Federal de Electricidad. With this real database, we observe the method's scalability and good performance, not only in detecting the outlier data, but also in providing an articulated explanation of the reasons why ART-E considers those transactions exceptional. In order to compare the results obtained with ART-E, we use Weka's [17] implementation of clustering algorithms (version 3.6). These experiments were executed using Weka's default values. All the experiments were executed on a personal computer with an Intel Core 2 Duo processor at 2 GHz and 1 GB of RAM. ART-E was coded with JDK 6 Update 7 and the NetBeans 6.1 IDE. The results obtained are shown in the following subsection. 4.2 Experimental Results Testing on the UCI academic datasets, we can observe (Table 1) that ART-E outperforms several Weka clustering algorithms and obtains results very near the optimal. Additionally, ART-E requires approximately 95% less processing time than the EM algorithm (see Table 2). To verify that ART-E is able to manage real data, performing the clustering task in an effective and efficient manner and offering articulated explanations understandable to the end user, we executed several experiments with six subsets of the Mexican financial database. In Table 2 we can observe that ART-E obtains results similar to EM, but much faster. For the Banks subset, Weka's EM was unable to manage this data volume due to memory problems. The explanations offered by ART-E were shown to the domain experts, who considered that, with minor changes, these explanations could be understandable and useful for the end users. For confidentiality reasons, Fig. 2 shows results from the ART-E explanation facility with fictitious data only.
Table 1. ART-E results for the Iris and Wine UCI academic datasets

Dataset  Method            Threshold U  Clusters  Cluster's instances
Iris     ART-E             0.3          3         50, 48, 52
         Optimal solution               3         50, 50, 50
         EM                             5         28, 35, 23, 42, 22
         SimpleKMeans                   2         100, 50
         Coweb                          2         100, 50
         FarthestFirst                  2         84, 66
Wine     ART-E             0.25         3         58, 73, 47
         Optimal solution               3         59, 71, 48
         EM                             4         45, 31, 52, 50
         SimpleKMeans                   2         108, 70
         Coweb                          220       clusters with 1 to 10 instances
         FarthestFirst                  2         108, 70
Table 2. ART-E vs EM comparison results for the Financial Database
Sub financial database  Threshold U  Instances  Processing time (secs) ART-E / EM  Clusters ART-E / EM  Outlier instances ART-E / EM
Warehouse               0.3          24         0.140 / 122                        6 / 5                1 / 1
Fixed assets            0.2          39         0.172 / 72                         6 / 3                2 / 2
Net worth               0.7          55         0.187 / 76                         3 / 3                5 / 4
Overheads               0.8          205        0.765 / 89                         9 / 7                12 / 9
Liabilities             0.98         1,746      6.375 / 729                        21 / 13              28 / 19
Banks                   0.99         104,345    358 / N/A                          637 / N/A            563 / N/A
Outlier cluster is different from cluster 3 in a 68.4% due to:
1. Attribute society (because they differ in a 100%)
2. Attribute hour (because they differ in a 76.6%)
3. Attribute invoice-amount (because they differ in a 72.4%)
4. Attribute vendors (because they differ in a 70.8%)
Clusters 4 and 8 were the most different, with a 73.7%. The most distant attribute was society, with an 86.9%.

Fig. 2. ART-E explanation facility example
5 Conclusions and Future Work We have presented a new algorithm for outlier detection that overcomes some drawbacks found in the area, such as excessive processing time and the lack of explanation. The proposed algorithm follows an unsupervised learning scheme (similar to ART), with several novel adaptations and metrics. With the proposed algorithm, we obtained important reductions in processing time and reasonable cluster likelihood, and we obtained reasoned explanations that help the end user gain a better understanding of the problem. In the experiments that we performed, we observed that the proposed algorithm ART-E obtains results comparable to, or better than, well-established clustering methods such as EM, Coweb, k-means and FarthestFirst. Some future research issues arise with respect to improving ART-E, for example: experimenting with other real databases; comparing our approach against other similar methods (e.g., Trust-Tech [14]); using other metric variations and more efficient search methods; and investigating the possibility of obtaining an optimal U threshold automatically.
References 1. Tang, J., Chen, Z.: Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowledge and Information Systems 11(1), 45–84 (2006) 2. Caudil, S., Ayuso, M., Guillen, M.: Fraud detection using a multinomial logit model with missing information. The Journal of Risk and Insurance 72(4), 539–550 (2005) 3. Perner, P.: Concepts for novelty detection and handling base on case-based reasoning process scheme. In: Perner, P. (ed.) ICDM 2007. LNCS (LNAI), vol. 4597, pp. 21–33. Springer, Heidelberg (2007) 4. Waterman, D.: A guide to Expert Systems. Addison-Wesley, Reading (1986) 5. Carpenter, G., Grossberg, S.: Neural dynamics of category learning and recognition: Attention, memory consolidation and amnesia. In: Davis, J. (ed.) Brain structure, learning and memory. AAAS symposium series (1986) 6. Jurowski, C., Reich, A.Z.: An explanation and illustration of cluster analysis for identifying hospitality market segments. Journal of Hospitality & Tourism Research, 67– 91 (2000) 7. Kirkos, E., Spathis, C., Manolopoulos, Y.: Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications 32(4), 995–1003 (2007) 8. Ferreira, P., Alves, R.: Establishing Fraud Detection Patterns Based on Signatures. In: Perner, P. (ed.) ICDM 2006. LNCS, vol. 4065, pp. 526–538. Springer, Heidelberg (2006) 9. Chen, T., Lin, C.: A new binary support vector system for increasing detection rate of credit card fraud. International Journal of Pattern Recognition and Artificial Intelligence 20(2), 227–239 (2006) 10. Pandit, S., Chau, D., Wang, S., Faloutsos, C.: NetProbe: a fast and Scalable System for Fraud Detection in Online Auction Networks. In: Proceedings of the 16th International World Wide Web Conference Committee, Banff, Alberta, Canada, May, 2007, pp. 201– 210 (2007) 11. Srivastava, A., Kundu, A., Sural, S., Majumdar: Credit Card Fraud Detection Using Hidden Markov Model. IEEE Transactions on dependable and secure computing 5(1), 37– 48 (2008)
12. Fast, A., Friedland, L., Maier, M., Taylor, B., Jensen, D., Goldberg, H.G., Komoroske, J.: Relational data pre-processing techniques for improved securities fraud detection. In: 13th International Conference on Knowledge Discovery and Data Mining, San Jose, California, pp. 941–949 (2007) 13. Padmaja, T., Dhulipalla, N., Bapi, R.S., Krishna, P.R.: Unbalanced data classification using extreme outlier elimination and sampling techniques for fraud detection. In: 15th International Conference on Advanced Computing and Communications, pp. 511–516 (2007) 14. Reddy, C.K., Chiang, H., Rajaratnam, B.: Trust-tech-based Expectation maximization for learning finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7), 1146–1157 (2008) 15. Mitra, S., et al.: Data mining in soft computing framework: a survey. IEEE Trans. on neural networks 13(1), 3–14 (2002) 16. Blake, C., Merz, C.: UCI repository of Machine Learning databases, Univ. of California, Irvine (1998), http://www.ics.uci.edu/mlearn/MLRepository.html 17. http://www.cs.waikato.ac.nz/ml/weka (2004)
Concept Learning from (Very) Ambiguous Examples

Dominique Bouthinon¹, Henry Soldano¹, and Véronique Ventos²

¹ L.I.P.N, UMR-CNRS 7030, Université Paris-Nord, 93430 Villetaneuse, France
² LRI, UMR-CNRS 8623, Université Paris-Sud, 91405 Orsay, France
{dominique.bouthinon, henry.soldano}@lipn.univ-paris13.fr, [email protected]
Abstract. We investigate here concept learning from incomplete examples, denoted here as ambiguous. We start from the learning from interpretations setting introduced by L. De Raedt and then follow the informal ideas presented by H. Hirsh to extend the Version Space paradigm to incomplete data: a hypothesis has to be compatible with all pieces of information provided regarding the examples. We propose and experiment with an algorithm that, given a set of ambiguous examples, learns a concept as an existential monotone DNF. We show that 1) boolean concepts can be learned, even at very high incompleteness levels, as long as enough information is provided, and 2) monotone DNF, non-monotone DNF (i.e., including negative literals), and attribute-value hypotheses can be learned in this way, using appropriate background knowledge. We also show that a clever implementation, based on a multi-table representation, is necessary to apply the method with high levels of incompleteness. Keywords: Symbolic concept-learning, Ambiguity, Incomplete data.
1
Introduction
We investigate here the effect of incompleteness in propositional concept learning from examples and in its first order extension: the learning from interpretations setting introduced by [1]. Concept learning from examples relies on a membership relation between hypotheses and examples denoted as cover and such that to be a solution an hypothesis has to cover positive examples and should not cover negative examples of the target concept. This set of solutions, inheriting its partial order from the Hypothesis language, is called the Version Space [2] of the learning problem. This definition of concept learning relies on a complete description of the examples. In [3], the author informally proposes to extend the notion of solution in order to use any piece of information concerning the current example set. The definition of concept learning problems has then to be modified: a hypothesis has now to be in some sense compatible with such pieces of information. We consider the general case, where an example is ambiguous in the following sense: the example is represented as a set of possible complete examples further denoted as possibilities. The idea here is that the true example,
corresponding to an observation, is exactly one of these possibilities, which is thus hidden within the ambiguous example. To take this ambiguity into account we use two relations, compatible+ and compatible−: a hypothesis h is compatible+ with a positive ambiguous example e if h covers at least one possibility of e, while h is compatible− with a negative ambiguous example e whenever there is at least one possibility of e which is not covered by h. As an illustration, consider a world of birds from which we want to learn the concept fly. Any bird is described with the atoms {red, green, migratory, not migratory, light, not light} and a bird is either red or green, either migratory or not migratory, and either light or not light. Now suppose the only thing we know about a given bird is that it is red. Then it is extensionally represented as the ambiguous example e = {{red, migratory, light}, {red, migratory, not light}, {red, not migratory, light}, {red, not migratory, not light}} containing 4 valid possibilities. Here a hypothesis h covers a possibility p if h is included in p. First assume that e is a positive ambiguous example; then h = {migratory} is compatible+ with e since h covers {red, migratory, light}. Assume now that e is a negative ambiguous example; then h is compatible− with e since h does not cover {red, not migratory, light}. An ambiguous example can also be intensionally described as a clausal theory that defines constraints on the universe of instances, together with a set of facts. This is the approach of abductive concept learning [4], in which hypotheses are clauses and the coverage relation is replaced by a procedure of abductive entailment playing the same role as our compatibility relation. Unfortunately the cost of the abductive entailment test applied to each example may become prohibitive whenever we face strong uncertainty. In contrast, the extensional approach presented here uses a simple subsumption test, but strong ambiguity can result in a huge set of possibilities and thus in a prohibitive cost. Our proposal is a rule learning algorithm that returns one of the simplest elements of the Version Space. It uses a compact multi-table representation [5] of ambiguous examples that can lead to an exponential gain in representation size. Furthermore, we will see that only maximal possibilities (following the inclusion order on interpretations) have to be considered when handling a positive example, whereas only minimal ones have to be considered for a negative example.
2
Compatibility of DNF Formulas with Ambiguous Positive and Negative Examples
In learning from interpretations, De Raedt considers an example as a Herbrand interpretation, that is, the assignment of truth values to a set of ground atoms built from a first order language. In concept learning from interpretations a hypothesis is either a CNF formula, i.e. a conjunction of clauses as in LOGAN H [6] and ICL [7] in its CNF mode, or a DNF formula, i.e. a disjunction of partial concept definitions as in ICL in its DNF mode. Our general purpose is to learn such a DNF formula representing a target concept, using both positive and negative ambiguous examples. However we only consider here the propositional case.
Let us first give some notations. Let P be a set of atoms, we will note a1 ∨. . .∨ am ← b1 ∧ . . . ∧ bm a clause both containing positive and negative literals built from P , a1 ∨. . .∨am a clause only containing positive literals, and ¬(b1 ∧. . .∧bm ) a clause only containing negative literals. A clausal theory c1 ∧ . . . ∧ cn , that is a conjunction of clauses, is represented as the set of clauses {c1 , . . . , cm }. Note that an interpretation i can be represented as a clausal theory B(i) having i as its single model. For example consider the set of atoms P = {a, b, c} and the interpretation i = {a, b} (meaning that a and b are true while c is false). Then i can be represented as the clausal theory {a, b, ¬c}. In our framework 1) a hypothesis is a monotone DNF (or DNF+ for short) H = h1 ∨ . . . ∨ hn where each hk is a conjunction of positive litterals, and 2) an ambiguous example is a set of interpretations e = {i1 , . . . , in }, that also has an intentional representation as a clausal theory B(e) having e as its set of models. The purpose here is to find a hypothesis H that is compatible with all ambiguous examples contained in a set E. The compatibility relation defined hereunder extends the coverage relation used in learning from interpretations and in propositional learning: Definition 1 (compatibility relations with DNF). Let H be a DNF and let e be an ambiguous example, then H is compatible+ with e if and only if there exists an interpretation i in e such that i is a model of H, and H is compatible− with e if and only if there exists an interpretation i in e such that i is not a model of H. In what follows we will implement this compatibility relation in an exact way. Furthermore we search for a simplest element in the corresponding Version Space, i.e. a hypothesis H with a minimal number of conjunctive terms hi . For that purpose we will use, as for instance ICL, the popular standard greedy set covering strategy which tends to produce but does not insure a simplest H. It is important to note that when learning monotone DNF, in each ambiguous example we only have to consider maximal or minimal interpretations with respect to the inclusion order. More precisely let i 1 and i 2 . be two interpretations built from the same set of atoms P , each one represented as a set of ground atoms assigned to True, then i 1 is smaller than i 2 iff i 1 ⊂ i 2 . Proposition 1. Let H be a DNF+ hypothesis, then H is compatible+ with a positive ambiguous example e iff there exists a maximal interpretation in e which is a model of H, and H is compatible− with the negative ambiguous example e iff there exists a minimal interpretation in e which is not a model of H. Proof.
As a consequence we only need to keep maximal interpretations when e is a positive ambiguous example, and minimal interpretations when e is a negative one.

1 Proofs of all propositions are available at http://www-lipn.univ-paris13.fr/~soldano/Annexe.pdf
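To make the compatibility relations concrete, here is a minimal Python sketch (ours, not the authors' implementation) of compatible+ and compatible− for a monotone DNF, applied to the bird example of the introduction; the function names and the set-based encoding of possibilities are assumptions made for illustration.

```python
# A hypothesis is a DNF given as a list of terms (frozensets of atoms);
# an ambiguous example is a list of possibilities (sets of atoms assigned True).

def covers(dnf, interpretation):
    """A monotone DNF covers an interpretation iff some term is included in it."""
    return any(term <= interpretation for term in dnf)

def compatible_pos(dnf, ambiguous_example):
    """compatible+: at least one possibility is a model of the DNF."""
    return any(covers(dnf, i) for i in ambiguous_example)

def compatible_neg(dnf, ambiguous_example):
    """compatible-: at least one possibility is NOT a model of the DNF."""
    return any(not covers(dnf, i) for i in ambiguous_example)

# The bird example of the introduction: the only known fact is "red".
e = [{"red", "migratory", "light"}, {"red", "migratory", "not_light"},
     {"red", "not_migratory", "light"}, {"red", "not_migratory", "not_light"}]
h = [frozenset({"migratory"})]
assert compatible_pos(h, e)   # h covers {red, migratory, light}
assert compatible_neg(h, e)   # h does not cover {red, not_migratory, light}
```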
3
LEa: An Algorithm to Learn DNF from Ambiguous Examples
LEa is a standard top-down greedy set covering algorithm whose search space for each partial concept definition is restricted, as in PROGOL [8], to parts of a particular positive example denoted as the seed. LEa learns DNF+ from ambiguous examples and differs from other top-down learners in that 1) it has to maintain the coherence of assumptions made on negative examples, 2) it has to handle ambiguous seeds, and 3) it uses compatibility rather than coverage in order to deal with ambiguous examples. 3.1
Maintaining the Coherence of Assumptions
LEa, described in Algorithm 1, works as follows: a first conjunction h1 compatible with at least one positive example (the seed) and no negative example is selected, then the positive examples compatible with h1 are discarded. Another seed is selected and a new conjunction h2 is searched for in the same way. The process continues building conjunctions hi until there are no more positive examples to consider. As each hi must be compatible− with all negative examples, in our uncertainty setting we have to ensure that the hi's rely on valid assumptions about the negative examples. Suppose for instance that our current DNF is h1 = a, which is compatible− with the negative ambiguous example e = {{a}, {b}} through the second possibility. Thus h1 makes the assumption that the negative example hidden in e is {b}. Now if we check the new term h2 = b, we will find that it is compatible− with e through the first possibility, thus assuming that the negative example hidden in e is {a}. As h1 and h2 rely on contradictory assumptions about e, the DNF h1 ∨ h2 is not compatible− with e. To avoid this situation, we have to discard the possibilities of e that do not match the assumptions made by any hi added to the current DNF. This process is carried out for all negative examples. 3.2
Handling an Ambiguous Seed
The core of LEa is the procedure bestRulea, described in Algorithm 2, whose goal is to find the conjunctive term that will be added to the current DNF. bestRulea uses a beam search that retains, at each step, the W best conjunctions (i.e. the beam) according to the evaluation function. At each step the beam search applies a refinement operator. As in our framework the seed is an ambiguous positive example seed = {i1, . . . , in}, our refinement operator ρa(h, seed) returns the maximally general specializations of h that are compatible+ with seed. Let ρ(h, x) be the usual refinement operator that returns the maximally general specializations of h that cover the positive example x; then ρa(h, {i1, . . . , in}) = ρ(h, i1) ∪ . . . ∪ ρ(h, in). The refinement operator ρa is used in the procedure maximallyGeneralSpecializations. 3.3
Handling the Ambiguity of the Examples
In algorithm bestRulea we associate with each candidate conjunction h an accuracy that is simply the proportion of examples compatible with h: accuracy(h) = (n + p) / (N + P)
Algorithm 1. LEa
input: E+, E−, W  /* examples and width of the beam */
output: DNF  /* a DNF compatible with each example of E+ and E− */
begin
  DNF ← ∅ ;  /* empty disjunction (compatible with no example) */
  while E+ ≠ ∅ do
    h ← bestRulea(E+, E−, W) ;
    DNF ← DNF ∨ h ;
    E+ ← E+ \ {examples of E+ compatible+ with h} ;
    /* update the possibilities of the negative examples */
    for each example e in E− do
      discard each possibility in e that is a model of h ;
    end for ;
    /* now h is compatible− with each possibility of each negative example */
  end while ;
  return DNF ;
end.
where N is the number of negative examples, P the number of positive examples still not compatible+ with the current DNF, n the number of negative examples compatible− with h, and p the number of positive examples compatible+ with h. We also introduce the function quality(h) such that quality(h) = p if h is compatible − with all the negative examples, else quality(h)= 0. Finally our evaluation function is evaluation(h) = max(quality(h), accuracy(h)). 3.4
Multi-tables Representation
The key idea of a multi-table representation is to divide the ambiguous examples into parts called tables so that compatibility with a hypothesis can be checked table by table. A table is associated with a set of connected atoms, that is, atoms that depend on each other. More precisely, two atoms a and b are directly connected when either a = b or a and b both appear in some clause of the background knowledge B; a and b are simply connected when (a, b) belongs to the transitive closure of the relation directly connected. Let us return to the bird example given in the introduction. From the background knowledge B = {red ∨ green, ¬(red ∧ green), migratory ∨ not migratory, ¬(migratory ∧ not migratory), light ∨ not light, ¬(light ∧ not light)}, we can exhibit 3 sets of connected atoms: P1 = {red, green}, P2 = {migratory, not migratory} and P3 = {light, not light}. We use this partition to divide the previous ambiguous example e into 3 tables whose cross product represents the four possibilities of e:
e1: {red}
e2: {migratory}, {not migratory}
e3: {light}, {not light}
We note that each table ei is a set of possibilities described with atoms of Pi .
Algorithm 2. bestRulea
input: E+, E−, W  /* width of the beam */
output: best  /* a conjunction compatible with some examples of E+ and with all examples of E− */
begin
  seed ← any example of E+ ; variabilize(seed) ;
  best ← ∅ ;  /* empty conjunction: compatible+ with all examples and compatible− with no example */
  N ← |E−| ; P ← |E+| ;
  quality(best) ← 0 ; accuracy(best) ← P / (N + P) ; evaluation(best) ← accuracy(best) ;
  C ← {best} ;
  while evaluation(best) < P and C ≠ ∅ do
    S ← maximallyGeneralSpecializations(C, seed) ;
    for each conjunction h in S do
      p ← number of examples of E+ compatible+ with h ;
      n ← number of examples of E− compatible− with h ;
      if n < N then quality(h) ← 0 ; else quality(h) ← p ; endif ;
      accuracy(h) ← (n + p) / (N + P) ;
      evaluation(h) ← max(quality(h), accuracy(h)) ;
    end for ;
    C ← the (at most) W conjunctions of S having the best evaluations ;
    if a conjunction h among C has a better evaluation than best then
      evaluation(best) ← evaluation(h) ; best ← h ;
    endif ;
    C ← C \ {h | quality(h) > 0} ;
  end while ;
  return best ;
end.
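As a small illustration of the scoring used inside bestRulea (a Python sketch rather than the authors' code; the function name and its arguments are ours), the accuracy, quality and evaluation values of a candidate conjunction h might be computed as follows.

```python
def evaluate(h, pos_examples, neg_examples, compatible_pos, compatible_neg):
    """pos_examples: positive ambiguous examples not yet covered by the DNF;
    compatible_pos / compatible_neg: the compatibility tests for h."""
    P, N = len(pos_examples), len(neg_examples)
    p = sum(1 for e in pos_examples if compatible_pos(h, e))
    n = sum(1 for e in neg_examples if compatible_neg(h, e))
    accuracy = (n + p) / (N + P)
    quality = p if n == N else 0   # h must be compatible- with every negative example
    return max(quality, accuracy)
```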
Consider now the hypothesis h ={migratory, not light}, it can be divided in 3 parts with respect to P1 , P2 and P3 : h1 = {}, h2 = {migratory} and h3 = {not light }. To check that h is compatible + with e, we check that each hi is compatible + with the corresponding ei : here h1 covers {red} in e1 , h2 covers {migratory} in e2 and h3 covers {not light} in e3 . As a consequence h covers the possibility {red, migratory, not light } and so is compatible + with e. To check whether h is compatible − with e, now considered as a negative example, we check that at least one hi does not cover the corresponding ei : here h2 does not cover {not migratory} in e2 . As a consequence h does not cover the possibilities {red, not migratory, light} and {red, not migratory, not light }, then h is compatible − with e. We propose now a formal view of this notion of multi-table representation. We will note S = S1 + · · · + Sm a partition of a set S and S = S1 ⊕ . . . ⊕ Sm a weak partition of S: Si s are subsets of S such that Sj ∩ Sk = ∅ (j
≠ k) and S = S1 ∪ . . . ∪ Sm but here some Si may be empty. Note that a partition is a specific weak partition. Let us give two definitions: Definition 2 (projection). Let S be a set of clauses or a set of literals using atoms of P . Let Pk be a subset of P , then Pk (S) is the maximal subset of S that uses only atoms of Pk .
Definition 3 (valid partition). Let B be a set of clauses built from atoms of P . Then P = P1 + . . . + Pm is a valid partition of P with respect to B if and only if B = P1 (B) ⊕ . . . ⊕ Pm (B). As an illustration let P = {a, b, c, d, e, f } and B = {a ← b, b ← c, d ← e}. Then P = P1 + P2 + P3 = {a, b, c} + {d, e} + {f } is a valid partition of P w.r.t. B because B = P1 (B) ⊕ P2 (B) ⊕ P3 (B) = {a ← b, b ← c} ⊕ {d ← e} ⊕ ∅. We observe that P = P1 + P2 = {a, c} + {b, d, e, f } is not a valid partition of P with respect to B because B ⊃ P1 (B) ⊕ P2 (B) = ∅ ⊕ {d ← e}. Let us note M(B)P the models of the clausal theory B expressed with atoms of P . Let us note I1 × . . . × In the cross product between sets of interpretations. For example {{a}, {b}} × {{c}, {d}} = {{a, c}, {a, d}, {b, c}, {b, d}}. Then the following property shows in what circumstances we can split a clausal theory B in tables, so that B is the cross-product of these tables: Proposition 2. Let B be a clausal theory built from P and let P1 +. . .+Pm be a valid partition of P w.r.t. B, then M(B)P = M(P1 (B))P1 × . . . × M(Pm (B))Pm . A direct consequence is that each ambiguous example can be expressed as a cross-product of sets of interpretations. Consider an ambiguous example e, and let B(e) be a clausal theory having e as set of models. Let P1 + . . . + Pm be a valid partition of P w.r.t. B(e), then according to proposition 2 we have e = M(P1 (B(e)))P1 × . . . × M(Pm (B(e)))Pm . From now on M(Pk (B(e)))Pk will be simply noted as Tk (e) and called the k th table of e, and e = T1 (e) × . . . × Tm (e) is called the m-table ambiguous example e. Example 1. Consider P = {a, b, c} and let e = {{a, b}, {a, c}, {a, d}}. Then B(e) is the clausal theory {a, b∨c∨d}. Let P1 +P2 = {a}+{b, c, d} be a valid partition of P w.r.t. B(e) because B(e) = P1 (B(e)) ⊕ P2 (B(e)) = {a} ⊕ {b ∨ c ∨ d}. As a consequence we have e = M(P1 (B(e)))P1 × M(P2 (B(e)))P2 = T1 (e) × T2 (e) = {{a}} × {{b}, {c}, {d}}. Let us define that P1 + . . . + Pm is a valid partition of P w.r.t. e if and only if it is a valid partition w.r.t. B(e). Then there is a m-table representation w.r.t. E if and only if there exists a valid partition P = P1 + . . . + Pm w.r.t. each example of E. In rough words this means that each ambiguous example e of E can be expressed as the cross product T1 (e) × . . . × Tm (e). 3.5
Properties of Multi-tables Representation
When there is no obvious m-table representation (with m > 1) with respect to the set of ambiguous examples E, one can nevertheless compute a multi-table representation by considering specific partitions of P . This supposes computing B(e) for each e and either using a different multi-table representation for each e, or computing a most specific partition P which is valid for all the elements of E. A thorough discussion of this issue is out of the scope of the paper. However we briefly discuss here the intensional case in which each ambiguous example e is represented as a set of facts (ground atoms assigned to either true
or false) represented as a clausal theory F (e) together with a general background knowledge theory B expressing what we know about the universe of instances (e.g. that a bird cannot be both red and green). We consider here a partition of P which is valid with respect to B, then: Proposition 3. Let P1 +. . .+Pm be a valid partition of P with respect to B, and let F (e) be a clausal theory representing a set of ground atoms, then P1 +. . .+Pm is a valid partition with respect to B ∪ F (e). Consider now a partition P = P1 + . . . + Pm and a conjunctive hypothesis h expressed from predicates of P . Then P1 (h) ⊕ . . . ⊕ Pm (h) is a weak partition of h because h is a conjunction of atoms. P1 (h) ⊕ . . . ⊕ Pm (h) is called the m-table representation of h, or simpler a m-table conjunctive hypothesis. As an illustration let P = P1 + P2 + P3 = {a} + {b} + {c}, and let h = {a, b} (representing the conjunction a ∧ b). Then the 3-table representation of h is P1 (h) ⊕ P2 (h) ⊕ P3 (h) = {a} ⊕ {b} ⊕ ∅. Consider the following property : Proposition 4. Let T1 (e)× . . . ×Tm (e) be a m-table ambiguous example and let P1 (h) ⊕ · · · ⊕ Pm (h) be a m-table conjunctive hypothesis. Then: 1. h compatible+ e if and only if each table Tk (e) contains a model of Pk (h) (w.r.t. Pk ). 2. h compatible−e if and only if a table Tk (e) contains an interpretation that is not a model of Pk (h) (w.r.t. Pk ). Proposition 4 allows us to check the compatibility between conjunctive hypothesis and ambiguous examples table by table. Now let us call min(I) (respectively max(I)) the set of smaller (respectively greater) interpretations among the set of interpretations I, then: Proposition 5. Let T1 (e) × . . . × Tm (e) be a m-table ambiguous example. Then: – min(e) = min(T1 (e)) × . . . × min(Tm (e)), – max(e) = max(T1 (e)) × . . . × max(Tm (e)). When there is a m-table representation, according to proposition 5 if e is positive we will only keep the m-table example max(T1 (e))× . . . × max(Tm (e)), if e is negative we will keep min(T1 (e))× . . . × min(Tm (e)). 3.6
LEa Using Multi-table Representations
When we have an m-table representation P = P1 + . . . + Pm (m > 1) with respect to the set of examples E, LEa has to be modified in the following way:
– Each ambiguous example e is represented by a set of tables {T1(e), . . . , Tm(e)} such that e = T1(e) × . . . × Tm(e), where each Tk(e) is either a set of minimal interpretations if e is negative or a set of maximal interpretations if e is positive.
– Each conjunctive hypothesis h is represented by a set of tables {P1(h), . . . , Pm(h)} with h = P1(h) ⊕ . . . ⊕ Pm(h).
– Checking h compatible− e is achieved by checking that at least one table Tk(e) contains an interpretation that is not a model of Pk(h) (Proposition 4.2).
– Checking h compatible+ e is achieved by checking that each table Tk(e) contains a model of Pk(h) (Proposition 4.1).
LEa is implemented in SWI-Prolog [9] and available on request to the first author.
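A rough Python sketch of these table-by-table tests (ours; the authors' system is written in SWI-Prolog) is given below: it applies the min/max reduction of Proposition 5 and the two checks of Proposition 4, assuming each table is a list of partial interpretations encoded as sets of atoms and the hypothesis is given by its projections on the same partition.

```python
def keep_extremes(table, positive):
    """Proposition 5: keep only maximal (positive example) or minimal (negative) rows."""
    strictly = (lambda a, b: a < b) if positive else (lambda a, b: a > b)
    return [i for i in table if not any(strictly(i, j) for j in table)]

def compatible_pos_tables(h_parts, tables):
    """Proposition 4.1: every table contains a model of the matching projection h_k."""
    return all(any(h <= i for i in t) for h, t in zip(h_parts, tables))

def compatible_neg_tables(h_parts, tables):
    """Proposition 4.2: some table contains a row that is not a model of h_k."""
    return any(any(not (h <= i) for i in t) for h, t in zip(h_parts, tables))
```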
4
Convergence
Hereunder we assume that the learning set is obtained by first drawing independent and identically distributed (i.i.d) positive and negative examples from a universe of instances built on {0, 1}n. The universe of instances here is the set of valid instances with respect to a possibly unknown background theory B. A hiding process, that hides the example within an ambiguous example, is applied to each drawn example. In the particular case of missing values, this hiding process corresponds to a blocking process as defined in [10]: the boolean value of each atom of the example can be turned into the value ’ ?’ with a probability p . We suppose now that each k-length part of a valid instance x has a non zero probability to be known as True in an ambiguous e with the same label as x: Proposition 6. If each k-uple (a1 = v1 . . . , an = vn ), part of some valid instance x, has a non zero probability to be known in an ambiguous example with the same label as x, then when learning a k-term-k-DNF in a i.i.d way, the Version Space converges to a set of hypothesis all equivalent on the universe of instances, for a finite number of ambiguous examples. Now recall that LEa translates any DNF problem as a DNF+ problem by adding negated atoms. In LEa , all the possibilities of each ambiguous example are investigated and a hypothesis is stated as a solution by LEa if and only if it belongs to the version space. However the beam search in a bestRulea step is of course not exhaustive. Whenever the seed is not ambiguous, the hypothesis space is built on a subset of the atoms of the seed, and thus the seed2 belongs to this space and does not cover any negative example. However in the case of an ambiguous seed s = {s1 , . . . , sn }, the whole hypothesis space H is the union of several hypothesis space Hi , each built on subsets of a possible complete seed si . The search in bestRulea can then reach a state where no hypothesis in the beam covers the correct si hidden in s. In that case bestRulea can end with no solution. In this case we check whether there exists a possibility in the seed that, as a hypothesis, covers no negative examples. If such a possibility exists, it is returned as a conjunctive term to add to h, otherwise the whole problem has no solution. Given this, the following proposition holds: 2
Or more precisely the most specific term of which the seed is a model. Unexpectedly, sometimes LEaNC is better than LEa, and sometimes LEa is better, but in most cases there is no significant difference between them.
Proposition 7. Let c be a concept that can be represented as a DNF, then LEa always outputs a hypothesis h that belongs to the VS delimited by a set of ambiguous examples of c and so converges, when conditions of proposition 6 are satisfied, to an exact solution for a finite number of ambiguous examples.
5
Experimentation
Our experiments concern attribute-value learning. For each atom ai, an atom not-ai is added to the hypothesis language whenever learning unrestricted DNF. The background knowledge then always contains at least all the clauses of the form (ai ∨ not-ai) and ¬(ai ∧ not-ai). In our experiments, we have compared LEa, with a beam of size 3, to C4.5 and Naive Bayes, as implemented in Weka [11] and denoted as J48 and NBayes. J48 is used in its unpruned setting and with its default parameters. All our problems, but the last one, are artificial: there always exists a coherent and simple solution. When splitting a node, J48 propagates a fraction of the example on each son of the node, according to estimated probabilities. In various experiments, this has been shown to be a very efficient, and still simple, way of dealing with missing values [12]. NBayes represents a simple, robust, and still often accurate probabilistic learner. In all the experiments each learning instance is made incomplete by replacing the truth value of a boolean variable by an unknown tag "?" with a probability p. For each value of p, 100 trials are performed, and average accuracy and standard deviation are computed. Each trial is performed with a random sample of Ne examples as a learning set. The test set is the same for all the trials and contains only complete examples. We have experimented with LEa on a simple boolean problem, further denoted as M. We learn (a1 ∧ a2 ∧ a3) ∨ (a2 ∧ a4 ∧ a5) ∨ (a5 ∧ a6 ∧ a7) ∨ (a7 ∧ a8 ∧ a9) as an unrestricted DNF. The variable a0 is irrelevant here. An example is described by 20 atoms and negated atoms, and the instance space contains 2^10 = 1024 instances, about 40% of which are positive. LEa generates for each example its multi-table representation, thus resulting in 10 tables of two lines, each corresponding to a pair {aj, not aj}. We first consider Ne = 630 and p ranging from 0 to 0.6 and remark that NBayes is not sensitive to the missing values, whereas J48 and LEa have accuracies decreasing from 100% to the accuracy of NBayes. LEa first clearly outperforms J48, with a maximum gain of 9%, and then falls to the level of NBayes at p = 0.6. We then experiment with Ne = 3000 and p ranging from 0.6 to 0.9 and remark that LEa again outperforms J48 and then sharply decreases, and is outperformed by NBayes when p = 0.9. Here the bias of LEa and J48 outperforms NBayes when there is enough information provided by the incomplete examples:
Prog. LEa (630) J48 NBayes
p=0 100 99.16 79.70
p=0.1 99.99 97.40 79.62
p=0.2 99.99 94.85 79.49
p=0.3 99.86 92.38 79.46
p=0.4 98.89(2.57) 89.63(2.82) 79.35(1.10)
p=0.5 92.13(8.13) 85.38(3.39) 79.17(1.39)
p=0.6 78.14(8.21) 79.67(4.39) 79.00(1.35)
Prog. LEa (3000) J48 NBayes
p=0.6 98.77(2.63) 81.71(2.06) 79.81(0.79)
p=0.7 87.16(8.97) 71.83(1.90) 79.82(0.57)
p=0.8 70.26(5.65) 62.61(1.17) 79.72(0.75)
p=0.9 66.36(4.60) 59.98(0.0) 79.03(1.14)
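For reference, the blocking process used throughout these experiments can be sketched as follows (a Python illustration of the protocol described above, not the authors' experimental code; the function name block is ours).

```python
import random

def block(instance, p, rng=random):
    """Replace each boolean value by the unknown tag '?' with probability p."""
    return {a: ("?" if rng.random() < p else v) for a, v in instance.items()}

# An incomplete instance is then expanded into an ambiguous example by
# enumerating the completions of the "?" values that are models of the
# background knowledge B; only those completions are kept as possibilities.
```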
Now we add constraints to the M problem, turning it into the MC problem. We consider that all the instances are models of B = {a0 ← a1, a2 ← a3, a4 ← a5, a6 ← a7, a8 ← a9}. LEa will only consider as possibilities for each ambiguous example e those that are models of B. The multi-table representation here exhibits only 5 tables of the form {ai, not ai, ai+1, not ai+1}, because now a0 is related to a1, a2 is related to a3, and so on. The results are as follows: Prog. LEa (630) J48 NBayes
p=0 100 100 84.56
p=0.1 100 99.56 84.51
p=0.2 99.98 99.07 84.42
p=0.3 99.85 98.42 84.46
p=0.4 99.77(0.83) 97.36(1.91) 84.47(0.99)
p=0.5 98.59(2.27) 94.67(2.72) 84.36(0.94)
p=0.6 94.83(4.88) 88.57(4.06) 84.09(1.23)
Prog. LEa (3000) J48 NBayes
p=0.6 99.34(1.37) 93.94(1.63) 86.29(0.75)
p=0.7 97.54(2.54) 80.53(2.17) 84.33(0.62)
p=0.8 90.86(5.72) 70.35(1.60) 84.25(0.87)
p=0.9 82.40(6.73) 69.82(0.0) 85.54(1.14)
The first comment here is that it is much easier to learn our DNF when the universe of instances is reduced through constraints. LEa, J48 and NBayes perform better in learning MC than in learning M. For instance, learning MC with 630 examples and p = 0.6 results in accuracies from ≈ 95% to ≈ 84%, whereas learning M results in accuracies of ≈ 79%. The second comment is that LEa, again, seems much more resistant to ambiguity, and its accuracy decreases more slowly than those of J48 or the other programs. For instance, when Ne = 3000 and p = 0.9, the accuracy of LEa is close to ≈ 80% when that of J48 is about 70%. However, at such a high level of uncertainty NBayes is still better than LEa. In the next experiment, we investigate accuracies with p = 0.9 and increasing values of Ne ranging from 6000 to 24000 examples. The result clearly is that LEa then benefits from this additional information and outperforms NBayes: Prog. LEa (p=0.9) J48 NBayes
nb=6000 85.28(5.50) 67.48(0.00) 84.80(1.09)
nb=12000 86.28(6.34) 67.70(0.13) 84.22(0.78)
nb=24000 89.26(5.97) 66.41(0.00) 85.84(0.61)
5.1 Problem Breast-w5
In this last experiment we address a problem of the UCI database (Breast cancer Wisconsin) whose accuracy, as reported in [13] ranges from 91 to 97%. There are 9 numeric variables but we only consider the 5 first variables. We use a boolean description of each numeric value by defining atoms as x ≤ x1 , x >
x1 , x ≤ x2 , x > x2 , . . . x ≤ xn , x > xn and adding to the background knowledge all the clauses of the form ¬(x ≤ xi ∧ x > xi ), x ≤ xi ← x ≤ xi+1 , and x > xi ← x > xi−1 . Here the thresholds are computed on all the data but ignoring the label of the instances, and using equal frequency intervals with a maximum of 9 thresholds per numeric variable. The test set contains the last 249 instances whereas the learning set is drawn within the 400 remaining complete examples to which we apply our blocking process with various values of p. Note that here, after the blocking process is applied, the numeric value of a variable x in an instance may still be constrained to an interval, possibly larger than its initial interval ]xi , xi+1 ]. So, in some sense we address also the problem of imprecise values. In our experiment hereunder we consider 100 learning examples and p ranges from 0.5 to 0.95 : Prog. LEa J48 NBayes
p=0.5 94.56(3.2) 96.26(2.3) 98.26(0.2)
p=0.6 94.76(3.0) 95.60(3.0) 98.26(0.2)
p=0.7 95.01(3.1) 95.82(2.6) 98.28(0.2)
p=0.8 94.32(3.6) 94.07(5.4) 98.32(0.2)
p=0.9 92.25(7.3) 89.75(8.0) 98.40(0.2)
p=0.95 90.67(7.9) 78.40(7.2) 98.46(0.26)
Even with very weak information (few examples with many missing values) the various programs perform well. NBayes has a high accuracy; LEa and J48 build very simple solutions but are outperformed by NBayes. J48 in this task first outperforms LEa but begins to decrease for lower values of p. LEa is better when p is greater than 0.9. Clearly, problems with nominal, hierarchical and numeric attributes should be further investigated, but at least on this example, using LEa results in interesting accuracies for high levels of incompleteness. 5.2
CPU-Time
LEa is a beam-search algorithm driven by the accuracy on the learning examples, and in all experiments we used 3 as the beam size. Concerning the benefits of the multi-table implementation, they are clear, as we hardly find any increase in CPU time as the uncertainty probability p grows. For instance, in the MC problem with 3000 examples and p ranging from 0.6 to 0.9, the CPU time on an Intel dual-core machine was about 1 hour per 100 trials for all values of p.
6
Related Work
In the Multiple instance learning setting originally proposed by Dietterich[14] each example e of the target concept is a set {inst 1 ,. . . inst n } of descriptions called instances. A positive example e + works as an ambiguous example : at least one instance (possibly several ones) has to satisfy the target concept3 . A negative example e − works differently : it is required that none of its instances satisfy the target concept. The same setting occurs with multiple part problems, 3
More precisely a boolean function i is associated with each example e: if e is positive ∃inst ∈ e such that f(inst) = true, and if e is negative ∀inst ∈ e, f(inst) = false.
as defined in [15], and in various attempts to propositionalize first order learning problems in order to use variants of efficient propositional or attribute-value learners [16], [17]. A slight modification of LEa allows to address Multiple-Instance problems : a hypothesis h is here compatible− with a negative example e whenever h is not compatible+ with e. We are currently experimenting LEa as a multiple-instance learner. Uncertainty in propositional or attribute-value representations is addressed with basically two approaches: either predicting the complete description or taking into account the missing values when scoring the hypotheses. The former approach includes single or multiple imputation methods [18] and methods that learn from the examples to predict the missing values [19]. In the later approach the scoring function to optimize when searching a preferred solution is weighted according to an estimation of the probability distribution of the possible values for uncertain attributes at each node of a decision tree as in C4.5 [20]. Regarding first order representations, uncertainty has been addressed in works on abduction and induction [21,4].
7
Perspectives and Conclusion
We have discussed in this paper learning from ambiguous examples from a purely logical point of view and shown that the method is efficient, thanks to the multi-table representation, and far more robust to very high levels of uncertainty than popular approaches in Machine Learning, as long as enough examples, even extremely incomplete ones, are provided. However, the experiments here are only preliminary; further experiments have to be performed on various attribute-value and first order problems. Future research directions include experiments on more realistic uncertainty models than the independent blocking process experimented with here, and the search for ways to make the approach robust to various data incompleteness scenarios.
References 1. DeRaedt, L.: Logical settings for concept-learning. Artif. Intell. 95(1), 187–201 (1997) 2. Mitchell, T.M.: Generalization as search. Artif. Intell. 18(2), 203–226 (1982) 3. Hirsh, H.: Generalizing version spaces. Mach. Learn. 17(1), 5–46 (1994) 4. Kakas, A.C., Riguzzi, F.: Abductive concept learning. New Generation Computing 18(3), 243–294 (2000) ´ Macro-operators revisited in inductive logic programming. In: Ca5. Alphonse, E.: macho, R., King, R., Srinivasan, A. (eds.) ILP 2004. LNCS, vol. 3194, pp. 8–25. Springer, Heidelberg (2004) 6. Khardon, R.: Learning horn expressions with logan-h. In: ICML 2000: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 471–478. Morgan Kaufmann, San Francisco (2000)
7. VanLaer, W., DeRaedt, L., Dzeroski, S.: On multi-class problems and discretization in inductive logic programming. In: Ra´s, Z.W., Skowron, A. (eds.) ISMIS 1997. LNCS, vol. 1325, pp. 277–286. Springer, Heidelberg (1997) 8. Muggleton, S.: Inverse entailment and Progol. New Generation Computing 13(3-4), 245–286 (1995) 9. Wielemaker, J.: An overview of the SWI-Prolog programming environment. In: Mesnard, F., Serebenik, A. (eds.) Proceedings of the 13th International Workshop on Logic Programming Environments, Heverlee, Belgium, Katholieke Universiteit Leuven, December 2003, pp. 1–16, CW 371. Katholieke Universiteit Leuven (2003) 10. Schuurmans, D., Greiner, R.: Learning to classify incomplete examples. In: Computational Learning Theory and Natural Learning Systems: Addressing Real World Tasks, pp. 87–105. MIT Press, Cambridge (1997) 11. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (1999) 12. Saar-Tsechansky, M., Provost, F.: Handling missing values when applying classification models. Journal of machine learning research 8, 1623–1657 (2007) 13. Lim, T.S., Loh, W.Y., Shih, Y.S.: A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning 40(3), 203–228 (2000) 14. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1-2), 31–71 (1997) 15. Zucker, J.D., Ganascia, J.G.: Learning structurally indeterminate clauses. In: Page, D.L. (ed.) ILP 1998. LNCS, vol. 1446, pp. 235–244. Springer, Heidelberg (1998) ´ Rouveirol, C.: Lazy propositionalization for relational learning. In: 16. Alphonse, E., Horn, W. (ed.) Proc. of ECAI 2000, pp. 256–260. IOS Press, Amsterdam (2000) 17. Sebag, M., Rouveirol, C.: Resource-bounded relational reasoning: Induction and deduction through stochastic matching. Machine Learning Journal 38, 43–65 (2000) 18. Dick, U., Haider, P., Scheffer, T.: Learning from incomplete data with infinite imputations. In: ICML 2008, pp. 232–239. ACM, New York (2008) 19. Liu, W.Z., White, A.P., Thompson, S.G., Bramer, M.A.: Techniques for dealing with missing values in classification. In: Liu, X., Cohen, P.R., Berthold, M.R. (eds.) IDA 1997. LNCS, vol. 1280, pp. 527–536. Springer, Heidelberg (1997) 20. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco (1993) 21. Dimopoulos, Y., Kakas, A.: Abduction and inductive learning. In: De Raedt, L. (ed.) Advances in Inductive Logic Programming, pp. 144–171. IOS Press, Amsterdam (1996)
Finding Top-N Pseudo Formal Concepts with Core Intents
Yoshiaki Okubo and Makoto Haraguchi
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, N-14 W-9, Sapporo 060-0814, Japan
{yoshiaki, mh}@ist.hokudai.ac.jp
Abstract. We discuss in this paper a method for finding Top-N Pseudo Formal Concepts. A pseudo formal concept (pseudo FC for short) can be viewed as a natural approximation of formal concepts. It covers several formal concepts as its majorities and can work as a representative of them. In a word, such a pseudo FC is defined as a triple (X, Y, S), where X is a closed set of objects, Y a set of primary features and S a set of secondary features. The concept tells us that 1) all of the objects in X are associated with the primary features Y and 2) for each secondary feature y ∈ S, a majority of X is also associated with y. Therefore, X can be characterized not only exactly by Y but also naturally and flexibly by Y ∪ {y} for each secondary feature y. Our task is formalized as the problem of finding Top-N δ-Valid (τ, ρ)-Pseudo Formal Concepts. The targets can be extracted based on clique search. We show that several pruning and elimination rules are available in our search, and design a depth-first branch-and-bound algorithm that uses them. Our experimental results show that pseudo FCs with natural conceptual meanings can be efficiently extracted.
1 Introduction
Formal Concept Analysis [11] is well known as a useful framework for conceptual data analysis. Given an object-feature incidence matrix regarded as a formal context, a formal concept (FC for short) is defined as a pair of closed sets of objects X and features Y, where the former is called the extent and the latter the intent of the concept. The concept means that each object in X shares all of the features in Y and that no object outside X does so. That is, the set of objects X can be uniquely characterized by Y. Thus, a set of objects being the extent of an FC provides a conceptual cluster of objects which is definable by the set of features shared by those objects. As has been pointed out, a meaningful cluster should have a clear explanation of why the objects in the cluster are grouped together, or what the conceptual meaning of the cluster is. From this point of view, therefore, we have investigated methods for extracting meaningful clusters based on FCs [14,15,18].
In general, we can extract a huge number of FCs. One of the reasons is that the definition of FCs is too strict. For a formal concept (X, Y), although X can be exactly characterized by Y, the characterization should be more flexible, as it is in our everyday reasoning. For example, if a majority of X share a feature y such that y ∉ Y, then it would still be natural and reasonable to characterize X by Y ∪ {y}. In other words, X would be regarded as an approximation of the majority (exactly) characterized by Y ∪ {y}. More concretely speaking, let us consider a concept "Bird". According to our ordinary sense, the concept would include individuals such as pigeons, eagles, penguins, parrots, etc. We consider here a feature "fly". Since penguins do not fly, the feature "fly" cannot be shared by all of the individuals. However, since almost all of them have the feature, "fly" would naturally be regarded as a typical feature of "Bird". On the other hand, a feature "speak" is associated with only parrots, which are just a part of "Bird". Therefore, it would not be reasonable to consider "speak" a feature of "Bird". From this observation, we try to naturally combine several FCs into a representative FC which can cover them as its majorities, and to extract only the representatives. Each representative can then work as an approximation of the several formal concepts it covers. Such a representative is called a Pseudo Formal Concept (pseudo FC). It is emphasized here that we especially respect the features in Y to make the concept meaningful. The authors argue that if a concept is actually meaningful and significant for us, it should be explained in terms of some primary features, each of which is associated with all objects belonging to the concept. More precisely speaking, as the basis of our representative, we designate a formal concept (X, Y) which can never be covered by any other concept as a majority. Then, we distinguish two types of features, primary features and secondary features.
Primary Feature: A feature which is shared by all of the objects in X is called a primary feature of X. That is, Y is the set of primary features of X.
Secondary Feature: A feature which is associated with a majority of X is called a secondary feature of X.
Roughly speaking, if an object has all of the primary features, it also has each secondary feature with high probability. That is, we can intuitively say that a feature is secondary if its conditional probability given the primary features is sufficiently high. Our pseudo formal concept is then defined as a triple (X, Y, secondary(X)), where secondary(X) is the set of secondary features of X. The concept tells us that
– Each object in X is associated with at least the primary features Y.
– For each secondary feature y ∈ secondary(X), a majority of X is also associated with y. In this sense, y can be approximately regarded as a typical feature of the concept.
Thus, a pseudo FC represents a more natural and flexible concept than strict FCs.
Although our pseudo FCs can be obtained from just a part of the FCs, even the number of pseudo FCs might still be too large to actually examine and analyze. Therefore, finding only nice pseudo FCs of a certain quality is a practical approach. Following the previous framework for FCs [14,15,18], we define a problem of extracting Top-N Pseudo Formal Concepts. Briefly speaking, we try to extract Top-N pseudo FCs in the sense that their primary features retain a certain degree of quality (as a constraint on intents) and their extents are evaluated as being in the top N (as a preference on extents). We present a depth-first branch-and-bound algorithm for finding Top-N pseudo FCs. It is an extended version of the previous algorithm for Top-N FCs and is also based on a maximum clique algorithm [8,9]. We present several pruning and elimination rules available in our algorithm. Based on the rules, we can prune useless candidate FCs which can never provide our targets and can avoid redundant computation of candidate FCs. Our experimental results show that we can efficiently extract pseudo FCs which have natural and reasonable conceptual meanings.
2 Preliminaries
We introduce in this section some terminology used throughout this paper. Let O be a set of objects (or individuals) and F a set of features (or attributes). For a binary relation R ⊆ O × F, a triple < O, F, R > is called a formal context. If (o, f) ∈ R, we say that the object o has the feature f. Then, for an object o ∈ O, the set of features associated with o is denoted by F_R(o), that is, F_R(o) = {f ∈ F | (o, f) ∈ R}. Given a formal context < O, F, R >, for a set of objects X ⊆ O and a set of features Y ⊆ F, we define two mappings ϕ : 2^O → 2^F and ψ : 2^F → 2^O as follows:

ϕ(X) = {f ∈ F | ∀o ∈ X, f ∈ F_R(o)} = ∩_{o∈X} F_R(o)   and   ψ(Y) = {o ∈ O | Y ⊆ F_R(o)}.

The former computes the set of features shared by every object in X. The latter, on the other hand, returns the set of objects having all features in Y. Based on these mappings, for a set of objects X ⊆ O and a set of features Y ⊆ F, the pair (X, Y) is called a formal concept (FC) under the formal context if and only if ϕ(X) = Y and ψ(Y) = X, where X and Y are called the extent and the intent of the concept, respectively. From the definition, it is easy to see that ψ(ϕ(X)) = X and ϕ(ψ(Y)) = Y. That is, a formal concept is defined as a pair of closed sets of objects and features under the mappings. The set of all formal concepts under the context is denoted by FC. For a set of objects X and a set of features Y, let X′ = ψ(ϕ(X)) and Y′ = ϕ(ψ(Y)). Then we always have X′ = ψ(ϕ(X′)) and Y′ = ϕ(ψ(Y′)). That is, the compound mappings ψ ◦ ϕ and ϕ ◦ ψ define closure operators. In what follows, both of these compound mappings are simply denoted by closure.
For a set of objects X, we can uniquely obtain a formal concept defined as (closure(X), ϕ(X)). Dually, (ψ(Y), closure(Y)) is the formal concept uniquely defined for a set of features Y. Let (X, Y) and (X′, Y′) be formal concepts. If X ⊆ X′ (or, equivalently, Y ⊇ Y′), then we say that (X, Y) precedes (X′, Y′) and denote it by (X, Y) ⪯ (X′, Y′). Under this ordering, the set of formal concepts in a formal context forms a lattice, called a concept lattice. An undirected graph is denoted by G = (V, E), where V is a set of vertices and E ⊆ V × V a set of edges. In the graph G, the set of vertices adjacent to a vertex v is referred to as N_G(v), that is, N_G(v) = {u ∈ V | (v, u) ∈ E}.
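To make these operators concrete, the following minimal Python sketch (not part of the paper; the toy context and all names are invented for illustration) implements ϕ, ψ and closure over a small formal context:

```python
# A toy formal context <O, F, R>; the relation R maps each object to its features.
R = {
    "pigeon":  {"bird", "fly"},
    "eagle":   {"bird", "fly", "hunt"},
    "penguin": {"bird", "swim"},
}
O = set(R)                              # objects
F = set().union(*R.values())            # features

def phi(X):
    """Features shared by every object in X (intersection of their feature sets)."""
    return set.intersection(*(R[o] for o in X)) if X else set(F)

def psi(Y):
    """Objects possessing every feature in Y."""
    return {o for o in O if set(Y) <= R[o]}

def closure(X):
    """The closure operator psi ∘ phi on object sets."""
    return psi(phi(X))

X = {"pigeon", "eagle"}
print(phi(X))          # {'bird', 'fly'}
print(closure(X))      # {'eagle', 'pigeon'}  ->  (X, phi(X)) is a formal concept
```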
3 Pseudo Formal Concepts
In this section, we introduce the notion of Pseudo Formal Concepts (pseudo FCs). Before giving its formal definition, we briefly discuss which kind of concepts we would really like to obtain. From the definition, a formal concept (X, Y) tells us that every object in X shares the set of features Y and that no further object does. In other words, X is exactly characterized by the features in Y. Thus, the intent Y of the concept provides its conceptual meaning. However, the definition seems too strict. For example, if a majority of X share a feature y such that y ∉ Y, then it would still be natural to characterize X by Y ∪ {y}, as we actually do in ordinary situations. In other words, X would be regarded as an approximation of the majority characterized by Y ∪ {y}. This motivates us to relax the definition of formal concepts so that we can obtain more natural and flexible concepts which fit our intuition. Roughly speaking, we try to combine several formal concepts into a representative formal concept which can cover them as its majorities. We call such a representative a Pseudo Formal Concept (pseudo FC). In order to formalize it, we first introduce the notion of τ-majorities of extents. It is based on the same idea as the approximation of itemsets in [13,17].
Definition 1. (τ-Majority of Extent) Let (X, Y) and (X′, Y′) be formal concepts such that (X′, Y′) ⪯ (X, Y). Given a real-valued parameter τ (0 < τ ≤ 1.0), X′ is said to be a τ-majority of X if and only if |X′| / |X| ≥ τ. The parameter τ is called a majority threshold. We often simply say that (X, Y) covers (X′, Y′) if X′ is a τ-majority of X.
As has been mentioned above, it is reasonable to consider that a representative concept approximates its majorities. Therefore, a formal concept which can never be a majority of any other concept is regarded as such a representative.
Definition 2. (τ-Isolatedness of a Formal Concept) Let (X, Y) be a formal concept and τ a majority threshold. If, for any formal concept (X′, Y′) ∈ FC such that (X′, Y′) ≠ (X, Y), (X, Y) cannot be covered by (X′, Y′), then (X, Y) is said to be τ-isolated.
Since a τ-isolated formal concept cannot become a τ-majority of any other, we designate such a τ-isolated concept as the basis of our representative. It can then work as an approximation of the several formal concepts covered by the representative. In order to discuss our representative concepts more precisely, we divide features into two classes, primary features and secondary features, based on majorities. For a τ-isolated formal concept (X, Y), since the features in Y are shared by all of the objects in X, they can be considered primary features for X. On the other hand, a feature y associated with just a majority of X is regarded as a secondary feature for X. It is formally defined as follows.
Definition 3. (τ-Secondary Features) Let (X, Y) be a formal concept under a formal context < O, F, R > and τ a majority threshold. A feature y ∈ F \ Y is said to be τ-secondary for X if and only if ψ(Y ∪ {y}) is a τ-majority of X, that is, |ψ(Y ∪ {y})| / |X| ≥ τ. The set of τ-secondary features for X is referred to as secondary_τ(X).
From the theoretical properties of the mapping ψ, ψ(Y ∪ {y}) ⊂ X always holds. That is, a feature is considered to be secondary iff it is associated with at least τ · |X| objects in X. In a word, the intent of our representative concept consists of primary features and secondary features. Needless to say, the conceptual meaning of such a representative is mainly provided by the primary features, because they are shared by all of the objects belonging to the concept. Therefore, the primary features should be closely related to each other in some sense. In this paper, we take the unity of a set of features into account. It is evaluated based on a correlation between features.
Definition 4. (Correlation between Features) Let < O, F, R > be a formal context. For a pair of features f and f′ in F, the correlation between f and f′, denoted by correl(f, f′), is defined as correl(f, f′) = |ψ(f) ∩ ψ(f′)| / |ψ(f) ∪ ψ(f′)|.
Definition 5. (Unity of a Feature Set) For a set of features F ⊆ F, the unity of F, denoted by unity(F), is defined as unity(F) = min_{f, f′ ∈ F} correl(f, f′).
From the definition, the features in F are correlated with each other to a degree of at least unity(F). We therefore provide a minimum unity threshold ρ (0 < ρ ≤ 1.0) so that we can ensure the quality of feature sets, particularly primary feature sets. We can now formally define our notion of Pseudo Formal Concepts, which can work as representatives.
Definition 6. ((τ, ρ)-Pseudo Formal Concepts) Let τ be a majority threshold and ρ a unity threshold. If a formal concept (X, Y) is τ-isolated and unity(Y) ≥ ρ, the triple (X, Y, secondary_τ(X)) is called a (τ, ρ)-Pseudo Formal Concept. X is called the extent and Y ∪ secondary_τ(X) the pseudo-intent of the concept, where Y is the primary feature set and secondary_τ(X) the secondary feature set. The primary feature set is also called the core of the pseudo-intent.
From the above definition, for each feature y ∈ secondary_τ(X), we can obtain the formal concept (X′ = ψ(Y ∪ {y}), Y′ = closure(Y ∪ {y})) such that (X′, Y′) ⪯ (X, Y) and X′ is a τ-majority of X. Thus, (X, Y, secondary_τ(X)) can work as an approximation to each such concept (X′, Y′).
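Continuing the toy sketch from Section 2 (the helpers R, F, phi, psi and closure defined there are assumed), the following hypothetical snippet computes τ-secondary features, the pairwise correlation and the unity of a core intent; X is assumed to be the extent of a τ-isolated concept, and the isolatedness test itself is omitted for brevity:

```python
from itertools import combinations

def secondary(X, Y, tau):
    """tau-secondary features of a formal concept (X, Y), as in Definition 3."""
    return {y for y in F - Y if len(psi(Y | {y})) / len(X) >= tau}

def correl(f, g):
    """Correlation of two features (Definition 4)."""
    a, b = psi({f}), psi({g})
    return len(a & b) / len(a | b)

def unity(Y):
    """Minimum pairwise correlation within a feature set (Definition 5)."""
    return min(correl(f, g) for f, g in combinations(Y, 2)) if len(Y) > 1 else 1.0

def pseudo_fc(X, tau, rho):
    """Assemble (X, Y, secondary_tau(X)) when the core intent meets the unity constraint."""
    Y = phi(X)
    return (X, Y, secondary(X, Y, tau)) if unity(Y) >= rho else None
```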
4 Top-N Pseudo Formal Concept Problem
In this section, we define our problem of finding pseudo formal concepts with the top N evaluation values under some constraint. It is well known that, given a formal context, there exists an enormous number of formal concepts. Although our pseudo FCs can be obtained from just a part of them, even the number of pseudo FCs might still be too large to actually examine and analyze. Therefore, finding only nice pseudo FCs of a certain quality is a practical approach. In general, a formal concept with a too small intent would provide an obscure conceptual meaning; the intent of a meaningful concept must consist of a reasonable number of features. Following our previous work [14,15,18], it is plausible to impose a constraint on the size of the intent in order to extract meaningful pseudo FCs. More concretely speaking, the primary feature set (that is, the core intent) of a pseudo FC to be extracted is required to contain at least δ features. A pseudo FC satisfying the constraint is said to be δ-valid. Under this constraint on core intents, we try to find pseudo FCs whose extent sizes are in the Top-N. For a pseudo formal concept, the larger the extent, the smaller the core intent tends to be. By finding pseudo FCs with Top-N extent sizes, therefore, we can avoid extracting concepts with too large core intents, which cannot be interpreted easily. Our problem of finding Top-N pseudo FCs is formally defined as follows.
Definition 7. (Top-N δ-Valid (τ, ρ)-Pseudo Formal Concept Problem) Let < O, F, R > be a formal context, δ a validness threshold, τ a majority threshold, ρ a unity threshold and N an integer for Top-N. The problem of finding Top-N δ-valid (τ, ρ)-pseudo formal concepts is to extract the set of (τ, ρ)-pseudo formal concepts (X, Y, secondary_τ(X)) such that
Constraint: (X, Y, secondary_τ(X)) is δ-valid, that is, |Y| ≥ δ, and
Preference: |X| is in the top N.
Remark: In the definition, although the extent and the (core) intent are evaluated by their sizes, this is not an absolute restriction. For example, for a more general evaluation, one might assign a positive real-valued weight to each object and feature. A set of objects or features can then be evaluated by the sum of their weights. It should be noted here that our evaluation functions are assumed to be monotonically increasing under set-inclusion. Under this assumption, we can design an algorithm for finding Top-N pseudo formal concepts with some effective pruning rules, as will be discussed shortly.
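A small illustrative helper (hypothetical, not the paper's code) for the tentative Top-N list implied by the Preference part of Definition 7; the evaluation value can be |X| or any other monotone evaluation as in the Remark, and ties at rank N are not kept here:

```python
def update_topN(top, concept, value, N):
    """Keep the concepts whose evaluation value is among the N largest seen so far;
    returns the current N-th best value, which later serves as a pruning threshold."""
    top.append((value, concept))
    top.sort(key=lambda vc: vc[0], reverse=True)
    del top[N:]
    return top[-1][0] if len(top) == N else 0
```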
5 Algorithm for Finding Top-N Pseudo Formal Concepts
We present here an algorithm for finding Top-N δ-valid (τ, ρ)-pseudo formal concepts based on a depth-first clique search in an undirected graph. Before going into details, we describe the basic strategy of our algorithm.
5.1 Basic Search Strategy
Let < O, F, R > be a formal context. For each formal concept (X, Y) in the context, it is easy to verify that there exists a set of objects X′ ⊆ O such that ψ(ϕ(X′)) = closure(X′) = X and ϕ(X′) = Y. That is, each formal concept can be obtained from a set of objects. Furthermore, each pseudo formal concept can be constructed from an isolated formal concept. We can, therefore, easily enumerate all pseudo FCs by applying the mappings ϕ and ψ to each subset of O. Let ≺ be a linear (total) order on O. For each subset X ⊆ O, we assume the objects in X are always ordered under ≺, where the first object is referred to as head(X) and the last one as tail(X). Based on this ordering, we can define a partial ordering ≺_{2^O} on 2^O: for X, X′ ⊆ O, X ≺_{2^O} X′ iff X is a prefix of X′. It should be emphasized here that the partially ordered set (2^O, ≺_{2^O}) gives a tree whose root node is the empty set. In particular, for each internal node X, its children are simply obtained as X ∪ {x}, where x ∈ O and tail(X) ≺ x. Thus, any subset of O can be generated systematically without any duplication. We, therefore, try to extract Top-N pseudo formal concepts by exploring the tree in a depth-first manner. A list which stores the Top-N pseudo FCs found so far, that is, the tentative Top-N pseudo FCs, is maintained during our search. A set of objects X ⊆ O is checked for δ-validness, namely whether |ϕ(X)| ≥ δ holds or not. If it is δ-valid, then we examine whether (closure(X), ϕ(X)) is τ-isolated and
whether unity(ϕ(X)) ≥ ρ holds. If both are true, we compute the set of secondary features secondary_τ(closure(X)) and construct the (τ, ρ)-pseudo FC (closure(X), ϕ(X), secondary_τ(closure(X))). Then, the tentative Top-N list is updated accordingly for this pseudo FC. After that, a child of X is generated and the same procedure is recursively performed on the child. Starting with the initial X as the empty set ∅, we iterate the procedure in a depth-first manner until no X remains to be examined. Note here that for any X and X′ such that X ⊆ X′, ϕ(X) ⊇ ϕ(X′) holds, that is, |ϕ(X)| ≥ |ϕ(X′)|. Hence, a simple pruning rule based on δ-validness is available for the basic algorithm.
Pruning 1 [15,18]: For a set of objects X ⊆ O, if |ϕ(X)| < δ, then there is no need to examine any descendant of X.
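As an illustration of this basic strategy (a hypothetical sketch reusing the phi and closure helpers from Section 2; isolation, unity and secondary features are left out), the prefix-ordered subset tree can be walked depth-first with Pruning 1 as follows:

```python
def basic_search(objects, delta, N):
    """Exhaustive depth-first walk of the subset tree with Pruning 1 only."""
    order = sorted(objects)               # a fixed linear order on O
    found = {}                            # extent -> intent (deduplicated)

    def expand(X):
        Y = phi(X)
        if len(Y) < delta:                # Pruning 1: no descendant can be delta-valid
            return
        found[frozenset(closure(X))] = frozenset(Y)
        last = order.index(max(X, key=order.index))
        for x in order[last + 1:]:        # children extend X strictly after tail(X)
            expand(X | {x})

    for x in order:
        expand({x})
    # keep the N concepts with the largest extents
    return sorted(found.items(), key=lambda kv: len(kv[0]), reverse=True)[:N]
```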
5.2 Finding Top-N Pseudo FCs Based on Clique Search
Although the basic algorithm just discussed enjoys a pruning rule, it still has to be improved for efficient computation. It can be improved with the help of clique search (e.g. [8,9]), as has been discussed in [15,18]. In addition, the unity constraint provides us with an effective pruning rule based on a graph-theoretical property.
Constructing an Undirected Graph for Objects. Given a formal context < O, F, R > and a validity threshold δ, an undirected graph G_O = (O, E) is constructed, where the set of edges E is defined as E = {(x, x′) | x, x′ ∈ O such that x ≠ x′ ∧ |F_R(x) ∩ F_R(x′)| ≥ δ}. That is, if x and x′ share at least δ features, then they are connected by an edge. It should be noted here that for each δ-valid FC, its extent induces a clique in G_O. This means that Top-N δ-valid pseudo FCs can be extracted by exploring only cliques in G_O.
Clique-Tree. Since each clique Q in the graph G_O is a subset of O, (closure(Q), ϕ(Q)) is a formal concept. In particular, from the graph construction process, it is easy to see that for each δ-valid FC (X, Y), we always have a clique Q such that closure(Q) = X and ϕ(Q) = Y. This implies that the subsets of O to be examined in the basic algorithm can be restricted to cliques in G_O. Note here that the ordering ≺_{2^O} is still valid for cliques because any subset of a clique is also a clique. Thus, the subset-tree to be explored can be restricted to a clique-tree. It should be emphasized that Pruning 1 is still available for the clique-tree. For a clique Q, a child is obtained by adding a certain object to Q. Such an element to be added is called an extensible candidate for Q and is formally defined as follows.
Definition 8. (Extensible Candidates for a Clique) Let G = (V, E) be a graph and Q a clique in G. A vertex v ∈ V adjacent to every vertex in Q is called an extensible candidate for Q. The set of extensible candidates is referred to as cand(Q), that is, cand(Q) = {v ∈ V | ∀u ∈ Q, (v, u) ∈ E}.
Since it is obvious from the definition that for any extensible candidate v ∈ cand(Q), Q ∪ {v} always forms a clique, we can easily generate the children of Q by adding each v ∈ cand(Q) such that tail(Q) ≺ v. Thus, we can also explore a clique-tree in a depth-first manner. A simple theoretical property of cliques provides us with a pruning mechanism adopted in many efficient algorithms for finding a maximum clique (e.g. [8,9]). In our case of finding Top-N pseudo FCs, it is stated as the following simple pruning rule.
Pruning 2: Assume tentative Top-N pseudo FCs are already stored in our Top-N list. Let min be the minimum size of the extents stored in the list. For a clique Q in G_O, if |Q ∪ cand(Q)| < min, then there is no need to examine any descendant of Q.
Avoiding Generation of Redundant Cliques. Our clique-tree in general contains many redundant cliques from which we obtain identical formal concepts. Therefore, for efficient computation it is necessary to avoid generating such redundant cliques. The following observations give us rules by which redundant cliques can be safely and completely excluded [18].
Observation 1: For each formal concept (X, Y), there always exists a clique Q in G_O such that closure(Q) = X and head(Q) = head(X).
Observation 2: Let Q be a clique in G_O. For any α ∈ closure(Q) \ Q, closure(Q ∪ {α}) = closure(Q) and ϕ(Q ∪ {α}) = ϕ(Q).
Observation 3: Let Q be a clique in G_O and Q ∪ {α} a child of Q. For any element β ∈ closure(Q ∪ {α}) \ closure(Q) such that β ≺ α, closure(Q ∪ {α}) = closure(Q ∪ {β}) holds.
Each of the observations provides us with an elimination rule.
Elimination 1: For a clique Q, if head(Q) ≠ head(closure(Q)) holds, then no descendant of Q needs to be examined.
Elimination 2: For any element α ∈ closure(Q) \ Q such that tail(Q) ≺ α, Q ∪ {α} and its descendants do not have to be examined.
Elimination 3: Let Q be a clique in G_O and Q′ a child of Q. If there exists an element α ∈ closure(Q′) \ closure(Q) such that α ≺ tail(Q′), then no descendant of Q′ has to be examined.
Theorem 1. Eliminations 1, 2 and 3 safely and completely exclude redundant cliques which give identical FCs.
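The object graph G_O, the candidate set cand(Q) of Definition 8 and the size test of Pruning 2 can be sketched as follows (hypothetical code; the relation R from the earlier snippets is assumed and Q is a non-empty clique):

```python
def build_object_graph(objects, delta):
    """G_O: connect two distinct objects iff they share at least delta features."""
    adj = {o: set() for o in objects}
    for x in objects:
        for y in objects:
            if x != y and len(R[x] & R[y]) >= delta:
                adj[x].add(y)
    return adj

def cand(Q, adj):
    """Extensible candidates: vertices adjacent to every vertex of the clique Q."""
    return set.intersection(*(adj[v] for v in Q)) - set(Q)

def passes_pruning2(Q, adj, current_min):
    """Pruning 2: descendants of Q are useless if |Q ∪ cand(Q)| < min."""
    return len(set(Q) | cand(Q, adj)) >= current_min
```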
[Input]
  < O, F, R > : a formal context
  δ : a validness threshold
  τ : a majority threshold
  ρ : a unity threshold
  N : an integer for Top-N
[Output]
  PFC : the set of Top-N δ-valid (τ, ρ)-pseudo formal concepts

procedure main() :
  PFC ← ∅ ;
  Construct G_O based on δ ;
  Construct G_F based on ρ ;
  for each x ∈ O in predefined order do begin
    TopNPseudoFCFind({x}, N_{G_O}(x), ϕ({x}), ∅, PFC, 0) ;
  end
  return PFC ;

procedure TopNPseudoFCFind(Q, C, I, P, PFC, min) :
  if head(Q) ≠ head(closure(Q)) or ∃x ∈ closure(Q) \ P such that x ≺ tail(Q) then
    return ;
  else
    if there exists no formal concept (X, Y) such that closure(Q) is a τ-majority of X then
      TopNListUpdate(PFC, (closure(Q), I, secondary_τ(closure(Q))), min) ;
    endif
  endif
  for each x ∈ C \ closure(Q) such that tail(Q) ≺ x in predefined order do begin
    if f_{G_F}(I ∩ F_R(x)) < δ or |Q ∪ {x} ∪ (cand(Q) ∩ N_{G_O}(x))| < min then
      continue ;
    else
      TopNPseudoFCFind(Q ∪ {x}, cand(Q) ∩ N_{G_O}(x), I ∩ F_R(x), closure(Q), PFC, min) ;
    endif
  end

procedure TopNListUpdate(PFC, pfc, min) :
  PFC ← PFC ∪ {pfc} ;
  if PFC tentatively contains N-th ones then
    min ← the N-th evaluation value of extents in PFC ;
    Remove the M-th ones from PFC such that N < M ;
  else
    min ← 0 ;
  endif
Fig. 1. Algorithm for Finding Top-N δ-Valid (τ, ρ)-Pseudo Formal Concepts
Pruning Invalid FCs Based on the Unity Constraint. Our Top-N pseudo FCs should be valid and satisfy the unity constraint. That is, for any formal concept (X, Y), if |Y| ≥ δ or unity(Y) ≥ ρ does not hold, then (X, Y) can never be the basis of one of our targets. We try to identify such useless FCs with the help of an undirected graph for features. Given a formal context < O, F, R > and a unity threshold ρ, we construct an undirected graph G_F = (F, E), where the set of edges E is defined as E = {(y, y′) | y, y′ ∈ F such that y ≠ y′ ∧ correl(y, y′) ≥ ρ}. That is, if y and y′ are correlated with degree at least ρ, they are connected by an edge. If a set of features Y ⊆ F satisfies the unity constraint, Y has to induce a clique in G_F. Furthermore, the size of the clique must be greater than or equal to δ for validity. This observation gives us a pruning rule. Suppose we have an estimation function f_{G_F} which computes an upper bound on the maximum clique size in the subgraph of G_F induced by a set of features Y. Furthermore, the function f_{G_F} is assumed to be monotone under set-inclusion, that is, for any feature sets Y and Y′ such that Y ⊆ Y′, f_{G_F}(Y) ≤ f_{G_F}(Y′) holds. For a clique Q in G_O, if f_{G_F}(ϕ(Q)) < δ, the formal concept (closure(Q), ϕ(Q)) can never provide a δ-valid pseudo FC. From the monotonicity of f_{G_F}, moreover, we have f_{G_F}(ϕ(Q′)) < δ for any clique Q′ such that Q′ ⊇ Q, that is, (closure(Q′), ϕ(Q′)) is also needless.
Pruning 3: Let Q be a clique in G_O. If f_{G_F}(ϕ(Q)) < δ, then there is no need to examine any descendant of Q.
As such a monotone function f_{G_F}, an approximate coloring of vertices, for example, can give us tight upper bounds [8]. Several upper bounds have also been discussed in [9].
Algorithm. With the help of the above pruning and elimination rules, our algorithm for finding Top-N pseudo FCs is summarized in Figure 1.
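A possible realization of the feature graph G_F and of a bound usable as f_GF is sketched below (hypothetical code; correl is as in the earlier snippet). A greedy colouring of the induced subgraph upper-bounds its maximum clique size; a full implementation would still have to ensure the monotonicity under set-inclusion assumed above, for which |Y| itself is the trivial fallback:

```python
def build_feature_graph(features, rho):
    """G_F: connect two distinct features iff their correlation is at least rho."""
    adj = {f: set() for f in features}
    for f in features:
        for g in features:
            if f != g and correl(f, g) >= rho:
                adj[f].add(g)
    return adj

def coloring_bound(Y, adj):
    """Number of colours used by a greedy colouring of the subgraph induced by Y;
    every clique needs pairwise distinct colours, so this bounds the clique size."""
    colors = {}
    for f in sorted(Y):
        used = {colors[g] for g in adj[f] & set(Y) if g in colors}
        c = 0
        while c in used:
            c += 1
        colors[f] = c
    return len(set(colors.values())) if Y else 0
```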
6 Discussion
As is well known, closed itemsets [4] in a transaction database are a notion equivalent to formal concepts. Since Top-N pseudo FCs can be obtained from FCs, any frequent closed itemset miner, such as LCM [12], can be used for our Top-N δ-valid pseudo FC problem. That is, given an adequate minimum support threshold minsup, we first enumerate all frequent closed itemsets and then choose the FCs which can provide Top-N pseudo FCs. It is, however, quite difficult to guess a minsup under which we can thoroughly enumerate all seeds (FCs) of the Top-N pseudo FCs. Moreover, in case δ is relatively high, we necessarily have to set minsup lower. This causes a growth in the number of frequent closed itemsets. Such a naive method, therefore, would not always be helpful for our problem.
Given a pair of parameters, minsup and minlen, D-Miner [2] can enumerate all closed itemsets J such that sup(J) ≥ minsup and |J| ≥ minlen. It seems to be more adequate for our problem than LCM because minlen is equivalent to our parameter δ. It is, however, still required to provide an adequate minsup so that all seeds of the Top-N pseudo FCs can be extracted. The authors consider that providing N is more intuitive than providing minsup. Several approximation methods for formal concepts have been investigated. For example, in [3], the notion of αβ-concepts has been proposed. In a word, an αβ-concept can be viewed as a formal concept (a bi-set of objects and features) with a bounded number of exceptions. Along this line, an approximate formal concept has also been formalized as a δ-bi-set [5] based on the notion of δ-free itemsets, which are (approximate) condensed representations of frequent itemsets [6]. However, since they do not discriminate primary features from secondary ones, it seems to be difficult to capture clear conceptual meanings of those concepts. Furthermore, the degree of approximation is defined by some absolute number of exceptions. This implies that the degree of acceptable approximation differs widely among individual concepts. That is, under a bounded number, we may accept at most 10% of the objects as exceptions in some concept, while in another concept 50% exceptions might be allowed. Thus, the quality of those approximate concepts would be unstable. Another approximation method, based on the ideas of Rough Set Theory [1], has been proposed in [7]. For a given set of objects X, the authors define operations which can produce an upper approximation and a lower approximation of X. In the definitions, however, they never take any degree of approximation into consideration. These approximations are uniquely produced from several formal concepts in the concept lattice we are concerned with. Thus, their method is less flexible in the sense that we cannot control the degree of approximation. In [10], a set of patterns (frequent itemsets) is approximated by K colossal patterns. These colossal patterns are efficiently extracted by fusing (merging) a set of core patterns. This is called the Pattern Fusion Method. In a word, for a closed pattern X (that is, the intent of a concept C), a core pattern of X provides a concept which covers C in our sense. In the pattern fusion framework, core patterns are fused step by step in order to quickly generate longer (colossal) patterns. Although core patterns and our notion of τ-majority are defined based on the same idea, their roles are quite different in each framework. The former is used for efficient computation of colossal patterns, and the latter for providing natural and clear meanings of concepts. However, the idea of pattern fusion might be useful for further improving the efficiency of our algorithm. It would be interesting future work.
7 Experimental Result
In this section, we present our experimental results. We have tried to extract Top-N pseudo FCs from a document dataset. It is a collection of Japanese newspaper articles from 1995. Each article belongs to one of
the four categories "Politics", "International", "Economy" and "Sports". The total number of articles is 35928. As a preprocessing step, we first applied morphological analysis and then extracted nouns as feature terms, where too frequent and too infrequent nouns were removed. The number of feature terms is 2337. Our system has been implemented in C and run on a PC with a Dual-Core AMD Opteron 2222 SE processor and 32 GB of main memory.
7.1 Examples of Extracted Pseudo Formal Concepts
We have tried to solve our Top-N pseudo FC problems under several parameter settings. We briefly show here two examples of pseudo FCs actually extracted. For a Top-10 50-valid (0.77, 0.003)-pseudo FC problem, we obtained a pseudo FC whose extent consists of 9 articles. Each of the articles is a monthly report of the latest economic indicators. The set of primary features consists of 50 terms concerned with the economy, including "Stock Average", "Unemployment Rate", "Current Account", "Money Supply", "Government Bonds", "Tokyo Market", etc. Furthermore, we have 5 economic terms as secondary features, for example "Exchange" and "Performance". It should be noted here that if we are interested only in strict FCs, these terms can never be associated with the extent. However, since they are essential in economic indicators, it is reasonable to consider that the terms must be features of the concept. Thus, with our framework of pseudo FCs, we can obtain such a natural and flexible concept. The reason why "Exchange" and "Performance" could not become primary features is as follows. The term "Exchange" does not appear in one of the 9 monthly reports in the extent. In the other 8 articles, we can find it as "Exchange - Tokyo Market 99.01 JPY". However, (probably) due to space limitations, "Exchange" has been omitted in 1 article, where we see just "Tokyo Market 97.65 - 70 JPY". Although "Exchange" seems to be an adequate feature, it is excluded from the strict FC with the extent of the 9 articles. The term "Performance" appears as "Trade Performance" in all of the reports except one; in this article, we find "Trade Statistics" instead. Note here that both "Performance" and "Statistics" are unfortunately missed in the strict FC with this extent. Since both of them are important terms in the reports, preserving "Performance" is valuable. As another example, we also obtained a concept for a Top-10 15-valid (0.77, 0.001)-pseudo FC problem. It consists of an extent with 6 articles and a core intent with 15 terms. The articles are concerned with the failure of housing loan companies, well known as the "Jusen Problem" in Japan. The core intent includes "Jusen", "Financing", "Credit", etc. Furthermore, the terms "Liquidation" and "Loss" appear in our pseudo FC as secondary features. Note here again that we can never obtain these two terms in a strict FC with this extent, because each of them fails to appear explicitly in some article of the extent. However, since "Liquidation" and "Loss" seem to be significant for such a concept, our pseudo FC would be preferable to strict FCs.
7.2 Computational Performance
As mentioned in the previous section, our Top-N pseudo FCs can be extracted with any closed itemset miner in a naive way, that is, by first enumerating frequent closed itemsets (intents) and then choosing the basis of our targets from them. Assume we try to extract the pseudo FC shown above with a frequent closed itemset miner. Since the size of the extent is 9, we have to set minsup = 9 in order to obtain our target. Under the setting minsup = 9, the well-known efficient algorithm LCM [12] can enumerate all of the (strict) FCs whose extent sizes are at least 9 in 13.33 seconds. However, since such an adequate minsup cannot be determined in advance, we have to provide minsup = 1 so that our target is surely extracted. In the case of minsup = 1, LCM takes 313.95 seconds to complete its computation. Furthermore, the number of enumerated FCs is over 45 million, so the post-processing step of choosing the basis of the target can be expected to be a time-consuming task. On the other hand, our computation time for the target is 120.05 seconds for the graph constructions and just 0.23 seconds for the pseudo FC search. If we do not frequently change the parameters δ and ρ, the cost of the graph constructions can be amortized over many searches. Thus, our algorithm is also useful from the computational point of view.
8 Concluding Remarks
In this paper, we discussed the problem of finding Top-N δ-Valid (τ, ρ)-Pseudo Formal Concepts. A pseudo FC can be viewed as a natural approximation of several formal concepts and works as a representative of the formal concepts it covers as its majorities. We designed a depth-first branch-and-bound algorithm based on clique search for finding Top-N pseudo FCs. As important future work, we might need to relax the strict definition of primary feature sets, that is, core intents. In order to obtain more flexible concepts, it would be better to allow some exceptions to be included. The notion of δ-bi-sets [5] seems promising for this purpose.
References
1. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishing, Dordrecht (1991)
2. Besson, J., Robardet, C., Boulicaut, J.: Constraint-Based Concept Mining and Its Application to Microarray Data Analysis. Intelligent Data Analysis 9(1), 59–82 (2005)
3. Besson, J., Robardet, C., Boulicaut, J.: Mining Formal Concepts with a Bounded Number of Exceptions from Transactional Data. In: Goethals, B., Siebes, A. (eds.) KDID 2004. LNCS, vol. 3377, pp. 33–45. Springer, Heidelberg (2005)
4. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems 24(1), 25–46 (1999)
5. Pensa, R., Boulicaut, J.: Towards Fault-Tolerant Formal Concept Analysis. In: Bandini, S., Manzoni, S. (eds.) AI*IA 2005. LNCS (LNAI), vol. 3673, pp. 212–223. Springer, Heidelberg (2005)
6. Boulicaut, J., Bykowski, A., Rigotti, C.: Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries. Data Mining and Knowledge Discovery 7, 5–22 (2003)
7. Yao, Y., Chen, Y.: Rough Set Approximations in Formal Concept Analysis. In: Proc. of the 2004 Annual Meeting of the North American Fuzzy Information Processing Society - NAFIPS 2004, pp. 73–78 (2004)
8. Tomita, E., Kameda, T.: An Efficient Branch-and-Bound Algorithm for Finding a Maximum Clique with Computational Experiments. Journal of Global Optimization 37, 95–111 (2007)
9. Fahle, T.: Simple and Fast: Improving a Branch-and-Bound Algorithm for Maximum Clique. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 485–498. Springer, Heidelberg (2002)
10. Zhu, F., Yan, X., Han, J., Yu, P.S., Cheng, H.: Mining Colossal Frequent Patterns by Core Pattern Fusion. In: Proc. of the 23rd IEEE Int'l Conf. on Data Engineering - ICDE 2007, pp. 706–715 (2007)
11. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, Heidelberg (1999)
12. Uno, T., Kiyomi, M., Arimura, H.: LCM ver. 2: Efficient Mining Algorithm for Frequent/Closed/Maximal Itemsets. In: IEEE ICDM 2004 Workshop FIMI 2004 (2004), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-126/
13. Cheng, J., Ke, Y., Ng, W.: δ-Tolerance Closed Frequent Itemsets. In: Proc. of the 6th IEEE Int'l Conf. on Data Mining - ICDM 2006, pp. 139–148 (2006)
14. Li, A., Haraguchi, M., Okubo, Y.: Implicit Groups of Web Pages as Constrained Top-N Concepts. In: Proc. of the 2008 IEEE/WIC/ACM Int'l Conf. on Web Intelligence and Intelligent Agent Technology Workshops, pp. 190–194 (2008)
15. Haraguchi, M., Okubo, Y.: An Extended Branch and Bound Search Algorithm for Finding Top-N Formal Concepts of Documents. In: Washio, T., Satoh, K., Takeda, H., Inokuchi, A. (eds.) JSAI 2006. LNCS, vol. 4384, pp. 276–288. Springer, Heidelberg (2007)
16. Haraguchi, M., Okubo, Y.: A Method for Pinpoint Clustering of Web Pages with Pseudo-Clique Search. In: Jantke, K.P., Lunzer, A., Spyratos, N., Tanaka, Y. (eds.) Federation over the Web. LNCS (LNAI), vol. 3847, pp. 59–78. Springer, Heidelberg (2006)
17. Kanda, K., Haraguchi, M., Okubo, Y.: Constructing Approximate Informative Basis of Association Rules. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 141–154. Springer, Heidelberg (2001)
18. Okubo, Y., Haraguchi, M.: Finding Conceptual Document Clusters with Improved Top-N Formal Concept Search. In: Proc. of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence - WI 2006, pp. 347–351 (2006)
On Fixed Convex Combinations of No-Regret Learners
Jan-P. Calliess
Machine Learning Dept., Carnegie Mellon University, Pittsburgh, USA
Abstract. No-regret algorithms for online convex optimization are potent online learning tools and have been demonstrated to be successful in a wide range of applications. Considering affine and external regret, we investigate what happens when a set of no-regret learners (voters) merge their respective decisions in each learning iteration into a single, common one in the form of a convex combination. We show that an agent (or algorithm) that executes this merged decision in each iteration of the online learning process, and each time feeds back a copy of its own reward function to the voters, incurs sublinear regret itself. As a by-product, we obtain a simple method that allows us to construct new no-regret algorithms out of known ones.
1 Introduction
Regret-minimizing algorithms have been known since Hannan [12] presented the first one for repeated two-player games more than half a century ago. Regret is a measure of the quality of a sequence of decisions taken in the course of an online learning situation such as a repeated game or an online convex problem (OCP) [19]. For instance, in a repeated game, a decision could constitute a pure or a mixed strategy. Regret measures the difference in cumulative reward between the actual sequence of decisions taken and the best possible sequence one could have chosen from a predefined class. There are different types of regret that have been investigated in past years; they vary with the definition of the aforementioned class of reference action sequences (cf. [7, 11]). As perhaps the most prominent example, the external regret of a sequence of decisions is defined as the cumulative reward incurred by their execution subtracted from the cumulative reward that would have been incurred had one chosen, with the benefit of hindsight, one single optimal static solution instead. A no-regret algorithm is a procedure that learns (online) to generate a sequence of decisions incurring regret that grows sublinearly with sequence length (i.e. with increasing learning experience). No-regret algorithms have been proven to be powerful online learning tools that can distributively learn equilibrium points in multiagent game playing, planning scenarios and auctions (e.g. [3, 6, 7, 8, 11, 13]).
There are several recent works featuring the development of concrete no-regret algorithms, such as Greedy Projection [19], Lagrangian Hedging [9] or Follow the Perturbed Leader [14]. Despite more recent advances (e.g. [2]) towards an understanding of general construction methods for no-regret algorithms for online convex problems, current knowledge is still limited. This is especially true when it comes to the nature of the underlying no-regret algorithm spaces. This work aims at helping to close this gap. We show the following no-external-regret preservation property: if we fix Φ to be the set of constant mappings on the input space, a fixed convex combination of the outputs of an ensemble of no-Φ-regret learners results in a learner that again exhibits no Φ-regret (provided each member of the ensemble is fed appropriate inputs). For restrictions to affine objective functions, an analogous statement is then derived for cases where Φ is allowed to contain affine mappings (which also covers internal and linear regret). If we construe algorithms as points in a suitable space, this insight spawns the intuition that the sets of such no-Φ-regret learning algorithms suitable for the same type of problems are each convex. Consequently, our findings allow the construction of new no-regret algorithms as a combination of known ones. Although the general idea of considering weighted sums of different learning entities is far from new, the scope of the common multiplicative-weights-based ensemble learning methods (e.g. [8, 15, 17]) is significantly different from ours. That strand of work is chiefly concerned with the problem of how to adapt the weights in order to combine different votes. For instance, Freund and Schapire provided a no-regret algorithm that, as a variation of Weighted Majority [15], adaptively learned the weights of a weighted sum of pure strategies which corresponded to the voters (experts) [8]. In contrast, we consider settings where the adaptive behavior occurs only in the combined learning algorithms (solving online convex problems) while the weights are fixed. We do not focus on finding a clever procedure to combine an arbitrary set of votes or class of learning algorithms, but provide guarantees for a specific class (i.e. no-regret learners for OCPs) given constant weights.
2 Preliminaries
There are different frameworks for online learning. One of the more general and recently identified ones is online convex optimization, whose name was coined by Zinkevich [19]. He also contributed a gradient-ascent-based no-regret algorithm solving a general online convex program (OCP) that is similar to another one introduced in [10]. Learning algorithms solving online convex problems are tools applicable in problem domains not amenable to other machine learning methods and have become the subject of increasingly active research over the past years. Before proceeding, we will briefly review OCPs and the notion of no-regret, assuming an underlying maximization problem. The corresponding statements for minimizations are analogous.
2.1 Online Convex Problems
A convex programming problem can be stated as follows¹: given a convex feasible set F ⊆ R^d and a convex mapping Γ : F → R, find the optimal solution given by the optimization problem inf_{x∈F} Γ(x). If the objective function Γ determines a cost, the optimization task translates to finding a cost-optimal feasible decision. Acknowledging that γ := −Γ is concave, we can restate the problem as the maximization of a concave function γ over a convex set. That is, the problem becomes to solve sup_{x∈F} γ(x). In this context, γ is interpreted as a reward or revenue function. Since both problems are completely analogous, we will limit our descriptions to the case where our problem is stated in terms of reward maximization. Notice that this choice also affects the definitions of regret given below, but the emerging results are equivalent. In an online convex program [10, 19], a (possibly adversarial) sequence (γ_(t))_{t∈N} of concave reward functions is revealed step by step. (Equivalently, one could substitute convex cost functions.) At each time step t, the convex programming algorithm must choose x_(t) ∈ F while only knowing the past reward functions γ_(τ) and choices x_(τ) (τ ∈ {1, ..., t − 1}). After the choice is made, the current reward function γ_(t) is revealed, and the algorithm receives a revenue amounting to γ_(t)(x_(t)). Note that there is a close connection between learning in an online convex problem and learning to play repeated games. For instance, consider an individual agent playing a repeated matrix game. In each round it picks a mixed strategy as a distribution over actions and receives a reward according to its decision in return. Then the process starts over. We can model this setting as an OCP: if the local convex set F is a polytope and we interpret its corners as pure strategies, then we can construe the choice x_(t) of an interior feasible point as a mixed strategy. We then let γ_(t) be the resulting payoff function of the game such that γ_(t)(x_(t)) reflects the current expected payoff of the player in round t.
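As a concrete illustration of the protocol (a hypothetical sketch, not the paper's code), the loop below pairs an adversary that reveals concave rewards γ_t(x) = −||x − c_t||² with a simple projected gradient-ascent learner in the spirit of Greedy Projection [19]; the box feasible set, the constant step size and the feedback via gradients are all simplifying assumptions, and a decaying step size would be needed for the actual sublinear-regret guarantee:

```python
import numpy as np

class GreedyProjectionLearner:
    """Projected online gradient ascent on the box [0, 1]^d (illustrative only)."""
    def __init__(self, d, eta=0.1):
        self.x = np.full(d, 0.5)                 # current decision x_(t)
        self.eta = eta

    def decide(self):
        return self.x.copy()

    def observe(self, grad_fn):
        # grad_fn(x) is the gradient of the revealed concave reward at x;
        # take an ascent step, then project back onto the feasible box.
        self.x = np.clip(self.x + self.eta * grad_fn(self.x), 0.0, 1.0)

rng = np.random.default_rng(0)
learner, T, total_reward = GreedyProjectionLearner(d=3), 100, 0.0
for t in range(1, T + 1):
    x = learner.decide()
    c = rng.random(3)                            # adversary picks gamma_t
    total_reward += -np.sum((x - c) ** 2)        # revenue gamma_t(x_(t))
    learner.observe(lambda y, c=c: -2.0 * (y - c))
```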
2.2 No-Regret
To measure the performance of an OCP algorithm, we can compare its accumulated cost until step T to an estimate of the best cost attainable against the sequence (γ_(t))_{t=1...T}. The notion of best can be situation dependent. It could be expressed in rules such as: whenever action a ∈ F was chosen, one should have chosen φ(a) ∈ F instead, where φ : F → F originates from a predefined class Φ of mappings on the feasible set F. This idea leads to a measure called Φ-regret, R_Φ(T) := sup_{φ∈Φ} ∑_{t=1}^{T} γ_(t)(φ(x_(t))) − ∑_{t=1}^{T} γ_(t)(x_(t)) [11, 18]. An algorithm is no-Φ-regret with regret bound Δ iff ∃f ∈ o(T) ∀T ∈ N : R_Φ(T) ≤ Δ(T) ≤ f(T). The choice of the transformation class Φ leads to different types of no-regret algorithms. For instance, if Φ is chosen to be the set of all endomorphisms on F, we obtain the class of the so-called no-linear-regret algorithms [11].
¹ For detailed background regarding convex optimization cf. e.g. [4].
Perhaps the most prominent case arises if Φ is restricted to all constant transformations on F. Then, the best attainable reward corresponds to the reward gained by the best constant choice s_(T) ∈ F, chosen with knowledge of γ_(1), ..., γ_(T), i.e. s_(T) ∈ arg sup_{x∈F} ∑_{t=1}^{T} γ_(t)(x). This choice leads to a measure called external regret, R(T) := ∑_{t=1}^{T} γ_(t)(s_(T)) − ∑_{t=1}^{T} γ_(t)(x_(t)). Consequently, a no-external-regret algorithm for a maximizing OCP is defined as an algorithm that generates a sequence of feasible vectors x_(1), x_(2), x_(3), ... such that

∃Δ ∀T ∈ N :   Δ(T) + ∑_{t=1}^{T} γ_(t)(x_(t)) ≥ sup_{x∈F} ∑_{t=1}^{T} γ_(t)(x),     (1)
where Δ needs to grow sublinearly, i.e. ∃f ∈ o(T) ∀T : Δ(T) ≤ f(T). If Φ is composed of all affine functions, we could speak of no-affine-regret properties. Obviously, the set of all no-affine-regret algorithms comprises both the set of no-external-regret and the set of no-linear-regret algorithms, and may therefore be an important class to consider. In order to ensure that a no-regret algorithm can even exist in principle, it is common to introduce further restrictions on the OCP, such as requiring a compact feasible set and continuous reward functions. Doing so implies that sup_{x∈F} γ(x) exists and equals max_{x∈F} γ(x). We will assume this condition to hold throughout most of this paper.
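The external regret of a recorded run can be estimated directly from the definition; the hypothetical helper below replaces the supremum over F by a maximum over a finite grid of candidate points, which is only an approximation:

```python
def external_regret(decisions, reward_fns, candidates):
    """R(T): best cumulative reward of any constant candidate minus the reward earned.
    decisions[t] is x_(t+1), reward_fns[t] is gamma_(t+1), candidates samples F."""
    earned = sum(g(x) for g, x in zip(reward_fns, decisions))
    best_static = max(sum(g(s) for g in reward_fns) for s in candidates)
    return best_static - earned
```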
3 Convex Combinations of No-Regret Learners
Consider a society of q ∈ N agents A1, ..., Aq. Each agent Av is capable of no-external-regret learning in an online convex problem and shares the same feasible set F with its peers Aj (j ≠ v). That is: if in every time step t, each Av chooses a vector a_{v(t)} ∈ F and then observes a reward function Ω_{v(t)}, which is both additional learning experience and used to calculate the magnitude Ω_{v(t)}(a_{v(t)}) of Av's reward for round t, then we can guarantee that its external regret R_v(T) is always sublinear, i.e. ∃f ∈ o(T) : R_v(T) = max_{x∈F} ∑_{t=1}^{T} Ω_{v(t)}(x) − ∑_{t=1}^{T} Ω_{v(t)}(a_{v(t)}) ≤ f(T). The interpretation of the generated vectors a_{v(t)} is application dependent. They may constitute mixed strategies in a repeated game (e.g. [8]) or even represent plans. For instance, a_{v(t)} could be a routing plan in a network, with each vector component representing the magnitude of traffic the agent intends to send through a corresponding link (e.g. [1, 6]). Alternatively, it may be a price for an item the agent sells, or it could conceivably be a representation of a tactical decision in a game of robotic soccer [16]. Regardless of the concrete interpretation, we will refer to a_{v(t)} as a vote and to agent Av as the corresponding voter. Let A be a proxy agent faced with the problem of having to solve an online convex problem: in each time step t it has to choose an action a_(t) ∈ F and
receives a concave reward function Ω_{A(t)} in return. If A is able to consult the voters, i.e. to feed them learning experience in the form of reward functions and to receive their votes in return, is it in a position to benefit from the voters' no-regret learning capabilities? One trivial way to accomplish this is for A to choose one single Av and let her solve his own OCP: in time step t, A executes the vote a_{v(t)} recommended by the selected voter Av, and after perceiving the reward function Ω_{A(t)}, this is sent back as further learning experience to Av (i.e. she perceives Ω_{Av(t)} = Ω_{A(t)} as her reward feedback) so that she can generate a new recommendation a_{v(t+1)} in the next time step, and so on. In the robotic soccer example, this could translate to a coach who selects a single agent (e.g. a player) and leaves the tactical decision making to her from then on. However, this approach may be less than optimal. Assume the decision of which voter to select was made according to some distribution, and let p_v denote the probability that A chooses voter Av. Then A's expected reward in the first time step equals ∑_{v=1}^{q} p_v Ω_{A(1)}(a_{v(1)}). As an alternative option, A could have consulted all voters and executed a compromise ∑_{v=1}^{q} p_v a_{v(1)} of their votes. Due to concavity, Jensen's inequality (e.g. [4]) yields: ∑_{v=1}^{q} p_v Ω_{A(1)}(a_{v(1)}) ≤ Ω_{A(1)}(∑_{v=1}^{q} p_v a_{v(1)}). Thus, consulting all voters and executing the convex combination ∑_{v=1}^{q} p_v a_{v(1)} of their votes would have gained him a higher reward² than the expected reward A received in the first round otherwise. Of course, depending on the nature of the OCP, future reward functions may depend on past choices of feasible vectors. Therefore, without further assumptions it would be more involved to assess in general whether relying on a convex combination of the individual votes is necessarily a superior approach in the long run. However, in many conceivable applications of online learning, the learner's choices in past iterations will not influence the outcome of future rewards. For instance, a small investor in the stock market may choose to employ no-regret algorithms to adaptively make investment decisions for him, but his individual actions' influence on stock market quotations will be negligible.
3.1 Setup and Theorems
Let z1, ..., zq be nonnegative constants such that ∑_{v=1}^{q} z_v = 1. If each voter Av submits vote a_{v(t)}, then we will refer to a_(t) := z1 a_{1(t)} + ... + zq a_{q(t)} as their compromise decision. How well would A perform in an OCP if it always executed such a compromise decision after providing each voter with learning experience depending on its own reward function Ω_{A(t)} in each round t? Of course, this does not only depend on the individual learning algorithms each of the voters employs, but also on the learning experience they are exposed to in the course of the online process.
² Or, at least not a lower reward.
Fig. 1. Illustration of a voting situation with proxy agent A and two voters A1 and A2
We consider the following setup: in every round t, A calls each voter Av and provides him with a copy of her own reward function, i.e. she sets Ω_{Av(t)} = Ω_{A(t)}.³ Upon receiving the outputs a_{v(t)} (v = 1, ..., q) of the voters, A executes the compromise decision of these votes. (The setup is depicted in Fig. 1 for q = 2, z1 = z2 = 1/2.)
External Regret. The next theorem tells us that, as a result of this setup, A will incur sublinear external regret if A1, ..., Aq do.
Theorem 1. Let F be a convex set, and let A be a proxy agent always executing a = z1 a1 + ... + zq aq, where av ∈ F denotes the decision of agent Av (v = 1, ..., q) and z1, ..., zq are nonnegative weights such that ∑_{v=1}^{q} z_v = 1. Let Ω_{A(t)}(·) be A's concave revenue function, and for all v ∈ {1, ..., q} let Ω_{Av(t)}(·) be Av's individual revenue function for iteration t, where ∀t ∈ N ∀v ∈ {1, ..., q} : Ω_{A(t)}(·) = Ω_{Av(t)}(·). Let A1, ..., Aq each solve a maximizing online convex problem employing no-external-regret algorithms with regret bounds Δ1, ..., Δq, respectively. Furthermore, let each element of the sequence of A's revenue functions (Ω_{A(1)}, Ω_{A(2)}, ...) be concave and let A solve a maximizing online convex problem observing the elements of this sequence one by one. Then we have: A is guaranteed to incur sublinear external regret, i.e. it effectively employs no-external-regret learning. A regret bound is ∑_{v=1}^{q} z_v Δ_v.
³ Of course, it is also possible to divide the payoffs according to the weights, i.e. Ω_{Av(t)} = z_v Ω_{A(t)}, leading to slightly worse regret bounds [5].
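A minimal sketch of this setup (assuming learner objects with the decide()/observe(grad_fn) interface of the earlier GreedyProjectionLearner snippet): the proxy plays the fixed convex combination of the votes and echoes its own reward feedback to every voter.

```python
class ConvexCombinationProxy:
    """Proxy agent A of Fig. 1: executes a_(t) = sum_v z_v a_v(t) and hands every
    voter a copy of its own reward feedback (Omega_Av(t) = Omega_A(t))."""
    def __init__(self, voters, weights):
        assert abs(sum(weights) - 1.0) < 1e-9 and min(weights) >= 0.0
        self.voters, self.z = voters, weights

    def decide(self):
        votes = [v.decide() for v in self.voters]
        return sum(z * a for z, a in zip(self.z, votes))

    def observe(self, grad_fn):
        for v in self.voters:
            v.observe(grad_fn)

# e.g. merging two gradient-ascent voters with different step sizes, weights 1/2 each
proxy = ConvexCombinationProxy(
    [GreedyProjectionLearner(d=3, eta=0.05), GreedyProjectionLearner(d=3, eta=0.2)],
    [0.5, 0.5],
)
```

By Theorem 1, such a proxy inherits a regret bound no worse than the correspondingly weighted average of the voters' bounds.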
Proof. Note that Δ1, ..., Δq ≤ f(T) ∈ o(T) ⇒ ∃g ∈ o(T) : Δ := ∑_{v=1}^{q} z_v Δ_v ≤ g(T). Let (a_{v(t)})_{t∈N} denote Av's sequence of decisions generated by a no-regret algorithm. This implies (cf. Eq. 1) that ∀v, T : ∑_{t=1}^{T} Ω_{Av(t)}(a_{v(t)}) ≥ max_{a_v} ∑_{t=1}^{T} Ω_{Av(t)}(a_v) − Δ_v(T). For all T ∈ N we have:

∑_{t=1}^{T} Ω_{A(t)}(a_(t))
= ∑_{t=1}^{T} Ω_{A(t)}(z1 a_{1(t)} + ... + zq a_{q(t)})
≥⁴ ∑_{t=1}^{T} ∑_{v=1}^{q} z_v Ω_{A(t)}(a_{v(t)})
= ∑_{v=1}^{q} z_v ∑_{t=1}^{T} Ω_{Av(t)}(a_{v(t)})
≥⁵ ∑_{v=1}^{q} z_v (max_{a_v} ∑_{t=1}^{T} Ω_{Av(t)}(a_v) − Δ_v(T))
= ∑_{v=1}^{q} z_v max_{a_v} ∑_{t=1}^{T} Ω_{A(t)}(a_v) − ∑_{v=1}^{q} z_v Δ_v(T)
= (∑_{v=1}^{q} z_v max_a ∑_{t=1}^{T} Ω_{A(t)}(a)) − Δ(T)
= (max_a ∑_{t=1}^{T} Ω_{A(t)}(a)) − Δ(T).   q.e.d.

We can easily derive the analogous statement for convex cost functions and minimizing OCPs, but chose to omit such redundant considerations in order to keep the exposition concise.
Affine and Linear Regret. We will now assume that each individual voter Av incurs sublinear affine regret, i.e. it incurs sublinear Φ-regret where Φ is the class of affine mappings on the feasible set F. Furthermore, we restrict our considerations to the case where the aggregate reward function Ω_{A(t)} is affine. An example of a situation where the latter assumption holds is the case of the adversarial revenue functions considered in [6].
Theorem 2. Let F be a convex set, and let A be a proxy agent always playing a = z1 a1 + ... + zq aq, where av ∈ F denotes the decision of agent Av (v = 1, ..., q) and z1, ..., zq are nonnegative weights such that ∑_{v=1}^{q} z_v = 1. Let Ω_{A(t)}(·) be A's affine revenue function, and for all v ∈ {1, ..., q} let Ω_{Av(t)}(·) be Av's individual revenue function for iteration t, where ∀t ∈ N ∀v ∈ {1, ..., q} : Ω_{A(t)}(·) = Ω_{Av(t)}(·). Let A1, ..., Aq each solve a maximizing online convex problem employing no-regret algorithms with regret bounds Δ1, ..., Δq, respectively. Furthermore, let each element of the sequence of A's revenue functions (Ω_{A(1)}, Ω_{A(2)}, ...) be affine and let A solve a maximizing online convex problem observing the elements of this sequence one by one. Then we have: A is guaranteed to experience sublinear affine regret. A regret bound is ∑_{v=1}^{q} z_v Δ_v.
Proof. Let Φ be the set of affine mappings on the feasible set F. Δ1, ..., Δq ≤ f(T) ∈ o(T) ⇒ ∃g ∈ o(T) : Δ := ∑_{v=1}^{q} z_v Δ_v ≤ g(T). Let (a_{v(t)})_{t∈N} denote Av's sequence of decisions generated by a no-Φ-regret algorithm.
⁴ Owing to concavity.  ⁵ Due to individual no-regret learning.
Therefore all voters incur sublinear affine regret, i.e. ∀v, T : Σ_{t=1}^T Ω_Av(t)(a_v(t)) ≥ sup_{φ∈Φ} Σ_{t=1}^T Ω_Av(t)(φ(a_v(t))) − Δv(T). For all T ∈ N we have:

  Σ_{t=1}^T Ω_A(t)(a(t))
  = Σ_{t=1}^T Ω_A(t)(z1 a_1(t) + ... + zq a_q(t))
  ≥⁶ Σ_{t=1}^T Σ_{v=1}^q zv Ω_A(t)(a_v(t))
  = Σ_{v=1}^q zv Σ_{t=1}^T Ω_Av(t)(a_v(t))
  ≥⁷ Σ_{v=1}^q zv ( sup_{φ∈Φ} Σ_{t=1}^T Ω_Av(t)(φ(a_v(t))) − Δv(T) )
  = −Δ(T) + Σ_{v=1}^q sup_{φ∈Φ} zv Σ_{t=1}^T Ω_A(t)(φ(a_v(t)))
  ≥ −Δ(T) + sup_{φ∈Φ} Σ_{v=1}^q zv Σ_{t=1}^T Ω_A(t)(φ(a_v(t)))
  = −Δ(T) + sup_{φ∈Φ} Σ_{t=1}^T Σ_{v=1}^q zv Ω_A(t)(φ(a_v(t)))
  =⁸ −Δ(T) + sup_{φ∈Φ} Σ_{t=1}^T Ω_A(t)(φ(Σ_{v=1}^q zv a_v(t))).   q.e.d.

Since any linear function is also affine, we can conclude that a convex combination of no-linear-regret learners [11] results in a learner that exhibits no-linear-regret again as well (in settings with affine objective functions).
Why is it worthwhile to consider affine regret properties? Of course, affinity is generally a handy property since for affine mappings, Jensen's inequality is tight. In fact, this was explicitly leveraged in the last line of the proof of Theorem 2. On the other hand, no-affine-regret is still a quite general notion that, as mentioned above, comprises the important cases of both no-linear-regret and no-external-regret.
Unfortunately, Theorem 1 could not be stated as a corollary building upon Theorem 2 since the latter requires each member of the sequence of objective functions to be affine⁹, while the former merely assumes them to be concave.
Notice, Theorem 2 can be slightly generalized: We can consider Φ-regret where Φ is an arbitrary subset of the set of all affine mappings. The proof goes through for such cases as well. Thus, our no-regret preservation property also holds for internal regret (if the revenue functions are affine).
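To make the construction behind Theorems 1 and 2 concrete, the following Python sketch (ours, not the author's implementation; all parameter choices are illustrative) lets two voters run projected online gradient ascent, a standard no-external-regret algorithm for online convex programs, on a shared linear and hence concave reward over the box F = [−1, 1]^d, while the proxy plays the fixed convex combination of their decisions. The measured external regret of the proxy then grows sublinearly in T, as Theorem 1 guarantees.

    import numpy as np

    def project(a, lo=-1.0, hi=1.0):
        # Euclidean projection onto the feasible box F = [lo, hi]^d
        return np.clip(a, lo, hi)

    def simulate(T=5000, d=3, z=(0.3, 0.7), seed=1):
        rng = np.random.default_rng(seed)
        # each voter A_v keeps its own iterate a_v(t); start from random points in F
        voters = [project(rng.uniform(-1.0, 1.0, d)) for _ in z]
        proxy_reward, grad_sum = 0.0, np.zeros(d)
        for t in range(1, T + 1):
            c = rng.uniform(-1.0, 1.0, d)                    # Omega_t(a) = <c, a>: linear, hence concave
            a = sum(w * av for w, av in zip(z, voters))      # proxy decision a = z1*a1 + ... + zq*aq
            proxy_reward += float(c @ a)
            grad_sum += c
            eta = 1.0 / np.sqrt(t)                           # standard step size for online gradient ascent
            voters = [project(av + eta * c) for av in voters]  # every voter sees the same reward function
        best_fixed = float(np.sum(np.abs(grad_sum)))         # max over F of <sum_t c_t, a> for the box F
        print("T =", T, " external regret of proxy:", best_fixed - proxy_reward)

    simulate()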
3.2 Convexity of No-Regret Algorithm Spaces
Of course, the above result is not restricted to cases where A1, . . . , Aq are agents: Let A1, . . . , Aq be different algorithms on respective problem domains D1, . . . , Dq. (Such domains could constitute specifications of feasible sets, objective functions and oracle assumptions that are inherent to the corresponding algorithm.) Then following the above procedure is a prescription of how to construct a new learning algorithm A for domain D1 ∩ . . . ∩ Dq as a convex combination of these previously known ones. If each Av exhibits no-external-regret, Theorem 1 implies that the resulting algorithm A exhibits no-external-regret as well.
⁶ Owing to concavity.  ⁷ Due to individual no-regret learning.  ⁸ Leveraging that φ and Ω_A(t) were assumed to be affine.  ⁹ That is, both concave and convex.
In case each Av solves an OCP with affine rewards and is guaranteed to incur sublinear affine regret, then by Theorem 2 the combined algorithm A will constitute a no-affine-regret algorithm. Note, we can construe no-regret algorithms as points in a common vector space where the Abelian group operation (+) is constituted by pointwise addition of the algorithms' outputs and the scalar operation (∗) is simply pointwise multiplication with elements of a field (typically R) that comprises the range of the reward functions. In this light, our results state that the set of all no-regret algorithms of the same type¹⁰ is convex.
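This vector-space reading can be phrased directly in code: any learners exposing a decide/observe interface can be wrapped into a single learner whose output is their pointwise convex combination. The interface below is an illustrative sketch of ours, not taken from the paper, and assumes decisions support scalar multiplication and addition (e.g. numpy arrays).

    from typing import Callable, Sequence

    class ConvexCombination:
        # Wraps learners A_1, ..., A_q into one learner A whose decision is the
        # fixed convex combination of the component decisions (pointwise addition
        # and scalar multiplication as in the vector-space view above).
        def __init__(self, learners: Sequence, weights: Sequence[float]):
            assert all(w >= 0.0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
            self.learners, self.weights = list(learners), list(weights)

        def decide(self):
            return sum(w * A.decide() for w, A in zip(self.weights, self.learners))

        def observe(self, reward_fn: Callable):
            # each voter receives a copy of the proxy's reward function
            for A in self.learners:
                A.observe(reward_fn)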
4 Discussion and Future Work
This paper developed a general no-regret property regarding convex combinations of learners. For the class of no-external- and no-affine-regret learners, we established how a convex combination of a finite number of such learners (voters) can be employed to commonly solve an online convex problem in a manner that is guaranteed to incur sublinear regret, provided each of the voters does. As this paper reflects early work, many questions remain to be addressed. For example, the proofs of our theorems reveal that the convex combination of no-regret learners results in a new one whose regret bound (a guaranteed convergence rate) grows no faster than the correspondingly weighted average of the regret bounds of the voters. It may be worthwhile to explore conditions under which this weighted average is an overly conservative bound. While it is possible to describe artificial setups and construct randomized sequences of strictly convex payoff functions where the combined learner can achieve higher total reward than the best of the component learners (which can be seen by leveraging the strict version of Jensen's inequality), at this point we have yet to identify real-world problems where our convex combination approach is beneficial in practice. Also, for what kind of problems can we further speed up the actual convergence rates with suitable weight-adaptation techniques? Furthermore, the positive result in Theorem 2 had to be based on the assumption that the reward functions were affine. This raises the question of whether there provably are classes of Φ-regret algorithms for which the preservation property does not hold in general. As this work derived the insight that no-regret algorithms of the same class suitable for the same problems constitute a convex set, exploring further properties of this set may be an interesting direction for future efforts. For instance, does this set have border points? That is, are there no-regret algorithms that inherently cannot be (nontrivially) found by the construction method we presented? In conclusion, we believe the insights gained in this work may not only be of theoretical interest but also hope that they have the potential to serve as the outset for fruitful future research efforts.
¹⁰ No-external-regret or no-affine-regret.
Acknowledgements The author would like to thank Geoff Gordon for helpful discussions and the Machine Learning Department at CMU for the kind support. Both were highly appreciated.
References 1. Blum, A., Even-Dar, E., Ligett, K.: Routing without regret: on convergence to nash equilibria of regret-minimizing algorithms in routing games. In: PODC 2006: Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing (2006) 2. Blum, A., Mansour, Y.: From external to internal regret. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS, vol. 3559, pp. 621–636. Springer, Heidelberg (2005) 3. Blum, A., Kumar, V., Rudra, A., Wu, F.: Online learning in online auctions. Theor. Comput. Sci. 324(2-3), 137–146 (2004) 4. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004) 5. Calliess, J.-P.: On fixed convex combinations of no-regret learners, Tech. Report CMU-ML-08-112, Carnegie Mellon (2008) 6. Calliess, J.-P., Gordon, G.J.: No-regret learning and a mechanism for distributed multiagent planning. In: Proc. of 7th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2008) (2008) 7. Foster, D., Vohra, R.: Calibrated learning and correlated equilibrium. Games and Economic Behavior (1997) 8. Freund, Y., Shapire, R.E.: Game theory, on-line prediction and boosting. In: Sch¨ olkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS, vol. 2777. Springer, Heidelberg (2003) 9. Gordon, G.: No-regret algorithms for online convex programs. In: Advances in Neural Information Processing Systems, vol. 19 (2007) 10. Gordon, G.J.: Approximate solutions to markov decision processes, Ph.D. thesis, Carnegie Mellon University (1999) 11. Gordon, G.J., Greenwald, A., Marks, C.: No-regret learning in convex games. In: 25th Int. Conf. on Machine Learning (ICML 2008) (2008) 12. Hannan, J.: Contributions to the theory of games. Princeton University Press, Princeton (1957) 13. Jafari, A., Greenwald, A.R., Gondek, D., Ercal, G.: On no-regret learning, fictitious play, and nash equilibrium. In: ICML 2001: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 226–233 (2001) 14. Kalai, A., Vempala, S.: Efficient algorithms for online decision problems. In: Sch¨ olkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS, vol. 2777, pp. 26–40. Springer, Heidelberg (2003) 15. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. In: IEEE Symposium on Foundations of Computer Science, pp. 256–261 (1989) 16. Sahota, M.K., Mackworth, A.K., Barman, R.A., Kingdon, S.J.: Real-time control of soccer-playing robots using off-board vision: the dynamite testbed. In: IEEE International Conference on Systems, Man, and Cybernetics, pp. 3690–3663 (1995)
17. Shapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227 (1990); First boosting method 18. Stoltz, G., Lugosi, G.: Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior 59, 187–208 (2007) 19. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Twentieth International Conference on Machine Learning (2003)
An Improved Tabu Search (ITS) Algorithm Based on Open Cover Theory for Global Extremums
Kemal Yüksek and Serhat Cakaloglu
Istanbul Kultur University, Department of Computer Engineering, Istanbul, Turkiye
[email protected]
Abstract. In this study, a new improved Tabu Search (ITS) algorithm with an open-cover approach is presented for solving the global optimization of multimodal functions with continuous or discrete variables. The method uses open sets, constructed with a specific metric, that cover the wide domain of possible solutions. Instead of dealing with individual elements, these special open sets are considered. To demonstrate the speed and memory effectiveness of ITS applied to continuous global optimization, the method is tested in detail on classical multimodal functions for which the minima are known. It is pointed out that ITS combines the advantages of Tabu Search and Genetic Algorithms, so that speed, flexibility and applicability are improved. Keywords: Global extremum, Genetic Algorithm, Tabu search, open sets.
1 Introduction
The basic concept of Tabu Search (TS) as described by Glover is "a meta-heuristic superimposed on another heuristic. The overall approach is to avoid entrainment in cycles by forbidding or penalizing moves which take the solution, in the next iteration, to points in the solution space previously visited (hence "tabu")" [1],[2]. In its original form, the method performs local search for various types of problems, especially in Artificial Intelligence and Operational Research. The main advantage of TS with respect to Genetic Algorithms and Simulated Annealing (SA) lies in the intelligent use of the past history of the search to influence its future [3]. The adaptive memory designs of tabu search have also provided useful alternatives and supplements to the types of memory embodied in neural networks, allowing enhancements of neural network processes in practical settings [4],[5]. Current applications of TS span the realms of resource planning, telecommunications, VLSI design, financial analysis, scheduling, space planning, energy distribution, molecular engineering, logistics, pattern classification, flexible manufacturing, waste management, mineral exploration, biomedical analysis, environmental conservation and scores of others [6-10]. Dell'Amico and Trubian have given a good example of a very effective implementation of tabu search in the context of production scheduling [11]. Battiti and Tecchiolli presented a description of a specialized tabu search implementation that seeks to dynamically control search parameters with feedback mechanisms [12].
Recently, there have been many studies aimed at improving the performance of the TS algorithm, focusing in particular on eliminating its basic drawbacks. Some of them draw on well-known algorithms such as insertion [13], others modify the suggested structure to speed it up or to make it more efficient as far as storage is concerned, and some adapt the classic structure to match the requirements of the given problem. In several cases, the TS method and its variants provide solutions very close to optimality and are among the most effective, if not the best, approaches to tackle the difficult problems at hand. These successes have made TS extremely popular among those interested in finding good solutions to the large combinatorial problems encountered in many practical settings, such as the group shop scheduling problem, the warehouse location problem, and p-median problems [14-17]. Many problems in the real world are dynamic and come with specific constraints. Therefore, optimization methods are required to have a high degree of adaptability to variable environments and a quick search function under a limited search time. This study presents a new Improved Tabu Search (ITS) algorithm with a new neighborhood structure, proposed for solving many kinds of problems. The paper is organized in four parts. The first part defines the new terms and concepts used in the paper. The second part is devoted to developing the method and producing the pseudo-code of the proposed algorithm. In the third part, the developed algorithm is applied to example problems, a single-variable continuous function followed by a double-variable continuous function. The last section measures the success of the method by comparing it with other methods. To prevent deadlock on previously visited solutions, the TS algorithm marks as taboo every step that has led to an improvement during a certain number of iterations, depending on a pre-defined condition such as the size of the tabu list or the repetition count of the same process. The efficiency of a TS algorithm depends on the size of the neighborhood of a solution and on the number of iterations for which a step is kept taboo.
2 Improved Tabu Search (ITS) Algorithm
Many practical real-world problems can be interpreted as combinatorial optimization problems (COP). The contribution of ITS to the solution of a COP is based on terminology and theorems from basic set theory. In other words, instead of dealing with the individual points of the domain of the problem, pre-defined small sets covering the whole domain are considered. The sets in question may be dense or discrete. At this point, two questions must be answered: what is the type of these sets, and is it possible to cover the domain of the problem with them? The following definitions help answer these questions.
Definition: Let S be a subset of a metric space. The set S is open if every point in S has a neighborhood lying in the set. An open set of radius r and center x0 is the set of all points x such that ||x − x0|| < r, and is denoted N_r(x0). In one-dimensional space, such an open set is an open interval; in two-dimensional space it is a disk; in n-dimensional space it is a ball.
Considering the domain of the problem as a topological space allows working with open sets rather than individual elements.
Definition: A collection A of subsets of a space X is said to cover X, or to be a covering of X, if the union of the elements of A is equal to X. It is called an open covering of X if the elements are open subsets of X [18].
Definition: A topological space A is compact if every open cover of A has a finite subcover. In other words, if A is the union of a collection of open sets, there is a finite subcollection whose union is A.
For the ITS algorithm, if there is an open cover of a set A, it is clear that the desired point will be an element of one of the open sets in this cover. The main concern is to find this open set within a reasonable time. It should be noticed that these open sets may overlap; as a consequence, a global extremum point inside an open set can be found by approaching it from different directions. The ITS method is based on the use of prohibition-based techniques as a complement to basic heuristic algorithms. In this method, prohibition is applied not to individual elements but to open sets, or neighborhoods. In the proposed ITS method, the tabu list contains the centers of all randomly selected open sets, which are also known as populations in GA terminology. The centers and a radius are used to define the tabu regions; in a multi-variable environment these regions take the form of balls. New candidate centers are produced randomly outside the collection of balls. Once a center is determined, it is stored in the tabu list and a population is created around it. If the best solution of the population lies on the boundary of the ball, a new population is created around this element, next to the previous one, and the process continues until the best solution is no longer on the boundary of the ball but inside it. This means that the point is one of the extremums of the problem, though not necessarily the global one. At this point, new centers must be produced so that they do not lie in the regions defined by the centers stored in the tabu list. In essence, the proposed ITS method examines the solution space according to a Genetic Algorithm approach with a controlling mechanism. The ITS method is summarized in Fig. 1.
Fig. 1. The general structure of ITS Algorithm
2.1 Pseudocode of the ITS Method
The following C-like code shows the execution of the proposed algorithm from a programming point of view. In this code, it is assumed that the objective function is to be minimized within certain bounds.

input populationSize;
input iterationNumber;
input shiftAmount;

find_extremum (population){
    minPoint = find_min_Neighborhood(population[0]);
    if (minPoint == population[0]){
        add_Tabu_List(population[0]);
        return population[0];
    }
    else{
        do{
            population = populate_Population(minPoint);
            minPoint = minimum(population);
            add_Tabu_List(population[(populationSize+1)/2]);
        } while (minPoint == population[populationSize]);
        return minPoint;
    }
}

main(){
    population[0] = random();
    extremumBase = find_extremum(population);
    add_Tabu_List(extremumBase);
    add_Extremum_List(extremumBase);
    for (counter = 0; counter < iterationNumber; counter++){
        for (shiftingCounter = 0; shiftingCounter < shiftAmount; shiftingCounter++){
            population[0] = random();
            if (!control_in_TabuList(population[0])){
                extremumShifted = find_extremum(population);
                if (extremumShifted < extremumBase){
                    extremumBase = extremumShifted;
                    add_Extremum_List(extremumBase);
                    break;
                }
            }
        }
        if (shiftingCounter == shiftAmount)
            exit();
    }
}
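As a concrete reading of the tabu-region bookkeeping described above, the following Python sketch (ours; all function and parameter names are invented for illustration) keeps the tabu list as a list of ball centres with a common radius and draws new candidate centres until one falls outside every recorded ball:

    import math
    import random

    def in_tabu_region(point, tabu_centers, radius):
        # a point is tabu if it lies inside any recorded open ball N_r(center)
        return any(math.dist(point, c) < radius for c in tabu_centers)

    def new_center(bounds, tabu_centers, radius, max_tries=1000):
        # draw uniform random candidates until one lies outside every tabu ball
        for _ in range(max_tries):
            cand = [random.uniform(lo, hi) for lo, hi in bounds]
            if not in_tabu_region(cand, tabu_centers, radius):
                return cand
        return None  # the search space is (almost) covered by the tabu balls

    # example: 2-D domain [-10, 10] x [-10, 10] with two tabu balls of radius 1.5
    tabu = [[5.0, 5.0], [-6.0, 6.0]]
    print(new_center([(-10.0, 10.0), (-10.0, 10.0)], tabu, 1.5))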
3 Computational Experiments
The computational success of the method has been explored using two simple mathematical functions. The reason for this selection is to show the execution of the method in a simple manner. More complex problems, like p-median problems, have also been solved with this method.
3.1 Single-Variable Continuous Function
In a first attempt, the algorithm has been applied to a function with a single decision variable. The single-variable continuous function manipulated in the procedure is given in Equation (1):

ƒ(x) = (x − 3) · (e^x + 9) · sin(x² + 2)    (1)
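For reference, Equation (1) is easy to evaluate directly; the short Python sketch below (ours) tabulates it on an assumed interval such as [−10, 10], since the paper does not state the search bounds explicitly, to locate the region of the global minimum that ITS is expected to find.

    import math

    def f(x):
        # Equation (1): f(x) = (x - 3) * (exp(x) + 9) * sin(x**2 + 2)
        return (x - 3.0) * (math.exp(x) + 9.0) * math.sin(x * x + 2.0)

    # brute-force grid reference on an assumed domain [-10, 10]
    xs = [-10.0 + i * 0.001 for i in range(20001)]
    x_best = min(xs, key=f)
    print("grid minimum near x =", round(x_best, 3), ", f(x) =", round(f(x_best), 3))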
Figure 2 shows the initial state of the algorithm, before it finds the corresponding extremums located in the solution space. At the beginning of our proposed ITS Algorithm, the first population is created randomly to reach a first extremum, which is a candidate local or global minimum. In Figure 3, the algorithm reaches the first extremum using our proposed technique of approaching an
extremum. After this examination, the shifting procedure is applied to the current state to find a new point which gives a better solution than the previous one. This shifting procedure is continued until a better solution is achieved. Nevertheless, a local minimum may be reached even though it is not better than the previous one; because the spanning procedure is applied for each shifting step, it is possible for a local minimum to be reached in this operation. In Figure 4, the third better solution is found, and the visited pre-solutions are kept track of in our proposed Tabu Neighborhoods structure. In this structure, we propose to keep the neighborhoods of previously created populations in order to avoid examining these regions again. The examination can still penetrate these regions by approaching from outside the sphere of the neighborhood structure; thereby, particular points giving better solutions can be found even though they have been noted in the neighborhood structure previously. At the end of the algorithm, shown in Figure 5, the global extremum is reached and the final shifting procedure begins. This final shifting operation is continued until the shifting amount determined at the beginning of the algorithm is exceeded.
3.2 Double-Variable Continuous Function
After applying the ITS Algorithm to the single-variable continuous function, the proposed algorithm is also applied to a double-variable continuous function to observe its behavior. In this study, a piecewise function containing two decision variables is manipulated. As in the previous experiment on the single-variable continuous function, the algorithm is expected to reach the optimum solution quickly, to collect the alternative solutions (the local minima), and to span the whole search space as far as possible. The double-variable continuous function chosen for this procedure is detailed in Equation (2):

ƒ(x, y) =
    (x − 5)² + (y − 5)² − 20,    if 0 ≤ x ≤ 10 and 0 ≤ y ≤ 10
    (x + 6)² + (y − 6)² − 10,    if −10 ≤ x < 0 and 0 < y ≤ 10
    (x − 6)² + (y + 6)²,         if 0 < x < 10 and −10 ≤ y < 0
    (x + 4)² + (y + 4)²,         if −10 < x < 0 and −10 < y < 0       (2)
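Equation (2) translates directly into code; the sketch below (ours) implements the four quadrant branches and can be used to reproduce the four extremum regions visible in Fig. 6.

    def f2(x, y):
        # Equation (2): a piecewise quadratic with one bowl per quadrant
        if 0 <= x <= 10 and 0 <= y <= 10:
            return (x - 5) ** 2 + (y - 5) ** 2 - 20
        if -10 <= x < 0 and 0 < y <= 10:
            return (x + 6) ** 2 + (y - 6) ** 2 - 10
        if 0 < x < 10 and -10 <= y < 0:
            return (x - 6) ** 2 + (y + 6) ** 2
        if -10 < x < 0 and -10 < y < 0:
            return (x + 4) ** 2 + (y + 4) ** 2
        raise ValueError("point outside the domain of Equation (2)")

    # values at the four bowl centres: -20, -10, 0, 0
    print(f2(5, 5), f2(-6, 6), f2(6, -6), f2(-4, -4))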
Figure 6 shows the appearance of this piecewise function, which has four extremum points. As mentioned in the single-variable continuous function section, the algorithm starts by randomly generating a first population. The aim of this procedure is to find the global minimum of the function and as many local minimums as it can. In the following figures, the extremum points that the algorithm attempts to find are represented as black zones. In Figure 7, the
Fig. 2. Initial Function
Fig. 3. First Extremum point
Fig. 4. Third Extremum point
Fig. 5. The end of the algorithm
Fig. 6. The 3D View of Double Variable Continuous Function
first extremum is reached and the neighborhood structures visited while the algorithm approaches this extremum are marked. After this operation, the first shifting procedure for this case begins, in order to find a better solution than the current
Fig. 7. Initial Function
Fig. 8. The first extremum point
Fig. 9. Second extremum is found
Fig. 10. The global extremum is found and the shifting begins
solution. In fact, as mentioned before, local minimums can be reached with this shifting procedure. Through this shifting operation, the algorithm can find a better solution than the current one and approach a new extremum by using the direction-finding structure of our proposed algorithm. Figure 8 represents the initial state of finding a new, better solution through the shifting operation. The following figure, Figure 9, shows that the algorithm has reached a new extremum point using the direction-finding structure of our proposed algorithm. The last figures, Figure 10 and Figure 11, show the end of the algorithm. At the final stage, the shifting operation continues to search the solution space for better solutions until the shifting amount determined at the beginning of the algorithm is exceeded. These applications of our proposed algorithm indicate clear benefits for the problem-solving routine. At the outset, the goals were to handle the drawbacks of the Tabu Search Algorithm and the Genetic Algorithm, as mentioned before. In these applications, the algorithm approaches extremums quickly by using its well-designed mechanisms, such as the direction-finding mechanism for reaching an extremum. The shifting procedure, which is based on the search-space spanning idea of the Genetic Algorithm, provides a well-controlled searching
Fig. 11. The end of the algorithm
mechanism on the solution space. In our proposed ITS Algorithm, the neighborhood structure is used to control the spanning operation on the search space; hence, candidate best solutions are not missed, which is a drawback of the Genetic Algorithm. In essence, the neighborhood structure and the Tabu Neighborhoods structure distinguish the new algorithm from the Tabu Search Algorithm. In fact, the ITS Algorithm uses a sphere-shaped neighborhood structure which is kept in the Tabu Neighborhoods according to the core of each population.
4 Comparative Study
After collecting results from the applications based on the ITS Algorithm, the results are used to measure the efficiency of our proposed method. The ITS Algorithm provides results that are very close to the actual solutions obtained with a step-by-step approach. Furthermore, in the corresponding experiments the ITS Algorithm achieves high sensitivity by using a very small tolerance number. This tolerance number determines the size of the spanning amount and the diameter of the neighborhood area; for this reason, highly sensitive results can be achieved. In fact, the ITS Algorithm provides better results than the step-by-step approach because the latter is based on integer point propagation. For reaching the best optimum solution, the ITS Algorithm is much faster than the step-by-step approach because of the shifting operator. The speed of approaching the best solution appears reasonable compared with other methods on problems of very large size and with a large number of extremum points. In addition, the ITS Algorithm ensures, with the help of the shifting procedure, that the better extremums in the whole search space are visited. In the last experiments, we used well-known objective functions to show the efficiency of our proposed algorithm. In reference [19], Tables 1 and 2, the algorithms DOPE, ECTS and ESA were tested using 13 functions; the efficiency of these tests is based on function evaluations. In our experiments, we applied our proposed algorithm to these objective functions and report its efficiency as the number of iterations (an iteration refers to a main iteration of the algorithm, which begins with a random new point) for each experiment, as shown in the following table.
Table 1. Results with some test functions

Objective function    Average number of iterations    Average error
rcos                  216                             0.000001
easom                 326                             0.6999952
goldstein             28                              0.0000470
shubert               25                              0.0003
rosenbrock 2          787                             0.0066016
zakharov 2            139                             0.00000003
dejoung               197                             0.0000003
hartman 3             118                             0.0000225
rosenbrock 5          1466                            0.4046113
zakharov 5            375                             0.0000006
hartman 6             51                              0.0342450
rosenbrock 10         2053                            0.4167664
zakharov 10           626                             0.0000010
5 Conclusions
Improved Tabu Search (ITS) is a powerful algorithmic approach that has been applied with great success to many difficult combinatorial problems. All meta-heuristics need to achieve both depth and breadth in their searching process; depth is usually not a problem for TS, which is quite aggressive in this respect, but breadth can be a critical issue. To handle this, it is extremely important to develop an effective diversification scheme. ITS-type algorithms form an emerging framework in computer programming that could challenge very sophisticated algorithms in the near future. The method can also be adapted to problems which involve root finding. A particularly nice feature of ITS is that, like all approaches based on Local Search, it can quite easily handle the "dirty" complicating constraints that are typically found in real-life applications. We believe that the proposed ITS algorithm can be used for the detection of promising areas by several processors working in parallel. Moreover, it can be used to solve many difficult combinatorial optimization problems efficiently. The algorithm could be further sped up by constraining the random point selection, isolating the visited open sets.
References 1. Glover, F.: Tabu Search: part I. ORSA Journal on Computing 1, 190–206 (1989) 2. Glover, F.: Tabu Search: part II. ORSA Journal on Computing 2, 4–32 (1990) 3. Youssef, H., Sait, S.M., Adiche, H.: Evolutionary algorithms, simulated annealing and a tabu search: a comparative study. In: Engineering Applications of Artificial Intelligence, vol. 14 (2001)
4. Glover, F., Laguna, M.: Tabu Search. In: Reeves, C.R. (ed.) Modern Heuristic Techniques for Combinatorial Problems, pp. 70–150. Blackwell, Malden (1993) 5. de Werra, D., Hertz, A.: Tabu Search Techniques: A Tutorial and an Application to Neural Networks. OR Spektrum 11, 131–141 (1989) 6. Laguna, A.M., Barnes, A.J.W., Glover, A.F.: Tabu Search Methodology for a Single Machine Scheduling Problem. J. of Int. Manufacturing 2, 63–74 (1991) 7. Porto, A.S.C.S., Ribeiro, A.C.C.: A Tabu Search Approach to Task Scheduling on Heterogeneous Processors under Precedence Constraints. International Journal of HighSpeed Computing 7(2) (1995) 8. Crainic, T.G., Gendreau, M., Farvolden, J.M.: Simplex-based Tabu Search for the Multicommodity Capacitated Fixed Charge Network Design Problem. INFORMS Journal on Computing 12, 223–236 (2000) 9. Osman, I.H.: Metastrategy Simulated Annealing and Tabu Search Algorithms for the Vehicle Routing Problem. Annals of Operations Research 41, 421–451 (1993) 10. Gendreau, M., Guertin, F., Potvin, J.-Y., Taillard, É.D.: Parallel Tabu Search for RealTime Vehicle Routing and Dispatching. Transportation Science 33, 381–390 (1999) 11. Dell’Amico, M., Trubian, M.: Applying Tabu Search to the Job-Shop Scheduling Problem. Annals of Operations Research 41, 231–252 (1993) 12. Battiti, R., Tecchiolli, G.: The Reactive Tabu Search. ORSA Journal on Computing 6(2), 126–140 (1994a); A description of a specialized tabu search implementation that seeks to dynamically control search parameters with feedback mechanisms 13. Fang, Y., Liu, G., He, Y., Qiu, Y.: Tabu search algorithm based on insertion method. In: IEEE Int. Conf. Neural Network & Signal Processing, Nanjing,China, December 14-17 (2003) 14. Hajji, O., Brisset, S., Brochet, P.: A New Tabu Search Method for Optimization With Continuous Parameters. IEEE Transactions on Magnetics 40(2), 1184–1187 (2004) 15. Liu, S.Q., Ong, H.L., Ng, K.M.: A fast tabu search algorithm for the group shop scheduling problem. Advances in Engineering Software, 1–5 (2005) 16. Michel, L., Hentenryck, P.V.: A simple tabu search for warehouse location. European Journal of Operational Research, 576–591 (2004) 17. Rolland, E., Schilling, D.A., Current, J.R.: An efficient tabu search procedure for the pMedian Problem. European Journal of Operational Research, 329–342 (1996) 18. Munkres, J.R.: Topology, pp. 164–170 (2000) 19. Franze, F., Speciale, N.: A tabu-search-based algorithm for continuous multiminima problems. International Journal For Numerical Methods in Engineering, 665–680 (2001)
The Needles-in-Haystack Problem
Katherine Moreland¹ and Klaus Truemper²
¹ The MITRE Corporation, McLean, VA 22102, U.S.A.
² Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A.
Abstract. We consider a new problem of detecting members of a rare class of data, the needles, which have been hidden in a set of records, the haystack. The only information regarding the characterization of the rare class is a single instance of a needle. It is assumed that members of the needle class are similar to each other according to an unknown needle characterization. The goal is to find the needle records hidden in the haystack. This paper describes an algorithm for that task and applies it to several example cases. Keywords: Logic, Classification, Feature Selection.
1 Introduction
We define a new data mining problem called the needles-in-haystack problem. A collection of vectors of length n called the haystack is given. In addition, a single vector of length n called a needle is provided. A few of the vectors in the haystack are similar to the needle vector according to an unknown relationship involving a subset of the vector attributes. An oracle is available which accepts any vector from the haystack records and tells whether or not it is a needle. The objective is to identify all hidden needle vectors within the haystack while minimizing the number of calls made to the oracle. An example of the problem occurs in the area of helicopter maintenance. Each helicopter of a fleet is represented by a vector of attributes containing measurements taken during flights. One day, one of the helicopters suffers a catastrophic failure. It is conjectured that a few of the other helicopters may be close to a similar failure. However, the attributes that predict such failure are presently unknown. The goal is to identify which other helicopters in the fleet may be susceptible to such a failure. On the surface, the needles-in-haystack problem is similar to the outlier detection problem [2,12,17,18]. However, outliers do not necessarily exhibit common features, while the needles are assumed to share a common, unknown characteristic. The needles-in-haystack problem also may seem similar to the task of separating a very small set of records from a much larger one [8,12,15,16,20,21]. But here all members of the small set save one are unknown.
1.1 Problem Structure
We focus on a particular version of the needles-in-haystack problem satisfying the following two assumptions.
Assumption 1. The unknown needle characterization can be described using a logic conjunction which contains only a small number of literals. The attributes giving rise to these literals are not known a priori.
Assumption 2. The logic conjunction is based on a discretization involving only one cutpoint for each attribute.
1.2 Summary of Algorithm
The solution algorithm is iterative. At the onset of each iteration there are k given needles, h haystack records, and l attribute subsets which in prior iterations led to needle candidates that were identified by the oracle as non-needles. When the algorithm begins, k = 1, h = n (the total number of haystack records), and l = 0. Let H denote the current haystack. For each of the k needles, several artificial needles, which look similar to the k needles on hand, are constructed as follows. For each attribute of the data set, the variance is estimated using the haystack records. Using the estimated standard deviation, σ, and a parameter α, we define a width w by

w = α · σ    (1)
For each of the known needles, we carry out the following step. We define an interval for each attribute centered at the attribute value and having width w. Using the continuous (resp. discrete) uniform distribution if an attribute is continuous (resp. discrete), we randomly create several artificial needles. The artificial needles are added to the set of k needles to produce a set S. In the solution process, we invoke a separation algorithm that separates S from H. The algorithm creates an ensemble of classifiers which in turn produce a vote total ranging from −40 to 40 for each record of H. Details are included in Section 3. Generally, the records of S produce a vote total near 40, while almost all records of H result in a vote total near −40. Indeed, records of H with a vote total well above −40 may be needles. By enforcing a threshold, we declare all records of H with a vote total above the threshold to be hidden needle records. This simple approach works well when the data sets are randomly generated. However, when real-life data sets are used, this method performs quite poorly. We improve upon the method as follows. After sets S and H have been constructed, we discretize them using a rather complicated process that also determines candidate attribute sets. For each of these candidate attribute sets, we call the separation algorithm to separate set S from H as described previously. The record from H with the highest vote is selected as a candidate for testing with the oracle. If the record is confirmed to be a needle, it is added to the set of k needles and the process continues iteratively, now with k + 1 given needles,
h − 1 haystack records, and l attribute subsets. If the record is a non-needle, the attribute configuration which led to the selection of this non-needle is stored, l is incremented, and the algorithm continues with the next candidate attribute set. The algorithm terminates if all candidate attribute sets have been exhausted without identifying any additional hidden needles.
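A minimal sketch of the artificial-needle step described above (our illustration; the paper's actual implementation is not given and the helper names are invented): for each known needle, every attribute is perturbed uniformly inside an interval of width w = α·σ centred at the needle's value, with σ estimated per attribute from the haystack.

    import numpy as np

    def artificial_needles(needles, haystack, alpha=0.1, per_needle=5, seed=0):
        # needles: (k, n) array of known needles; haystack: (h, n) array of haystack records
        rng = np.random.default_rng(seed)
        sigma = haystack.std(axis=0)       # per-attribute standard deviation from the haystack
        w = alpha * sigma                  # interval width of Eq. (1)
        fakes = []
        for needle in needles:
            lo, hi = needle - w / 2.0, needle + w / 2.0
            fakes.append(rng.uniform(lo, hi, size=(per_needle, needle.shape[0])))
        return np.vstack(fakes)            # artificial needles to be added to the set S

    # toy usage with continuous data (a discrete attribute would use a discrete uniform instead)
    data = np.random.default_rng(1).normal(size=(200, 6))
    print(artificial_needles(data[:1], data[1:], alpha=0.1, per_needle=5).shape)   # (5, 6)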
2 Discretization
Recall that the discretization step not only discretizes the data, but also produces candidate attribute subsets that potentially provide the correct attributes needed for the characterization of the needles. Two facts are exploited by the algorithm to accomplish this task. First, needles are known to be rare. In addition, Assumption 1 guarantees that the unknown logic conjunction characterizing the needles contains few literals. The discretization method is described next.
2.1 Attribute Pairs
Define an attribute that is used in the unknown needle characterization to be a needle attribute. Otherwise, the attribute is a non-needle attribute. Suppose the needle attributes were given. For any pair of attributes, the following possible scenarios exist: (1) both attributes are needle attributes, (2) exactly one attribute is a needle attribute, or (3) both attributes are non-needle attributes. Consider the values of one such attribute pair plotted in the plane with one attribute on the x-axis and the other on the y-axis. Suppose we have a cutpoint for each of the two attributes. These two cutpoints define four disjoint quadrants in the plane. Each record of the data set falls into one of the quadrants. If the cutpoints are correct for computation of the unknown needle characterization, all given and hidden needles fall within the same quadrant. Since the total number of needle records is known to be small, we expect the quadrant containing the given and hidden needles to be sparsely populated compared to other quadrants. For example, consider two needle attributes, x and y, with values ranging from 0 to 10. The cutpoint for attribute x is at 4.0 while the cutpoint for attribute y is at 5.0. Let there be k = 2 given needles. Suppose the lower right quadrant defined by these cutpoints contains four points, two of which are the given needles. This case is depicted in Scenario C of Figure 1. The lower right quadrant is very sparsely populated compared to the other three quadrants. Since it contains all given needles and few additional points, any one of the additional points may be a hidden needle record. Now consider the case of a needle attribute paired with a non-needle attribute. The cutpoint of the non-needle attribute is not required to characterize the needles. Assuming the needle attribute corresponds to the y-axis, using only the needle attribute cutpoint produces two horizontal regions instead of quadrants. For example, in Scenario A of Figure 1 the needle attribute y has the cutpoint 4.0. Suppose the given needles fall within the lower region. This region is sparsely populated compared to the upper region and therefore any one of the additional
Fig. 1. The graphs illustrate the four possible scenarios for cutpoint selection. Scenarios A and B show examples of a horizontal and vertical band, respectively, when only one attribute in the pair is a needle attribute. Scenario C illustrates an example of both attributes being needle attributes, which yields a sparsely populated quadrant. Scenario D shows an example of neither of the attributes being needle attributes, assuming that the points of the given needles are near the center of the displayed region.
points may be a needle record. Scenario B of Figure 1 shows the analogous case where x is the needle attribute with cutpoint 3.0. This produces two vertical regions with the rightmost region containing the given needles. For the final case of two non-needle attributes, for any pair of cutpoints, either the given needles do not fall within the same quadrant or they fall within a densely populated quadrant. In either case, the two attributes likely are not useful for characterizing the needles. Scenario D of Figure 1 depicts such a case, assuming that the k known needles are near the center of the displayed region. Since the needle attributes are actually unknown, we estimate for each attribute pair which of the aforementioned scenarios applies. Details are provided next.
2.2 Cutpoint Selection
Consider the values for the two attributes of an attribute pair plotted in the (x, y)-plane. Define R0 to be the smallest axis-parallel rectangle of the (x, y)-plane that contains the known needles and the points of the haystack. Define another rectangle R1 to be the smallest possible rectangle that encloses all known needles. We define a box to be the smallest rectangle that contains one of the
corner points of R0 and the rectangle R1. There are four such boxes. Define a band to be the smallest rectangle that contains one side of R0 and the rectangle R1. There are four such bands. All points contained within a box or band are considered to be potential needles. We want to ensure that the boxes and bands do not contain too many points since needles are known to be rare. Let e be the maximum number of hidden needles expected in the data set, and assume there are k given needles. For a box/band with p points to be acceptable, we impose the following condition using a parameter β:

p ≤ β(k + e)    (2)

By definition, boxes/bands contain all given needles. Since we want to use boxes/bands to identify additional hidden needles, we only consider boxes/bands that contain at least one additional point:

p ≥ k + 1    (3)
Boxes/bands meeting these criteria are stored as candidate boxes/bands.
2.3 Box/Band Comparisons
We need a way to compare two boxes/bands of any two attribute pairs so that we can determine the attribute pairs that are most likely part of the characterization of the needles. Let p denote the number of points in a box/band and B denote the area of the box/band containing the k needles. Define A to be the area of the smallest axis-parallel rectangle of the (x, y)-plane containing the known needles and the points of the haystack. That is,

A = (max x − min x)(max y − min y)    (4)
A value v is computed for each box which describes how sparsely populated the region is; the smaller the value, the more likely the box is useful for construction of the needle characterization:

v = p · A / B    (5)

Two types of bands may be created from a given box. The horizontal band only uses the y-attribute cutpoint while the vertical band only uses the x-attribute cutpoint. A v value is also calculated for these horizontal and vertical bands, using a formula analogous to that of (5). The v values of the bands are called vx and vy. If a box is useful, the difference between its v value and the vx and vy values of the corresponding bands should be large. Let the distance d between the values be measured as follows:

d = min{vx, vy} − v    (6)
The box with the highest d value is selected as the best representative box for the attribute pair. The band with smallest vx or vy value, whichever applies, is chosen as the best representative band for the attribute pair.
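The scoring of a candidate box can be written down compactly. The following sketch (ours, with invented helper names) computes the v values of (5) and the distance d of (6) for one attribute pair, given the point counts and the areas of the box, its two bands, and the enclosing rectangle:

    def sparsity_value(p, region_area, A):
        # Eq. (5): v = p * A / B, where B is the area of the box or band
        return p * A / region_area

    def box_score(p_box, area_box, p_hband, area_hband, p_vband, area_vband, A):
        v = sparsity_value(p_box, area_box, A)        # box: uses both cutpoints
        vy = sparsity_value(p_hband, area_hband, A)   # horizontal band: y-cutpoint only
        vx = sparsity_value(p_vband, area_vband, A)   # vertical band: x-cutpoint only
        d = min(vx, vy) - v                           # Eq. (6): large d means the box is informative
        return v, vx, vy, d

    # toy numbers: a box of area 6 holding 4 points inside a plane of area A = 100,
    # bands of areas 20 and 30 holding 9 and 11 points respectively
    print(box_score(4, 6.0, 9, 20.0, 11, 30.0, 100.0))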
3 Construction of Candidate Attribute Sets
We construct a graph G. Each node of G corresponds to an attribute. An edge connects two nodes x and y if the best representative box for x and y has positive d value. The nodes of G corresponding to the attributes in the as-yet-unknown characterization of the needles likely define a clique (= complete subgraph) of G with, say, m nodes. Accordingly, we use the cliques of G to define candidate attribute subsets for the iterative algorithm. Generally, any appropriate method may be employed to find the cliques of G. In our implementation, we limit m to 3 and apply direct enumeration. The needle detection algorithm sorts the cliques that were found in decreasing order of their average d value, and declares the attributes corresponding to the node sets of the cliques, in the sorted order, to be the candidate attribute sets. The evaluation of each attribute set, as described in Section 1.2, can be carried out by any separation algorithm as long as the algorithm also identifies the haystack records which cannot be separated, as these records are candidates for being hidden needle records. For candidate separation algorithms, see for example [1,3,4,5,6,7,9,10,11,13,14,19]. We have elected to use the Lsquare algorithm of [13,14]. The Lsquare algorithm produces vote totals ranging from −40 to 40 for all records of the data set by creating an ensemble of classifiers. Based on [13], Lsquare also computes two probability distributions for the vote totals that may be used to estimate the probability that classification based on the vote total is correct. In the specific case here, a −40 vote total for a haystack record signifies that the record likely is not a needle. As the vote total increases from −40, the record is less likely to be a haystack record, and thus may well be one of the hidden needle records.
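Since m is limited to 3 and direct enumeration is used, the candidate attribute sets can be produced with a few nested loops. The sketch below (ours; data structures are invented for illustration) enumerates the edges and triangles of G and sorts them by average d value, mirroring the procedure just described:

    from itertools import combinations

    def candidate_attribute_sets(n_attrs, d_value):
        # d_value[(i, j)]: d score of the best representative box of the attribute pair (i, j);
        # a pair is an edge of G only if its d value is positive
        edges = {frozenset(p): d for p, d in d_value.items() if d > 0}
        cliques = [(d, tuple(sorted(e))) for e, d in edges.items()]          # 2-cliques
        for i, j, k in combinations(range(n_attrs), 3):                      # 3-cliques
            trio = [frozenset((i, j)), frozenset((j, k)), frozenset((i, k))]
            if all(e in edges for e in trio):
                cliques.append((sum(edges[e] for e in trio) / 3.0, (i, j, k)))
        cliques.sort(reverse=True)                                           # decreasing average d
        return [attrs for _, attrs in cliques]

    # toy example with 4 attributes and three informative pairs
    print(candidate_attribute_sets(4, {(0, 1): 2.5, (1, 2): 1.0, (0, 2): 0.7, (2, 3): -0.4}))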
4 Computational Results
For testing we used sets of the UC Irvine Machine Learning Repository as well as a data set supplied by D. Th´evenin of the University of Magdeburg in Germany. Of the 11 most popular data sets from the repository, we selected the Heart Disease, Iris, and Wine sets since they are of reasonable size and mainly have continuous-valued attributes. The data set from the University of Magdeburg is a fluid dynamics data set called Optim which has continuous-valued attributes. Table 1 summarizes the data sets. We make sure that each case satisfies Assumption 1 and Assumption 2 by selecting needles from the specified set as follows. For a given case, let set A contain all records matching the needle class value while the B set contains the records with the other class values. The Lsquare method is called to obtain a separating formula for the two sets. The first clause in the separating formula is chosen to be the underlying needle relationship. Six records of set A receiving the highest possible vote of 40 are retained as they are well-separated from the B set using the selected clause. Likewise, the B records with the lowest possible vote of −40 are declared to be the non-needle records. The haystack is composed of
Table 1. Summary of Data Sets

Data Set       No. of Rec's   No. of Attr's   Needle Records              Non-needle Records
HeartDisease   303            14              Class = 0                   Class > 0
Iris-1         150            5               Class = 1                   Class > 1
Iris-2         150            5               Class = 2                   Class = 1 or 3
Iris-3         150            5               Class = 3                   Class < 3
Wine-1         178            14              Class = 1                   Class > 1
Wine-2         178            14              Class = 2                   Class = 1 or 3
Wine-3         178            14              Class = 3                   Class < 3
Optim-1        60             9               Low value for 1st target    High value for 1st target
Optim-2        60             9               Low value for 2nd target    High value for 2nd target
Optim-3        60             9               Low value for 3rd target    High value for 3rd target
Optim-4        60             9               Low value for 4th target    High value for 4th target
all needles save one and all non-needles. An exception is the Optim case, where only four needle records could be derived. Any case not satisfying both Assumption 1 and Assumption 2 is not evaluated. For this reason, results for the Iris-2 and Wine-1 data sets are not shown. Table 2 shows the results for the remaining cases. The number of iterations required to detect the 1st, 2nd, 3rd, 4th, and 5th hidden needles are given in the table for each of the cases. For example, all but the third hidden needle of the Wine-3 case was identified on the first try. The third hidden needle took a total of three iterations to be identified by the algorithm. This means the algorithm identified two records which were declared by the oracle to be non-needles before correctly identifying the hidden needle. In all runs, the parameters α of (1) and β of (2) are selected as α = 0.1 and β = 1.5.

Table 2. Needle Detection Results

               Number of Runs to Detect Needle Number
Case           One     Two     Three   Four    Five    Total Runs
HeartDisease   1       1       1       1       1       5
Iris-1         1       1       1       1       1       5
Iris-3         4       1       3       1       9       18
Wine-2         1       1       1       1       10      14
Wine-3         1       1       3       1       1       7
Optim-1        5       1       1       n/a     n/a     7
Optim-2        1       1       1       n/a     n/a     3
Optim-3        1       1       1       n/a     n/a     3
Optim-4        1       1       1       n/a     n/a     3
Average        1.78    1.0     1.44    1.0*    4.4*
Cum Avg        1.78    2.78    4.22    5.22*   9.62*
* Values do not include cases Optim-1 - Optim-4

On average, the algorithm detects the first hidden needle in 1.78 tries. The second hidden needle is detected on the first attempt without
identifying any non-needles. The fifth and final hidden needle is the most difficult for the algorithm to detect and on average involves 4.4 tries. Overall, the algorithm on average makes 1.75 calls to the oracle to find one needle. Consider again the problem of helicopter maintenance. Assume that a fleet consists of 100 helicopters, and that the above performance results apply. Then, identifying 2 or 3 helicopters that may be susceptible to failure would require, on average, checking about 4 helicopters.
5 Summary
This paper introduces the needles-in-haystack problem in which a small number of needle records are hidden among haystack records and are to be found. As a guide for the search, just one needle is given. The solution algorithm proposed here creates artificial needles to help characterize the needle class. It also discretizes the needle and haystack records by a particular method that also identifies candidate attribute subsets that are likely useful in characterizing the needles. The algorithm separates the needle and haystack records using only the attributes of candidate attribute subsets. Records in the haystack which are not readily separated from the needle class are candidates for the hidden needles, and an oracle is called to determine whether they belong to the needle class. The algorithm is iterative in nature and uses newly discovered needles to help characterize the needle class in subsequent iterations. The algorithm has been tested using several data sets. On average, the algorithm made 1.75 calls to the oracle to find each hidden needle. In each case, all hidden needles were detected. Potential application areas include fraud detection, aircraft maintenance, and homeland security. A key assumption in the current work is that the characterization of the needles can be achieved using a small number of literals and only one cutpoint per attribute. In future work, we will relax these constraints to handle more complex needle characterizations.
References 1. Abidi, S., Hoe, K.: Symbolic exposition of medical data-sets: A data mining workbench to inductively derive data-defining symbolic rules. In: Proceedings of the 15th IEEE Symposium on Computer-based Medical Systems (CBMS 2002) (2002) 2. Aggarwal, C., Yu, P.: Outlier detection for high dimensional data. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (2001) 3. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (1993) 4. An, A., Cercone, N.: Discretization of continuous attributes for learning classification rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS, vol. 1574, pp. 509–514. Springer, Heidelberg (1999)
5. Bay, S., Pazzani, M.: Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery 5, 213–246 (2001) 6. Boros, E., Hammer, P., Ibaraki, T., Kogan, A.: A logical analysis of numerical data. Mathematical Programming 79, 163–190 (1997) 7. Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering 12, 292–306 (2000) 8. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004) 9. Clark, P., Boswell, R.: Rule induction with CN2: Some recent improvements. In: Kodratoff, Y. (ed.) EWSL 1991. LNCS, vol. 482. Springer, Heidelberg (1991) 10. Cohen, W.W.: Fast effective rule induction. In: Machine Learning: Proceedings of the Twelfth International Conference (1995) 11. Cohen, W.W., Singer, Y.: A simple, fast, and effective rule learner. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (1999) 12. Dokas, P., Ertoz, L., Kumar, V., Lazarevic, A., Srivastava, J., Tan, P.-N.: Data mining for network intrusion detection. In: Proc. 2002 NSF Workshop on Data Mining (2002) 13. Felici, G., Sun, F., Truemper, K.: Learning logic formulas and related error distributions. In: Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, Heidelberg (2006) 14. Felici, G., Truemper, K.: A MINSAT approach for learning in logic domain. INFORMS Journal of Computing 14, 20–36 (2002) 15. Joshi, M.V., Agarwal, R.C., Kumar, V.: Mining needle in a haystack: classifying rare classes via two-phase rule induction. In: SIGMOD 2001: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pp. 91–102 (2001) 16. Joshi, M.V., Kumar, V., Agarwal, R.: Evaluating boosting algorithms to classify rare classes: Comparison and improvements. In: IEEE International Conference on Data Mining, p. 257 (2001) 17. Lee, W., Stolfo, S.: Real time data mining-based intrusion detection. In: Proceedings of the 7th USENIX Security Symposium (1998) 18. Sequeira, K., Zaki, M.: Admit: Anomaly-based data mining for intrusions. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002) 19. Triantaphyllou, E.: Data Mining and Knowledge Discovery via a Novel Logic-based Approach. Springer, Heidelberg (2008) 20. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6, 7–19 (2004) 21. Yan, R., Liu, Y., Jin, R., Hauptmann, A.: On predicting rare classes with svm ensembles in scene classification. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), April 2003, vol. 3, pp. III–21– III–24 (2003)
An Evidence-Driven Probabilistic Inference Framework for Semantic Image Understanding
Spiros Nikolopoulos¹, Georgios Th. Papadopoulos¹, Ioannis Kompatsiaris¹, and Ioannis Patras²
¹ Informatics and Telematics Institute, CERTH, 6th km Charilaou-Thermi Road, Thessaloniki, Greece
² School of Electronic Engineering and Computer Science, Queen Mary University of London, E1 4NS, London, UK
Abstract. This work presents an image analysis framework driven by emerging evidence and constrained by the semantics expressed in an ontology. Human perception, apart from visual stimulus and pattern recognition, relies also on general knowledge and application context for understanding visual content in conceptual terms. Our work is an attempt to imitate this behavior by devising an evidence driven probabilistic inference framework using ontologies and bayesian networks. Experiments conducted for two different image analysis tasks showed improvement in performance, compared to the case where computer vision techniques act isolated from any type of knowledge or context.
1 Introduction
The use of knowledge and context for indexing multimedia data using higher level semantics was motivated by the gap existing between the limited inference capabilities that restrain machine understanding and the plentiful reasoning potential of the human brain. Driven by the fact that knowledge and context are two dimensions of human perception that are difficult to introduce and exploit at the numeric level of visual features, we investigate the combined use of formally represented semantics and probabilistic inference mechanisms as a means to simulate their impact on image analysis. Evidence is information that, when coupled with the principles of inference, becomes relevant to the support or disproof of a hypothesis. For our framework, visual stimulus is considered evidence when reasoned on the grounds of knowledge and placed in the appropriate context. In this perspective, the input arguments of an evidence-driven probabilistic inference framework consist of visual stimulus, application context and domain knowledge, as can be seen in Fig. 1. Application context and domain knowledge also affect the process of probabilistic inference (Fig. 1) and are considered to be the a priori/fixed information of the framework. On the other hand, the visual stimulus depends on the image to be analyzed and is considered to be the observed/dynamic information of the framework. Domain knowledge, expressed using ontologies, and application context, captured both in conditional probabilities and application specific structures, are
[Figure 1 (diagram): the observed input (visual stimulus) and the a priori input (application context and domain knowledge) feed an evidence layer, which drives probabilistic inference towards hypothesis verification or rejection.]
Fig. 1. Functional relations between the modules of the proposed framework
integrated into a decision model that bears the characteristics of a bayesian network. A methodology allowing the automatic integration of ontology-expressed knowledge into a probabilistic network is employed for this purpose. The goal of this network is to facilitate evidence-driven probabilistic inference in order to verify or reject a hypothesis made about the semantic content of an image. Hence, the tasks carried out by the proposed framework include the statistical analysis of the presented visual stimulus, the adoption of a probabilistic standpoint for coherently handling uncertainty (a feature inherent to multimedia analysis), the representation of domain knowledge and application context (in the form of causality between evidence and hypotheses) in a computationally enabled format, and the establishment of a framework supporting decision making driven by the probabilistic inferences of evidence. Thus, the contribution of our work is that the potential of such techniques, i.e., techniques that integrate explicitly provided knowledge and bayesian networks, is thoroughly examined and evaluated as a means to enhance semantic image understanding by allowing, in a principled/probabilistic way, the fusion of evidence/information obtained using knowledge and context.
2 Related Work
Various works exist in the literature that try to handle uncertainty and take advantage of knowledge and context for enhancing image analysis. Naphade et al. in [1] introduce the concept of “Multijects” as a way to map time sequences of multi-modal, low-level features to higher level semantics using probabilistic rules. In [2] Naphade et al. illustrate the functionality of “Multinets” by introducing bayesian belief networks as a means to model the interaction between concepts and use this contextual information for performing semantic indexing of video content. Luo et al. [3] propose a framework for semantic image understanding based on belief networks. The authors demonstrate the improvement in performance introduced by extracting and integrating in the same bayesian
inference framework both low-level and semantic features. Other approaches that take advantage of knowledge and context include [4], [5], where indoor versus outdoor scene categorization based on low-level features and bayesian networks is performed, and [6], where a bayesian network is utilized as an inference mechanism for facilitating a classification method. However, none of these works incorporate a systematic methodology for integrating domain knowledge, expressed with standard knowledge representation languages, into a probabilistic inference framework. On the other hand, Ding et al. in [7] present their on-going research on combining ontologies and bayesian networks with the aim of introducing uncertainty in ontology reasoning and mapping. However, no attempt is made by the authors to adjust their scheme for serving the purposes of multimedia analysis. In [8], Papadopoulos et al. propose a knowledge-assisted image analysis scheme that combines local and global information for the task of image categorization and region labeling. In this case, a sophisticated decision mechanism that fuses intermediate classification results with contextual information and spatial relations is used to generate the final results. In [9] Athanasiadis et al. propose a scheme that is intended to enhance typical image segmentation algorithms by incorporating semantic information. In this case, fuzzy theory and fuzzy algebra are used to handle uncertainty, while a graph of concepts carrying degrees of relationship on its edges is employed to capture visual context. However, no coupling of ontology-based approaches with probabilistic inference algorithms is attempted. Town in [10] uses ontologies as a structural prior for deciding the structure of a bayesian network using the K2 algorithm. The task facilitated concerns the high-level analysis of surveillance data, but in this work ontologies are mostly treated as hierarchies that do not incorporate any explicitly represented semantic constraints.
3 Framework Description
What people see is not simply a translation of visual stimulus, since knowledge and context have a major effect. The framework of Fig. 1 attempts to simulate visual perception by using evidence as an intermediate layer of image interpretation that combines visual stimulus, domain knowledge and application context. Visual Stimulus: Machine learning methods are able to train a general classifier to recognize a specific object, provided that a sufficiently large number of indicative examples are available. Thus, a classifier F can be trained to recognize a concept c based on the attributes fI of its visual representation I. The output of such a classifier can be either binary or a value ranging between 0 and 1 that indicates the confidence (or uncertainty) of the classification output, i.e., Fc(fIq) = Pr(c | Iq). Pr(c | Iq) expresses the probability that visual representation Iq depicts concept c. Domain Knowledge: Domain knowledge will have to be elucidated and represented in a machine-understandable format in order to be exploitable by our
framework. Ontologies have emerged as a very powerful tool able to express knowledge at different levels of granularity [11]. If we consider NC to be the set of unary predicate symbols that are used to denote concepts, let R be the set of binary predicates that are used to denote relations between concepts and O the algebra defining the allowable operands for these sets, the part of experience that relates to the domain knowledge can be represented using NC, R, O. Thus, using OWL–DL [12], the domain knowledge can be expressed by a structure KD that associates the domain concepts and relations using the allowable operands, i.e., KD = S(NC, R, O), where O ∈ DL. DL stands for “Description Logics” [13] and constitutes a specific set of constructors and restrictions. For instance, common constructors include logical operands of the type intersection, union, disjoint, etc. Our goal is to use these constructors for imposing semantic constraints on the process of image interpretation that cannot be captured by typical machine learning techniques. Application Context: Loosely speaking, the knowledge structure determines a) what evidence to look for, a piece of information that is associated with the domain knowledge, and b) how to qualitatively evaluate its presence (i.e., which evidence supports one hypothesis or another). In this sense, the knowledge structure sets the tracks along which evidence belief is allowed to propagate. However, no support is provided to the decision making process in terms of where to look for evidence and how to quantitatively evaluate its presence (i.e., how much each hypothesis is affected by the existence of one piece of evidence or another). The role of KD is to capture information about the domain of discourse in general, not to deliver information concerning the context of the analysis process at hand. This is the role of the application context, which typically incorporates application-specific information. If we let app denote the set of application-specific information (where to look for evidence in our case) and Wij a function that quantifies the influence (i.e., measured as the frequency of co-occurrence) of concept ci on cj, the application context can be expressed as a structure of the type X = S(app, W). Evidence-Driven Probabilistic Inference: An evidence-driven probabilistic inference scheme should be able to acquire what evidence to look for from NC ∈ KD, use context information app ∈ X to search for this evidence and apply the trained classifiers Fc to obtain the respective degrees of confidence. Subsequently, it should formulate an initial hypothesis around a concept c ∈ NC for all potential decisions, use the evidence to trigger probabilistic inference, propagate evidence beliefs using the inference tracks R ∈ KD and the corresponding belief quantification functions Wij ∈ X, re-estimate the values for all hypotheses and finally decide which of the hypotheses should be verified or rejected. KD and app ∈ X determine which of the available concepts should be included in the hypotheses set cH and which in the evidence set cE. In this case, if we denote by H(Iq) = {Pr(c1^H | Iq), ..., Pr(cM^H | Iq)} the estimated degrees of confidence (i.e., prior probabilities) of the concepts belonging to the hypotheses set and by E(Iq) = {Pr(c1^E | Iq), ..., Pr(cK^E | Iq)} the estimated degrees of confidence of the
Table 1. Legend of Introduced Terms

Term | Symbol | Role
Trained Classifier | Fc | Degree of confidence that Iq depicts c.
Domain Knowledge | KD = S(NC, R, O) | Determines what evidence to look for; qualitative relations between evidence and hypotheses.
Application Context | X = S(app, W) | Determines where to look for evidence (i.e., application-specific information, app); quantitative relations between evidence and hypotheses, Wij (i.e., frequency of co-occurrence).
Hypotheses | H(Iq) = {Pr(c1^H | Iq), ..., Pr(cM^H | Iq)} | Degrees of confidence for the concepts of cH, as determined by NC ∈ KD and app ∈ X, obtained by applying classifiers similar to Fc.
Evidence | E(Iq) = {Pr(c1^E | Iq), ..., Pr(cK^E | Iq)} | Degrees of confidence for the concepts of cE, as determined by NC ∈ KD and app ∈ X, obtained by applying classifiers similar to Fc.
Evidence-driven probabilistic inference | c = arg⊗_cH(H́(Iq)), where H́(Iq) = Pr(cH | H(Iq), R, O, W, E(Iq)) | Performs inference by calculating H́(Iq), using E(Iq) as triggering evidence, R, O ∈ KD as belief propagation tracks and Wij ∈ X as causality quantification functions.
concepts belonging to the evidence set, evidence-driven probabilistic inference is the process of calculating the posterior probabilities H́(Iq) of H(Iq), given the evidence values E(Iq) and information coming from the knowledge R, O and the context Wij. Thus, the proposed framework achieves semantic image interpretation in the following way: c = arg⊗_cH(H́(Iq)), where H́(Iq) = Pr(cH | H(Iq), R, O, Wij, E(Iq)) and ⊗ is an operator (e.g., max) that depends on the specifications of the analysis task (details are provided in Section 6.2). Table 1 outlines the basic terms introduced throughout the description of the proposed framework, while their functional relations are demonstrated in Fig. 1. Based on these modules we aim to develop a decision support framework that derives directly from the knowledge structure, retains intact the inference tracks of logic, wraps the causality links between domain concepts probabilistically and handles uncertain estimations meaningfully. The ability of Bayes' theorem to compute the posterior probability of a hypothesis by relating the conditional
and prior probabilities of two random variables was the reason for considering the use of bayesian networks for our purpose.
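To make the decision rule concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation): the hypothesis and evidence sets are plain dictionaries of classifier confidences, an abstract `infer` callable stands in for the bayesian-network update of Section 4, and the ⊗ operator is instantiated as max. All names and weights are illustrative.

```python
# Minimal sketch of c = arg max_{c_H} Pr(c_H | H(Iq), R, O, W, E(Iq)).
# The `infer` callable and the weights W are illustrative placeholders.

def decide(hypotheses, evidence, infer):
    """hypotheses: {concept: prior confidence from classifiers}
       evidence:   {concept: observed confidence from classifiers}
       infer:      function returning posterior beliefs for the hypothesis
                   concepts given the evidence (e.g., a bayesian-network query)."""
    posteriors = infer(hypotheses, evidence)        # H'(Iq)
    return max(posteriors, key=posteriors.get)      # operator (x) instantiated as max

# Toy usage with a dummy inference step that boosts hypotheses
# co-occurring with strong evidence (co-occurrence weights are made up).
W = {("Grass", "Roadside"): 0.8, ("Sea", "Seaside"): 0.9}

def dummy_infer(hypotheses, evidence):
    post = dict(hypotheses)
    for (ev, hyp), w in W.items():
        if ev in evidence:
            post[hyp] = post.get(hyp, 0.0) + w * evidence[ev]
    return post

print(decide({"Roadside": 0.20, "Seaside": 0.21}, {"Grass": 0.56}, dummy_infer))
```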
4 Bayesian Networks and Probabilistic Inference
Bayes' theorem can be used to update or revise beliefs in light of new evidence that is estimated with a certain amount of confidence. Adjusting this description to the formulation of Section 3, every time a classifier is applied on a visual representation, a hypothesis is formed around concept c and the visual representation Iq. The goal is to verify or reject the hypothesis stating that Iq depicts c, using the evidence E(Iq). A bayesian network is a directed acyclic graph G = (V, A) whose nodes v ∈ V represent variables and whose arcs a ∈ A encode the conditional dependencies between them. Hence, a bayesian network can be used to facilitate three dimensions of perception: a) provide the means to store and utilize domain knowledge KD, an operation that is served by the network structure and prior probabilities, b) organize and make accessible information coming from context X = S(app, W), which is supported by the Conditional Probability Tables (CPTs) attached to each network node, and c) allow the propagation of evidence beliefs using message passing algorithms, an action facilitated by Bayes' theorem. A methodology for consistently transforming ontologies into bayesian networks is essential for enabling evidence-driven probabilistic inference. For the purposes of our work we adopted a variation of the methodology introduced in [7]. The proposed variation is mainly focused on the method employed for calculating the CPTs, as detailed later in this section. Network Structure: Intuitively, deciding on the structure of a bayesian network based on an ontology can be seen as determining a function that maps ontological elements (i.e., concepts and relations) to graph elements (i.e., nodes and arcs). All translation rules described in [7] were implemented for determining the network structure out of an OWL ontology. The resulting network consists of concept nodes ncn and control nodes ncl (both having two states, i.e., true and false) that are used to model the domain concepts and the associations between them, respectively. At this point, it is important to notice that the methodology described in [7] is only able to handle a limited set of constructors, namely owl:intersectionOf, owl:unionOf, owl:complementOf, owl:equivalentClass and owl:disjointWith, and as a consequence these are the constructors supported by our framework. Parameter Learning: While the network structure encodes the qualitative characteristics of causality (i.e., which nodes affect which), network parameters are used to quantify it (i.e., how much a node is influenced by its neighbors). CPTs are used to capture the amount of this influence/impact and make it available for inferencing as part of the context structure Wij ∈ X. The methodology adopted in this paper differs from [7] with respect to the estimation of the network's original probability distribution. While in [7] this information is provided
explicitly by an expert, in our case it is learned from observation data, using the Expectation Maximization (EM) algorithm [14]. More specifically, the prior and conditional probabilities for each concept node ncn of the bayesian network are initially calculated before considering any DL constructors. Subsequently, the DL constructors are migrated by inserting into the resulting network the appropriate control nodes ncl. Once the structural translation has been completed, the CPTs for all concept nodes ncn are re-calculated. Since no observation data are available for the control nodes ncl, these nodes are treated as latent variables with two states (i.e., true and false). The last step is to specifically set the CPTs of all control nodes ncl as they appear in [7] and fix their states to “True”, so as to enforce the semantic constraints expressed by the DL constructors. Evidence-Driven Probabilistic Inference: A framework that will allow beliefs to seamlessly flow over the established network is required. Pearl [15] introduced a message passing mechanism where messages are exchanged between parent and child nodes carrying the information required to update their beliefs. In order to overcome the fact that Pearl's algorithm suffers from scalability issues, Lauritzen and Spiegelhalter [16] exploit a range of local representations for the network joint probability distribution, introducing the junction tree [17]. To the best of our knowledge, this is the most efficient and scalable belief propagation algorithm and will be the one used in our experiments.
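As an illustration of this pipeline, the following is a minimal, hypothetical sketch using the pgmpy library (an assumption; the authors do not specify their implementation, and pgmpy class names vary across versions). It builds a toy two-arc fragment of such a network with hand-picked CPTs, runs junction-tree-based belief propagation, and queries a category node given regional evidence.

```python
# Toy fragment: a category concept node with two regional concept children.
# All probabilities are illustrative, not learned values from the paper.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import BeliefPropagation  # junction-tree based inference

model = BayesianNetwork([("Roadside", "Grass"), ("Roadside", "Road")])

cpd_cat = TabularCPD("Roadside", 2, [[0.8], [0.2]])             # P(false), P(true)
cpd_grass = TabularCPD("Grass", 2, [[0.9, 0.3], [0.1, 0.7]],
                       evidence=["Roadside"], evidence_card=[2])
cpd_road = TabularCPD("Road", 2, [[0.95, 0.4], [0.05, 0.6]],
                      evidence=["Roadside"], evidence_card=[2])
model.add_cpds(cpd_cat, cpd_grass, cpd_road)
model.check_model()

bp = BeliefPropagation(model)
# Enter regional evidence (Grass observed as present) and read the updated
# belief of the category hypothesis, mirroring the evidence-driven update.
posterior = bp.query(variables=["Roadside"], evidence={"Grass": 1})
print(posterior)
```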
5 Framework Functional Settings
5.1 Image Analysis Tasks
For carrying out an image analysis task using the proposed framework it is important to specify the following: a) formulate the hypothesis set H(Iq) before initiating the decision mechanism, b) determine the methods used to obtain the initial confidence values of the evidence E(Iq), and c) clarify what is considered to be the task-specific analysis context app ∈ X, used to derive the evidence. Image Categorization involves selecting a category concept ci describing the image as a whole. A hypothesis is formulated around each of the categories and with respect to the overall image, H(Iq) = {Pr(ci | Iq) : i = 1, ..., n}, where n is the number of category concepts. Global classifiers (i.e., models trained using image global information) are employed to estimate the initial likelihood for each hypothesis, Pr(ci | Iq). Regional concept information, obtained by analyzing specific regions Iq^{sj} of the image at hand, is considered to be the source of contextual information app ∈ X of this task. Local classifiers (i.e., models trained using image regional information) are applied on these regions and generate a set of confidence values that constitute the analysis evidence, E(Iq) = {Pr(ći | Iq^{sj}) : i = 1, ..., k & j = 1, ..., m}, where k is the number of regional concepts and m the number of identified regions. The distinction between the category concepts ci (i.e., hypothesis concepts cH in this case) and regional concepts ći (i.e., evidence concepts cE in this case), as well as their exact nature, is determined by KD.
Localized Image Region Labeling annotates each of the identified regions with one of the available regional concepts ći. A hypothesis is formulated for each of the available regional concepts and with respect to each of the regions identified in the image, H(Iq) = {Pr(ći | Iq^{sj}) : i = 1, ..., k & j = 1, ..., m}, where k is the number of regional concepts and m is the number of identified regions. Regional classifiers are utilized to estimate the initial likelihood for each of the formulated hypotheses, Pr(ći | Iq^{sj}) with i = 1, ..., k & j = 1, ..., m. In this case, global image information is considered to be the source of contextual information app ∈ X, and the confidence values for each of the category concepts ci constitute the analysis evidence of this task, E(Iq) = {Pr(ci | Iq) : i = 1, ..., n}, where n is the number of category concepts. Once again, the knowledge structure KD determines which concepts should be considered category concepts and which regional. However, since the nature of this task is different from image categorization, in this case ći ≡ cH and ci ≡ cE. It is clear that the objective of our framework in both tasks is to operate on top of the classifiers' outcome with the aim of compensating for misleading decisions. Intuitively, the framework incorporates contextual information by favoring the co-occurrence of evidence that is known from experience to correlate. Additionally, the framework also attempts to exploit semantic restrictions, stating for instance that two concepts are disjoint. Therefore, provided that the majority of evidence coming from context is relatively strong and accurate, the framework is expected to make the correct decision by absorbing any misleading cues produced by the erroneous analysis of visual stimulus.
5.2 Low-Level Image Processing
For low-level image processing we employed the scheme utilized in [8]. Four different visual descriptors proposed by the MPEG-7 standard [18], namely Scalable Color, Homogeneous Texture, Region Shape and Edge Histogram, comprised the feature space. An extension of the Recursive Shortest Spanning Tree algorithm [19] was employed for producing a segmentation mask S = {si, i = 1, ..., N}, with si representing the identified spatial regions. Finally, Support Vector Machines (SVMs) [20], as implemented by the libsvm library [21], were chosen to construct the statistically trained models, using the distance from the decision boundary in the kernel space as a way to measure the degree of confidence.
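The authors use libsvm; as a rough illustration only, the following hypothetical scikit-learn sketch shows one common way to turn the signed distance from the decision boundary into a confidence value in [0, 1] (the sigmoid squashing and all data below are assumptions, not the paper's exact procedure).

```python
# Illustrative only: map an SVM's decision-boundary distance to a [0, 1] confidence.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])  # fake feature vectors
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X, y)
margin = clf.decision_function(X[:3])        # signed distance in kernel space
confidence = 1.0 / (1.0 + np.exp(-margin))   # squash to (0, 1); one possible choice
print(confidence)
```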
6 Experimental Study
The purpose of our experimental setup was to demonstrate the improvement in performance introduced by exploiting context and knowledge, compared to schemes that rely solely on low-level visual information. A dataset from the “Personal Collection” domain was selected for testing the proposed framework using the analysis tasks of Section 5.1.
6.1 Experimental Platform
Test Set Characteristics: A collection I of 648 JPEG images comprised the test platform. Six different categories formulating the global (i.e., category) concepts
lexicon CG = {Countryside buildings, Seaside, Rockyside, Forest, Tennis, Roadside} ∈ NC, were used to manually characterize all 648 images at the global level. Respectively, 25 more fine-grained concepts constituting the local (i.e., regional) concepts lexicon CL = {Building, Roof, Tree, Stone, Grass, Ground, Dried-plant, Trunk, Vegetation, Rock, Sky, Person, Boat, Sand, Sea, Wave, Road, Road-line, Car, Court, Court-line, Board, Gradin, Racket} ∈ NC, were used to manually annotate the images at the region level. A domain expert was employed to provide the logical relations between the elements of CG and CL using the OWL-DL ontology language (Fig. 2). The matching bayesian network, automatically derived according to the methodology presented in Section 4, is depicted in Fig. 3. Calculating the prior probabilities and CPTs for each node of the constructed network requires a set of observation data. A subset of the manually annotated image set containing 324 samples was utilized to provide the parameter learning algorithm with the necessary observations and also to train the necessary classifiers, Fc, for each of the elements in CG and CL. Fig. 3 depicts the prior probabilities of all network nodes calculated by the learning algorithm. The remaining 324 images were used for testing.
6.2 Performance Evaluation
For measuring efficiency, recall, precision and F-Measure were utilized. Based on the analysis tasks specified in Section 5.1, we have conducted the following experiments. Image Categorization: In this experiment we measure the efficiency of categorizing all test images into one of the categories in CG, using three configurations
Fig. 2. Manually constructed ontology for describing the “Personal Collection” domain
Fig. 3. Bayesian network automatically derived from the ontology of Fig. 2
varying in the amount of utilized context and knowledge. In the first configuration we assess the performance of image categorization based solely on the output of the global classifiers. In the second configuration, information coming from the local classifiers is incorporated into the network to help correct the decisions erroneously taken by the global classifiers. In this case, context and knowledge are utilized to extract the existing evidence and facilitate the process of evidence-driven probabilistic inference. However, no semantic constraints (i.e., DL constructors) originating from the domain are incorporated into the decision process. This is the purpose of the last configuration, where such constraints are incorporated into the bayesian network using the methodology presented in Section 4. After formulating the hypotheses set for all category concepts, the framework looks for the presence of all regional concepts determined in KD. All classifiers, global and local, are applied to formulate one set of confidence values for the image as a whole, LKglobal = {Pr(ci | Iq) : ∀ci ∈ CG}, and one set per identified image region, LKlocal = {Pr(cj | Iq^{sk}) : ∀cj ∈ CL & ∀sk ∈ S}. All values of LKglobal and the maximum per-row values of LKlocal are inserted as evidence into the bayesian network. Subsequently, the network is updated to propagate the evidence impact and the category with the highest resulting likelihood is selected as the final decision (i.e., in this case ⊗ ≡ max). Table 2 summarizes the results for each of the framework configurations mentioned earlier. The performance achieved by the framework using the second configuration (row II of Table 2) is improved by ≈ 5% (in terms of the F-Measure metric)
Table 2. Image Categorization Evaluation Matrix (%)

(I) Global Classifiers only
      Tennis   Roadside  Rockyside  Seaside  Forest  C. Build.  Avg
Re    100.00   68.42     68.62      85.71    76.66   30.00      71.57
Pr    83.33    69.64     70.00      67.60    63.88   100.00     75.74
F-M   90.90    69.02     69.30      75.59    69.69   46.15      70.11

(II) Global Classifiers + Local Classifiers + Know. & Context
      Tennis   Roadside  Rockyside  Seaside  Forest  C. Build.  Avg
Re    98.00    73.68     64.70      91.07    71.66   54.00      75.52
Pr    90.74    64.61     76.74      70.83    71.66   90.00      77.43
F-M   94.23    68.85     70.21      79.68    71.66   67.50      75.36

(III) Global Classifiers + Local Classifiers + Know. & Context + Sem. Constraints
      Tennis   Roadside  Rockyside  Seaside  Forest  C. Build.  Avg
Re    94.00    73.68     70.58      91.07    71.66   56.00      76.17
Pr    100.00   64.61     76.59      69.86    70.49   90.32      78.65
F-M   96.90    68.85     73.46      79.06    71.07   69.13      76.41
compared to the first configuration (row I of Table 2). We will use the running example of Fig. 4 to demonstrate how evidence collected using regional information can revise a decision erroneously taken by a global classifier. By applying all global classifiers on the test image of Fig. 4 we get the probabilities in the “Global Classifiers” table. According to these values the image should be characterized as Seaside, since the corresponding classifier exhibits maximum confidence. The situation remains unaltered, as shown in the second row of the “Belief Evolution” table, when the confidence values of all global classifiers are inserted into the network. However, this is not the case when the regional evidence, i.e., the maximum value from each column of the “Local Classifiers” table, is consecutively inserted into the bayesian network. The last four rows of the “Belief Evolution” table illustrate how the likelihood of each category evolves in the light of new evidence. Eventually the correct category, Roadside, is found to exhibit maximum likelihood. What is interesting is the fact that only two out of four local classifiers (regions 1 and 3) succeeded in correctly predicting the depicted regional concept. Nevertheless, this information was sufficient for the evidence-driven image analysis framework to infer the correct prediction, since the relation between the evidence grass identified in region 1 and the Roadside category was strong enough to raise the inferred likelihood of this category above the corresponding value of Seaside, a category that receives no support from this evidence, as shown in Fig. 2. By examining the confusion matrix of Table 3, which corresponds to the second configuration of our framework, in conjunction with Fig. 2, where the amount of evidence shared between different image categories is depicted, it is clear that the system tends to confuse categories that share many visual characteristics. Another interesting observation derived from Fig. 2 concerns the small amount of evidence that Tennis shares with the rest of the image categories. This is in fact a piece of information (i.e., a semantic constraint) that originates from the domain and can be used to aid image analysis (i.e., the third configuration of our
[Figure 4 (tables within the figure): the global classifier confidences for the six categories, the local classifier confidences per identified region (Region1–Region4) for all regional concepts, and a “Belief Evolution” table tracing the category likelihoods from the prior, through the insertion of the global confidences, to the successive insertion of the evidence from Regions 1–4.]
Fig. 4. An example of evidence-driven probabilistic inference for image categorization
framework). Indeed, by enhancing the ontology of Fig. 2 to associate Tennis with all other concepts in CG through the “owl:disjointWith” DL constructor, reconstructing the bayesian network to include the additional control nodes (see Fig. 3) and calculating the updated CPTs, the performance is further increased, reaching an overall improvement of ≈ 6.5% (row III of Table 2). Region Labeling: The use of context and knowledge for region labeling was motivated by the fact that information concerning the image as a whole can potentially favor one region interpretation over another. Hence, it is clear that the proposed framework can basically benefit region labeling when there is a conflict between the predictions suggested by the global and local classifiers.

Table 3. Confusion Matrix for Image Categorization (2nd Configuration), %
              Tennis  Roadside  Rockyside  Seaside  Forest  C. Buildings
Tennis        98.00   0         0          2.00     0       0
Roadside      1.75    73.68     0          8.77     10.53   5.26
Rockyside     5.88    3.92      64.71      5.88     19.61   0
Seaside       0       5.36      3.57       91.07    0       0
Forest        0       10.00     8.33       10.00    71.67   0
C. Buildings  2.00    24.00     6.00       12.00    2.00    54.00
If we let Child(ck : ck ∈ CG) = (cj : ∀cj ∈ CL where k →parent j) be the subset of CL corresponding to the child nodes of ck according to KD, LKglobal = {Pr(ci | Iq) : ∀ci ∈ CG} be the set of confidence values obtained from the global classifiers and LKlocal^{sw} = {Pr(cj | Iq^{sw}) : ∀cj ∈ CL} be the set of confidence values obtained from the local classifiers applied on region sw, a conflict occurs when cl ∉ Child(cg) with g = arg maxi(LKglobal) and l = arg maxj(LKlocal^{sw}). In this case, since there is no reason to trust one suggestion over another, we make two hypotheses. The first assumes that the suggestion of the global classifier is correct and a global concept cg is selected such that g = arg maxi(LKglobal). Afterwards, the local concept cl with maximum confidence that is included in the child node set of cg is selected, such that l = arg maxj(LKlocal^{sw}) and cl ∈ Child(cg). Both confidence values of cg and cl are inserted into the network as evidence and the overall impact on the likelihood of the hypothesis stating that the region under examination sw depicts cl is measured. The second approach considers that the suggestion of the local classifier ćl is correct, selected such that ĺ = arg maxj(LKlocal^{sw}). The confidence values of the global classifiers that correspond to the parent nodes of ćl are examined and the one cǵ with maximum value is selected, such that ǵ = arg maxi(LKglobal) and cǵ ∈ Parent(ćl). As in the previous case, both likelihoods are inserted into the network and the overall impact on the likelihood of the hypothesis stating that the examined region sw depicts ćl is measured. Eventually, these values are compared and the concept corresponding to the largest value is chosen (i.e., this is the functionality of the ⊗ operator for this case). If no conflict occurs, the concept corresponding to the local classifier with maximum confidence is selected. Fig. 5 presents the evaluation results and shows that an average increase of approximately 4.5% is accomplished when the proposed framework is used. Regional concepts that exhibit zero hits from the local classifiers (i.e., Racket, Road line, Car, Glacier, Stone) are not included in the evaluation results.
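A minimal, hypothetical sketch of this conflict-resolution step is given below (not the authors' code; the `bn_posterior` callable stands in for inserting the two confidence values into the bayesian network and reading back the posterior of the region hypothesis).

```python
# Sketch of the region-labeling conflict resolution described above.
# child_of maps each category concept to its regional (child) concepts in K_D.

def label_region(lk_global, lk_local, child_of, bn_posterior):
    """lk_global: {category: confidence}, lk_local: {regional concept: confidence}."""
    g = max(lk_global, key=lk_global.get)               # best global suggestion
    l = max(lk_local, key=lk_local.get)                 # best local suggestion
    if l in child_of[g]:                                 # no conflict
        return l

    # Hypothesis 1: trust the global classifier, pick its best-supported child.
    l1 = max((c for c in lk_local if c in child_of[g]), key=lk_local.get)
    score1 = bn_posterior(region_concept=l1, category=g,
                          local_conf=lk_local[l1], global_conf=lk_global[g])

    # Hypothesis 2: trust the local classifier, pick its best-scoring parent category.
    parents = [cat for cat, children in child_of.items() if l in children]
    g2 = max(parents, key=lk_global.get)
    score2 = bn_posterior(region_concept=l, category=g2,
                          local_conf=lk_local[l], global_conf=lk_global[g2])

    return l1 if score1 >= score2 else l                  # keep the larger belief (operator (x))
```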
[Figure 5 (bar chart): “Localized Region Labeling Evaluation”; F-Measure per regional concept and on average, comparing “Regional Classifiers Only” against “Regional Classifiers using Know. & Context”.]
Fig. 5. Evaluation results for localized region labeling
7 Conclusions and Future Work
The problem of using visual evidence to assist image analysis has been thoroughly treated and a concrete framework addressing the identified issues has been proposed. The suitability of ontologies and bayesian networks for imitating some of the fundamental aspects of visual perception has been investigated. Experiments demonstrated that the proposed framework is able to analyze images using different configurations, in terms of the amount of utilized context and knowledge, and manages to achieve statistically significant improvement with respect to the solutions relying solely on visual stimulus. One important prerequisite for allowing the proposed framework to maximize the performance gain is to operate on a sufficiently large amount of training data. This is hindered by the fact that it is a cumbersome procedure to manually annotate a sufficiently large number of images, especially at the region level, and in this way ensure that the sample data tendencies converge to the true domain statistics. A solution to this problem could be to adjust the underlying image processing module so as to allow handling of large multimedia corpora that are being populated automatically, as in the case of the WWW and Internet users. Given the fact that social sites like Flickr (http://www.flickr.com/) accommodate image corpora that are being populated with hundreds of user-tagged images on a daily basis, and taking into consideration that the literature has already reported efforts on performing localized region-label association from weakly annotated data [22], pipelining such schemes with the proposed framework may help overcome some of the problems deriving from the use of limited-size training sets.
Acknowledgment. This work was funded by the X-Media project (www.xmedia-project.org) sponsored by the European Commission as part of the Information Society Technologies (IST) programme under EC grant number IST-FP6-026978.
References
1. Naphade, M.R., Kristjansson, T.T., Frey, B.J., Huang, T.S.: Probabalistic multimedia objects (multijects): A novel approach to video indexing and retrieval in multimedia systems. In: ICIP, vol. (3), pp. 536–540 (1998)
2. Naphade, M.R., Huang, T.S.: A probabilistic framework for semantic video indexing, filtering, and retrieval. IEEE Transactions on Multimedia 3(1), 141–151 (2001)
3. Luo, J., Savakis, A.E., Singhal, A.: A bayesian network-based framework for semantic image understanding. Pattern Recognition 38(6), 919–934 (2005)
4. Luo, J., Savakis, A.E.: Indoor vs outdoor classification of consumer photographs using low-level and semantic features. In: ICIP, vol. (2), pp. 745–748 (2001)
5. Kane, M.J., Savakis, A.E.: Bayesian network structure learning and inference in indoor vs. outdoor image classification. In: ICPR, vol. (2), pp. 479–482 (2004)
6. Matos, L.N., de Carvalho, J.M.: Combining global and local classifiers with bayesian network. In: ICPR, vol. (3), pp. 1212–1215 (2006)
7. Ding, Z., Peng, Y., Pan, R.: A bayesian approach to uncertainty modeling in owl ontology. In: Int. Conf. on Advances in Intelligent Systems - Theory and Applications (November 2004)
8. Papadopoulos, G.T., Mezaris, V., Kompatsiaris, I., Strintzis, M.G.: Combining global and local information for knowledge-assisted image analysis and classification. EURASIP J. Adv. Signal Process (2) (2007)
9. Athanasiadis, T., Mylonas, P., Avrithis, Y., Kollias, S.: Semantic image segmentation and object labeling. IEEE Transactions on Circuits and Systems for Video Technology 17(3), 298–312 (2007)
10. Town, C.: Ontological inference for image and video analysis. Machine Vision and Applications 17(2), 94–115 (2006)
11. Cardoso, J.: The semantic web vision: Where are we? IEEE Intelligent Systems 22(5), 84–88 (2007)
12. McGuinness, D.L., van Harmelen, F.: OWL web ontology language overview. W3C recommendation, W3C (February 2004), http://www.w3.org/TR/2004/REC-owl-features-20040210/
13. Horrocks, I.: Description logics in ontology applications. In: Beckert, B. (ed.) TABLEAUX 2005. LNCS, vol. 3702, pp. 2–13. Springer, Heidelberg (2005)
14. McLachlan, G.J., Krishnan, T.: The EM algorithm and extensions, 2nd edn. John Wiley and Sons, Chichester (1997)
15. Pearl, J.: Fusion, propagation, and structuring in belief networks. Artif. Intell. 29(3), 241–288 (1986)
16. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems, 415–448 (1990)
17. Jensen, F.V., Jensen, F.: Optimal junction trees. In: Kaufmann, C.M. (ed.) Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Mateo (1994)
18. Manjunath, B.S., Ohm, J.R., Vinod, V.V., Yamada, A.: Colour and texture descriptors. IEEE Trans. Circuits and Systems for Video Technology, Special Issue on MPEG-7 11(6), 703–715 (2001)
19. Adamek, T., O'Connor, N., Murphy, N.: Region-based segmentation of images using syntactic visual features. In: WIAMIS 2005, Montreux, Switzerland (2005)
20. Scholkopf, B., Smola, A., Williamson, R., Bartlett, P.: New support vector algorithms. Neural Networks 22, 1083–1121 (2000)
21. Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines (2001)
22. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.A.: Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002)
Detection of Masses in Mammographic Images Using Simpson's Diversity Index in Circular Regions and SVM
André Pereira Nunes1, Aristófanes Corrêa Silva1, and Anselmo Cardoso de Paiva2
1 Federal University of Maranhão – UFMA, Department of Electrical Engineering, Av. dos Portugueses, SN, Campus do Bacanga, Bacanga, 65085-580, São Luís, MA, Brazil
[email protected], [email protected]
2 Federal University of Maranhão – UFMA, Department of Informatics, Av. dos Portugueses, SN, Campus do Bacanga, Bacanga, 65085-580, São Luís, MA, Brazil
[email protected]
Abstract. Breast cancer is one of the major causes of death among women all over the world. Presently, mammographic analysis is the most widely used method for early detection of abnormalities. This paper presents a computational methodology to help the specialist with this task. In the first step, the K-Means clustering algorithm and the Template Matching technique are used to detect suspicious regions. Next, the texture of each region is described using the Simpson's Diversity Index, which is used in Ecology to measure the biodiversity of an ecosystem. Finally, the texture information is used by an SVM to classify the suspicious regions into two classes: masses and non-masses. The tests demonstrate that the methodology achieves 79.12% accuracy, 77.27% sensitivity, and 79.66% specificity.
Keywords: Mammography, Computer-Aided Detection, K-Means, Template Matching, Simpson's Diversity Index, SVM.
1 Introduction
Cancer is a disease characterized by an abnormal reproduction of cells, which invade and destroy adjacent tissues and may even spread to other parts of the body through a process known as metastasis. Among women, breast cancer is one of the major causes of death. In Brazil, the estimate for the year 2009 points to 175,750 new cases of cancer occurring among the female population, breast cancer being the most incident, with 49,400 new cases [1]. Presently, breast radiography, also called mammography, is the most widely used tool to detect this kind of cancer at its starting stage. Mammography makes possible the identification of anomalies in their initial development, a determining factor for success in treatment. The sensitivity of this kind of exam, however, depends on different factors, such as the size and location of the anomaly, the density of the breast tissue, the quality of the technical resources and the radiologist's ability [2].
For these reasons, several image processing and analysis techniques have been developed in order to create Computer-Aided Detection (CAD) / Diagnosis (CADx) systems that can help the specialist with the task of interpreting mammographic images [3]. In [4], a methodology is presented for discriminating and classifying regions extracted from mammograms into mass and non-mass classes through spatial statistics, like Moran's Index and Geary's Coefficient. The work uses the Support Vector Machine (SVM) for classification of candidates, obtaining 99.64% accuracy. In [5], another methodology that uses SVM for classification is proposed, with 89.30% accuracy. It uses the Growing Neural Gas algorithm for segmentation of mass candidates and Ripley's K Function to describe their texture. In [6], the performance of a mass classification methodology is analyzed when the feature extraction is done on mammographic images obtained consecutively in time. The main goal is to improve the description of the masses using information present in more than one mammography of the same breast, obtained in a consecutive way. In this work, we investigate the efficiency of an SVM-based methodology that uses the Simpson's Diversity Index to describe the texture of suspicious regions in mammography images. The motivation is to contribute to CAD/CADx systems development with another way to characterize breast masses. The Simpson's Diversity Index is traditionally used in Ecology to measure the biodiversity of an ecosystem but, to the best of our knowledge, it has never been applied to medical images before. This paper is organized into five sections. Section 2 presents the background needed for the methodology development, which is further described in Section 3. Afterwards, the obtained results are discussed in Section 4 and the conclusion is presented in Section 5.
2 Background
This section gives the main theoretical basis that underlies the presented methodology. First, we describe the Simpson's Diversity Index and the Support Vector Machine (SVM) technique. Next, the feature selection scheme using Genetic Search and SVM is presented. Finally, the performance indicators used to validate the methodology are described.
2.1 Simpson's Diversity Index
During digital image processing, it is common to extract a descriptive group of measurable features from the regions of interest. One kind of feature especially useful for medical image processing, such as radiological images, is texture. Texture is usually defined as the visual or touch sensation provided by the surface of objects. In image processing, texture is the information that defines the pixel values distribution in space on a region of an image [7]. A classic way to describe the texture of a grayscale image is the statistical approach, which describes the texture as rules that govern the distribution and the
relation between the gray-levels of a region. Statistical measurements include, among others, the contrast, energy, entropy, correlation, uniformity and moment, which are obtained from the Co-occurrence Matrix [8]. This work proposes the use of the Simpson's Diversity Index, a statistic originally used in Ecology to quantify the diversity of an ecosystem [9], to describe texture in mammographic images. The index is computed for a finite population of individuals through Equation 1.

D = \frac{\sum_{i=1}^{S} n_i (n_i - 1)}{N (N - 1)} .    (1)
where S represents the number of species, N is the total number of individuals in the region, and ni the number of individuals of a given species i. The value of D varies from 0 to 1, where 0 represents infinite diversity and 1 represents absence of diversity. A more intuitive manner of representation, however, is to subtract the value of D from the unity (1 − D), this way obtaining a result that is closer to 1 the greater the diversity. To apply the Simpson's Diversity Index on a given region of the image, Equation 1 is used in the following manner: S determines the number of different gray-levels in the image, N is the total quantity of pixels and ni the number of pixels that present intensity i, obtained directly from the image histogram.
2.2 Support Vector Machine
The Support Vector Machine (SVM), introduced by V. Vapnik in 1995, is a method to estimate the function that classifies input data into two classes [10]. The basic idea of SVM is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. The SVM term comes from the fact that the points in the training set which are closest to the decision surface are called support vectors. SVM achieves this by the structural risk minimization principle, which is based on the fact that the error rate of a learning machine on the test data is bounded by the sum of the training error rate and a term that depends on the Vapnik-Chervonenkis dimension. The process starts with a training set of points xi ∈ ℜn, i = 1, 2, ..., l, where each point xi belongs to one of two classes identified by the label yi ∈ {−1, 1}. The goal of maximum margin classification is to separate the two classes by a hyperplane so that the distance to the support vectors is maximized. The construction can be thought of as follows: each point x in the input space is mapped to a point z = Φ(x) of a higher dimensional space, called the feature space, where the data are linearly separated by a hyperplane. The nature of the data determines how the method proceeds. There are data that are linearly separable, nonlinearly separable, and data for which separation is impossible. The key property in this construction is that we can write our decision function using a kernel function K(x, y), which is given by the function Φ(x) that maps the input space into the feature space. Equation 2 defines the decision surface.
f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b .    (2)
where K(x, xi) = Φ(x) · Φ(xi), and the coefficients αi and b are the solutions of a convex quadratic programming problem [11], namely
\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i ,    (3)

subject to y_i [ w^T \phi(x_i) + b ] \geq 1 - \xi_i , \; \xi_i \geq 0 .

where C > 0 is a parameter to be chosen by the user, which corresponds to the strength of the penalty on errors, and the ξi's are slack variables that penalize training errors. Classification of a new data point x is performed by computing the sign of the right side of Equation 2. An important family of kernel functions is the Radial Basis Function, more commonly used for pattern recognition problems, which has been used in this paper, and is defined by Equation 4:
K(x, y) = e^{-\lambda \| x - y \|^{2}} .    (4)
where λ > 0 is a parameter defined by the user.
2.3 Feature Selection Using Genetic Search and SVM
Genetic Algorithm (GA) is a class of algorithms that mimics nature's evolution strategy to evolve a population of solutions in optimization problems [12][13]. The process, also called Genetic Search, maintains and evolves a population of chromosomes as potential solutions to an optimization problem. A new population of chromosomes is reproduced by applying genetic operators such as mutation and crossover on the parent chromosomes in a reproduction step. In the fitness evaluation step, fitness values are assigned to the new chromosomes based on an objective function. After that, the population undergoes a natural selection process that selects the fittest individuals to mate and reproduce. The steps above are repeated until a specified number of generations is reached. GA has been applied effectively to solve a wide spectrum of optimization problems, including feature selection problems in bioinformatics or biomedical areas [14]. The feature selection problem refers to the task of identifying and selecting a useful subset of attributes to be used to represent patterns from a larger set of attributes, often mutually redundant and possibly irrelevant [15]. In this work, a traditional SVM-GA hybrid process [16] is used to perform the feature selection and the classification of suspicious regions into mass and non-mass classes. The GA part of the system is responsible for evolving chromosomes as sets of selected features that are important for classification outcomes. In the
fitness evaluation step, the classification performance of an SVM classifier model is assigned as the fitness of a chromosome using the following steps: the selected features are used to reduce the dimension of the data by removing unimportant features from the dataset; then the reduced dataset is used as training data for an SVM model selection process; finally, the best SVM classifier model is selected based on classification performance, which is assigned as the fitness value of the corresponding chromosome. For a chromosome to survive, it must select relevant features that are essential to SVM classification performance. Hence, at the end of the evolution, the most important set of features is selected.
2.4 Validation of the Classification Methods
In image processing problems and recognition of patterns related to the medical area, we usually measure the performance of a methodology by calculating some statistics about the results obtained in tests [17]. Given a sample with positive and negative cases of a certain disease, the results of the classification tests can be divided into four groups: True Positives (TP) – number of cases correctly classified as positive; False Negatives (FN) – number of cases erroneously classified as negative; True Negatives (TN) – number of cases correctly classified as negative; and False Positives (FP) – number of cases erroneously classified as positive. From the values of each group several performance indicators can be calculated. In this work we used Accuracy (A), Sensitivity (Se), Specificity (Sp) and the F-Measure (Fm). Accuracy measures the total percentage of cases correctly classified (Equation 5). Sensitivity measures the performance of the classification in relation to the positive cases (Equation 6). Specificity measures the performance of the classification in relation to the negative cases (Equation 7). The F-Measure calculates the balance between sensitivity and specificity of the classification, privileging those classifications that present a good balance between the FP and FN cases (Equation 8).
A = \frac{TP + TN}{TP + TN + FP + FN}    (5)

Se = \frac{TP}{TP + FN}    (6)

Sp = \frac{TN}{TN + FP}    (7)

Fm = \frac{2 \times (Se \times Sp)}{Se + Sp}    (8)
Besides these four indicators, the average number of False Positives per image (FPi) and the average number of False Negatives per image (FNi) were also used.
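For completeness, a small Python sketch of Equations 5–8 is shown below (straightforward, with hypothetical counts).

```python
# Equations 5-8 computed from the four outcome counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)                                  # Eq. 5
    sensitivity = tp / (tp + fn)                                                # Eq. 6
    specificity = tn / (tn + fp)                                                # Eq. 7
    f_measure = 2 * (sensitivity * specificity) / (sensitivity + specificity)   # Eq. 8
    return accuracy, sensitivity, specificity, f_measure

print(evaluate(tp=17, tn=47, fp=12, fn=5))  # hypothetical counts
```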
3 Proposed Method
The methodology proposed in this paper is composed of four main steps: pre-processing, segmentation of mass candidates, extraction of features and classification of the candidates, as shown in the scheme of Fig. 1.
Fig. 1. Methodology steps
The pre-processing step improves the efficiency of the next steps. The segmentation of candidates has the objective of identifying the regions of interest, so that the next steps work only with relevant regions. The next step represents the segmented regions by a group of features. Finally, the candidate classification step selects the most representative features to train a classifier capable of distinguishing the regions that correspond to masses from the ones that correspond to non-masses. To develop and evaluate the methodology, a public mammographic image database was used, the Digital Database for Screening Mammography (DDSM) [18].
3.1 Pre-processing
Many of the images in DDSM present noise and other elements that can interfere with the processing to be done. The objective of this step is to remove these unwanted objects and improve image quality. Fig. 2 shows the elements normally present in a mammography image.
[Figure 2 (annotated mammogram): breast tissue, mass, identification label and background.]
Fig. 2. Elements of a mammography image
The procedure for removing the background and the identification marks uses K-Means [19], an unsupervised clustering algorithm, to group pixels into two classes (k=2) according to their intensities. The clustering results in one class containing the darker background pixels and another containing the brighter pixels corresponding to the breast and identification marks, as seen in Fig. 3.
Fig. 3. Image clustered in two classes with K-Means
As the objective is to leave only the breast in the resulting image, the class containing the less intense pixels is discarded. Next, a traditional region growing algorithm [20] is used in the remaining class to distinguish the present objects and choose the one that presents the largest area, in pixels. After removing unwanted elements, the breast image still passes through a contrast improvement process. This process aims to raise the visual discrimination among the objects present in the image and can be done in several ways. In this paper a basic linear contrast improvement was used [20]. To end this step, the image is reduced to 1/3 of its original size and has a 50-pixel border cut at height and width. This is done to reduce the time spent by the next steps. Fig. 4 shows the final result of the operations done in pre-processing (a sketch of these operations follows Fig. 4).
Fig. 4. Image resulting from pre-processing
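A minimal sketch of these pre-processing operations is given below, assuming scikit-learn, SciPy and scikit-image (an illustration of the described steps, not the authors' implementation; the contrast stretch and the order of the resize and border cut are simplifications).

```python
# Sketch: K-Means (k=2) background removal, largest-object selection,
# linear contrast stretch, downscale to 1/3 and 50-pixel border crop.
import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage
from skimage.transform import rescale

def preprocess(img):                        # img: 2-D grayscale array
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(img.reshape(-1, 1))
    labels = labels.reshape(img.shape)
    bright = labels == np.argmax([img[labels == k].mean() for k in (0, 1)])

    comp, n = ndimage.label(bright)                         # connected components
    sizes = ndimage.sum(bright, comp, index=range(1, n + 1))
    breast = comp == (1 + int(np.argmax(sizes)))            # keep the largest object

    out = np.where(breast, img, 0).astype(float)
    out = (out - out.min()) / (out.max() - out.min() + 1e-9)  # linear contrast stretch
    out = rescale(out, 1 / 3, anti_aliasing=True)             # reduce to 1/3 of original size
    return out[50:-50, 50:-50]                                # cut a 50-pixel border
```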
3.2 Segmentation of Candidate Regions
This step, composed of two parts, has the objective of identifying suspicious regions in the breast, discarding those with fewer chances of containing a mass. In the first
Fig. 5. Clusters generated by K-Means (k=5)
part, K-Means is used to cluster the breast image pixels into several clusters, as shown in Fig. 5. As, a priori, the ideal number of clusters for representing the natural distribution of pixels in each image is unknown, several values of k are used (k = 5, 6, ..., 10). The resulting structures in every generated cluster are then separated into individual binary images, using the region growing algorithm. This process starts by placing the seed of the region growing algorithm at the first white pixel found in the image; then the region growing is performed and the resulting region is moved to a separate file. The process is repeated until no more white pixels are found. Some examples of isolated regions can be seen in Fig. 6 (a sketch of this procedure is given after Fig. 6). In order to decrease the number of generated candidates, those with dimensions smaller than 30×30 or larger than 350×350 pixels are discarded, because all the masses present in the selected sample are in this interval, according to information contained in DDSM.
Fig. 6. Examples of structures obtained from the region growing algorithm, on cluster images generated by K-Means
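The region-isolation part of this first stage can be sketched roughly as follows (an illustration only, assuming scikit-learn and SciPy; connected-component labeling stands in for the repeated region-growing pass, and the size filter uses the 30×30–350×350 bounds from the text).

```python
# Sketch: cluster the breast pixels for several k, isolate each binary structure
# and keep only candidates whose bounding box lies between 30x30 and 350x350.
import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage

def candidate_regions(breast, k_values=range(5, 11)):
    candidates = []
    flat = breast.reshape(-1, 1)
    for k in k_values:
        clusters = KMeans(n_clusters=k, n_init=10).fit_predict(flat).reshape(breast.shape)
        for c in range(k):
            comp, n = ndimage.label(clusters == c)                  # isolate structures
            for lab, sl in enumerate(ndimage.find_objects(comp), start=1):
                if sl is None:
                    continue
                h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
                if 30 <= h <= 350 and 30 <= w <= 350:               # size filter (DDSM bounds)
                    candidates.append(np.where(comp[sl] == lab, breast[sl], 0))
    return candidates
```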
The second part of the segmentation consists of selecting only structures whose shape minimally resembles a mass. For this, the template matching technique was used [20]. Each structure is scanned by three circular templates. At every overlapping position, the number of pixels having the same value as the corresponding template pixel is counted. If the rate of corresponding pixels is greater than 70%, the matching white pixels of the structure are stored in a resulting image. The objective is to discard empty, long or too twisted objects. Among the images shown in Fig. 6, for example, only objects a and c are selected for the next step. Three template sizes are used to adapt the technique to the scale of the masses. Preliminary tests helped to identify the sizes most adequate to the scale of the masses in the sample: 48×48, 95×95 and 158×158.
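The circular template test can be sketched roughly as follows (illustrative only, assuming NumPy; the 70% threshold and the template sizes follow the text, while the coarse 4-pixel scan step is an added simplification).

```python
# Sketch: score a binary structure against a circular binary template by the
# fraction of matching pixels over the scanned overlap positions.
import numpy as np

def circular_template(size):
    y, x = np.ogrid[:size, :size]
    r = size / 2.0
    return (x - r + 0.5) ** 2 + (y - r + 0.5) ** 2 <= r ** 2

def matches_template(structure, size, threshold=0.70):
    tpl = circular_template(size)
    best = 0.0
    H, W = structure.shape
    for i in range(0, max(1, H - size + 1), 4):             # coarse scan step of 4 px
        for j in range(0, max(1, W - size + 1), 4):
            window = structure[i:i + size, j:j + size]
            if window.shape != tpl.shape:
                continue
            best = max(best, (window == tpl).mean())        # fraction of agreeing pixels
    return best >= threshold

# e.g. keep a candidate if it matches any of the three template sizes:
# sizes = (48, 95, 158); keep = any(matches_template(cand > 0, s) for s in sizes)
```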
After this step, the accepted candidates are represented in grayscale again, because the feature extraction uses this information to describe the texture of the candidates in the next step.
3.3 Feature Extraction
In this step each candidate has its texture described through the Simpson's Diversity Index (Equation 1). To evaluate the best manner of using the index, two extraction approaches are used: global and by regions. In the global approach all the pixels of the candidate are taken into consideration for calculating the diversity index. In the approach by regions, the diversity is calculated for several areas of the candidate, trying to find different diversity patterns between the regions nearer to the mass border and the regions nearer to the mass center. To do this, since the ideal number of areas is not known in advance, the candidate is divided into n concentric circles with different radii, all centered on the candidate's center of mass (a sketch is given after Fig. 7). The circle with the greatest radius circumscribes the candidate and contains all of its pixels, being equivalent to the global approach. Fig. 7 shows the areas obtained when a candidate is divided into three regions (n=3).
Fig. 7. Candidate pixels taken in three circular regions
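The sketch below illustrates both extraction approaches. Since Equation 1 is defined earlier in the paper, the particular index form D = 1 - sum(p_i^2) over grey-level proportions used here is an assumption, and the authors may use an equivalent variant.

```python
# Sketch of texture description with Simpson's Diversity Index, computed
# globally and over concentric circular regions of a candidate.
import numpy as np

def simpson_index(values: np.ndarray) -> float:
    """Assumed index form: 1 - sum of squared grey-level proportions."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def regional_indices(gray: np.ndarray, mask: np.ndarray, n_regions: int):
    """Simpson index inside each of n concentric circles centred on the candidate."""
    rows, cols = np.nonzero(mask)
    cy, cx = rows.mean(), cols.mean()          # centre of mass of the candidate
    dist = np.hypot(rows - cy, cols - cx)
    r_max = dist.max() + 1e-9                  # circumscribing radius
    indices = []
    for i in range(1, n_regions + 1):
        inside = dist <= r_max * i / n_regions
        indices.append(simpson_index(gray[rows[inside], cols[inside]]))
    return indices                             # the last value equals the global index

# Usage:
# global_value  = simpson_index(gray[mask])            # global approach
# region_values = regional_indices(gray, mask, 3)      # approach by regions, n = 3
```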
In order to identify the ideal number of regions (n) to be used in this approach, Simpson's Diversity Index is calculated for nine different values of n (n = 2, 3, ..., 10).

3.4 Feature Selection and SVM Classification

The objective of this step is to use the information extracted from the candidates to classify them into masses and non-masses. In the global approach the extracted values are submitted directly to the SVM, with no need for any selection scheme, because each candidate has just one measurement describing its texture. However, in the approach by regions, several values of Simpson's Diversity Index are calculated, according to the number n of regions used. So, to select the region values that best discriminate the texture of the candidates, a genetic-based feature selection scheme is used in conjunction with the SVM, in an integrated process, as shown in Fig. 8. At the beginning of the process a population of chromosomes representing the regions that will be used by the SVM is defined randomly. The genes that represent the regions are ordered from the most external region to the most internal. For example, for n = 3 a chromosome 001 indicates that only the most internal region is selected.
Fig. 8. Feature selection and classification scheme
During the genetic search process, the f-measure obtained by the SVM using each chromosome is taken as its fitness value. The fittest chromosomes are used to form a new generation of feature combinations, while the less appropriate ones are discarded. At the end of a predefined maximum number of iterations, the fittest chromosomes are selected as the set that best discriminates the mass and non-mass candidates. The subgroup of candidates used during feature selection is called training data, and the remaining candidates, used to validate the classifier, are called test data. A library for Support Vector Machines, LIBSVM [21], was used for training and testing the SVM classifier. The Radial Basis Function was used as the kernel, with default parameters.
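A compact sketch of such a GA-SVM wrapper follows; scikit-learn replaces LIBSVM here, and the population size, selection and mutation settings are illustrative assumptions, not the authors' configuration. The evaluation pair (X_val, y_val) is assumed to be a held-out part of the training data.

```python
# Sketch of GA-driven region selection wrapped around an RBF SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fitness(chrom, X_fit, y_fit, X_val, y_val):
    """F-measure of an RBF SVM trained on the regions selected by `chrom`."""
    cols = np.flatnonzero(chrom)
    if cols.size == 0:
        return 0.0
    clf = SVC(kernel='rbf').fit(X_fit[:, cols], y_fit)
    return f1_score(y_val, clf.predict(X_val[:, cols]))

def ga_select(X_fit, y_fit, X_val, y_val, n_regions, pop=20, gens=30, p_mut=0.1):
    population = rng.integers(0, 2, size=(pop, n_regions))
    for _ in range(gens):
        scores = np.array([fitness(c, X_fit, y_fit, X_val, y_val) for c in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]  # keep fittest half
        children = []
        while len(children) < pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_regions)                        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_regions) < p_mut                    # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        population = np.vstack([parents, children])
    scores = np.array([fitness(c, X_fit, y_fit, X_val, y_val) for c in population])
    return population[np.argmax(scores)]                            # best chromosome
```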
4 Results

The proposed methodology was tested with 603 sample images, randomly extracted from the DDSM, each one containing only one mass. The candidate selection step segmented a total of 2679 suspicious regions, including all 603 masses and 2076 non-mass regions. During the classification step, six different proportions of the training and test subgroups were used: 30/70, 40/60, 50/50, 60/40, 70/30 and 80/20, where the first number represents the percentage of candidates used in training and the second the percentage used in testing (Tr/Te). The candidates used in each subgroup were randomly selected from the total number of segmented candidates. Two test strategies were performed. The first one used Simpson's Diversity Index extracted with the global approach and the second one used the extraction by regions. Table 1 shows the performance indicators achieved by the global approach for each training and test subgroup, and Table 2 shows the results obtained with the approach by regions. As the approach by regions uses several numbers of regions (n), the first column shows which values of n were used in each test. The chromosome column shows which regions yielded the best performance for the SVM at the end of the genetic search, remembering that the genes representing the regions are ordered from the most external region to the most internal. The values of the performance indicators presented in Table 2 are the averages of the results obtained with the six training and test proportions (Tr/Te) for each value of n.
Table 1. Results achieved with the global approach

Tr/Te     A(%)    Se(%)   Sp(%)   FP/i   FN/i   Fm(%)
30/70     77.09   65.25   80.54   0.67   0.35   73.00
40/60     75.12   67.13   77.45   0.78   0.33   72.09
50/50     76.64   65.89   79.77   0.70   0.34   72.17
60/40     76.89   67.36   79.66   0.70   0.33   71.92
70/30     75.87   58.56   80.90   0.66   0.41   70.14
80/20     74.86   63.64   78.13   0.75   0.36   67.94
Average   76.08   64.64   79.41   0.71   0.35   71.21
Table 2. Average of results achieved with the approach by regions

n    Chromosome    A(%)    Se(%)   Sp(%)   FP/i   FN/i   Fm(%)
2    11            78.84   74.79   80.02   0.69   0.25   77.32
3    111           79.12   76.03   80.02   0.69   0.24   77.97
4    1101          78.94   75.62   79.90   0.69   0.24   77.70
5    10101         78.66   76.86   79.18   0.71   0.23   78.00
6    101110        78.94   76.45   79.66   0.70   0.24   78.02
7    1010011       79.22   76.86   79.90   0.69   0.23   78.35
8    10010010      79.12   77.27   79.66   0.70   0.23   78.45
9    100110010     79.03   77.27   79.54   0.70   0.23   78.39
10   1001100010    79.12   77.27   79.66   0.70   0.23   78.45
General Average    79.00   76.49   79.73   0.70   0.24   78.07
As can be observed in Tables 1 and 2, the approach by regions achieved better performance than the global approach, especially with respect to sensitivity. However, it is important to observe that the approach by regions also covers the values of the global approach through the most external region, which was always selected by the GA-SVM process as a good feature. Considering that the approach by regions adds diversity information about isolated candidate areas with respect to the global approach, and that this information gives the SVM classifier better generalization power in terms of sensitivity, this may indicate that there are local diversity patterns that help the classifier to distinguish the mass texture more adequately. The best overall individual result was achieved by the approach by regions, using 8 regions (n=8), with the 60/40 proportion for Tr/Te. In Fig. 9 and Fig. 10, we can observe some examples of resultant images obtained by this approach. The white circular marks represent the real location of masses according to the DDSM information, while the black areas on the breast tissue are the regions pointed out by the methodology as containing a mass. In the examples of Fig. 9 the masses were correctly identified, while in Fig. 10 non-mass regions were erroneously classified as masses. Analyzing the cases in which the methodology fails, it was observed that a great number of them are related to mammograms presenting dense breast tissue, common
in young women. This kind of tissue presents image characteristics that make the distinction between mass and non-mass regions difficult. Nevertheless, as the performance indicators in Table 2 show, the methodology obtained correct results in almost 80% of the cases.
Fig. 9. Masses successfully detected by methodology
Fig. 10. Non-mass regions classified erroneously as masses
5 Conclusion

This paper presented an automatic methodology to assist in the detection of masses in mammographic images and evaluated the use of Simpson's Diversity Index as a texture measurement for the characterization of masses and non-masses. The presented results show that the methodology has good performance, obtaining 79.12% accuracy, 77.27% sensitivity and 79.66% specificity, with acceptable rates of false positives and false negatives per image, 0.70 and 0.23, respectively. Although these results are not superior to those of similar methodologies, such as [4] and [5], Simpson's Diversity Index has proven to be a very promising measurement for texture characterization, encouraging deeper studies on the use of this kind of statistic in the problem of classifying masses and non-masses with an SVM.
Acknowledgments The authors acknowledge CAPES, CNPq and FAPEMA for financial support.
References
1. Instituto Nacional do Câncer (INCA): Estimativas 2008: Incidência de Câncer no Brasil, http://www.inca.gov.br
2. American Cancer Society (ACS): Learn about breast cancer (2008), http://www.cancer.org
3. Fenton, J.J., Taplin, S.H., Carney, P.A., Abraham, L., Sickles, E.A., D'Orsi, C., Berns, E.A., Cutter, G., Hendrick, R.E., Barlow, W.E., Elmore, J.G.: Influence of Computer-Aided Detection on Performance of Screening Mammography. Breast Diseases: A Year Book Quarterly 18(3), 248 (2007)
4. Braz Junior, G., Silva, E., Paiva, A.C., Silva, A.C., Gattass, M.: Breast Tissues Mammograms Images Classification using Moran's Index, Geary's Coefficient and SVM. In: 14th International Conference on Neural Information Processing (ICONIP 2007), Kitakyushu. LNCS. Springer, Heidelberg (2007)
5. Martins, L., Braz Junior, G., Silva, E.C., Silva, A.C., Paiva, A.C.: Classification of Breast Tissues in Mammogram Images using Ripley's K Function and Support Vector Machine. In: Kamel, M.S., Campilho, A. (eds.) ICIAR 2007. LNCS, vol. 4633, pp. 899–910. Springer, Heidelberg (2007)
6. Timp, S., Varela, C., Karssemeijer, N.: Temporal Change Analysis for Characterization of Mass Lesions in Mammography. IEEE Transactions on Medical Imaging 26(7), 945–953 (2007)
7. Tuceryan, M., Jain, A.K.: Texture Analysis. In: The Handbook of Pattern Recognition and Computer Vision, pp. 207–248. World Scientific Publishing, Singapore (1998)
8. Haralick, R.M., Shanmugan, K., Dinstein, I.: Texture features for image classification. IEEE Transactions on Systems, Man and Cybernetics SMC-3(6), 610–621 (1973)
9. Simpson, E.H.: Measurement of diversity. Nature 163, 688 (1949)
10. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
11. Haykin, S.: Redes Neurais: Princípios e Prática, 2nd edn. Bookman, Porto Alegre (2001)
12. Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
13. Mitchell, M.: An Introduction to Genetic Algorithms. A Bradford Book/MIT Press (1997)
14. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. IEEE Intelligent Systems 13(2), 44–49 (1998)
15. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Proceedings of ICML 1994, 11th International Conference on Machine Learning, New Brunswick, US, pp. 121–129 (1994)
16. Chow, R., Zhong, W., Blackmon, M., Stolz, R., Dowell, M.: An efficient SVM-GA feature selection model for large healthcare databases. In: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (GECCO 2008), pp. 1373–1380 (2008)
17. Bushberg, J.T., Seibert, J.A., Leidholdt Jr., E.M., Boone, J.M.: The Essential Physics of Medical Imaging. Medical Physics 22(8), 1355 (1995)
18. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.P.: The Digital Database for Screening Mammography (DDSM). In: Yaffe, M.J. (ed.) Proceedings of the Fifth International Workshop on Digital Mammography, pp. 212–218. Medical Physics Publishing (2001) ISBN 1-930524-00-5
19. Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999)
20. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice Hall, Upper Saddle River (2008)
21. Chang, C.C., Lin, C.J.: LIBSVM – a library for support vector machines (2003), http://www.csie.ntu.edu.tw/cjlin/libsvm/
Mining Lung Shape from X-Ray Images

Vassili Kovalev, Aliaksandr Prus, and Pavel Vankevich

Biomedical Image Analysis Group, United Institute of Informatics Problems, National Academy of Sciences of Belarus, Room 803, Kirova St., 32-A, 246050 Gomel, Belarus
Department of Radiology, Phthisiological City Hospital, Bekhtereva St., 9, 220026 Minsk, Belarus
Abstract. This paper presents an approach for mining the 2D shape of human lungs from large x-ray image archives of national scale. The images were accumulated in the framework of a compulsory, computerized, country-wide screening programme launched a few years ago and still under development. Three study groups containing about 21, 18 and 39 thousand subjects were created by sub-sampling from a test database resulting from pulmonary x-ray examinations of a total of 188 thousand people. These groups were well balanced by age and gender according to the existing biomedical standards and subsequently used as input data for searching for different kinds of regularities in 2D projective lung shape and size. The approach followed in the paper combines different methods including procrustes shape analysis, Bookstein's baseline shape registration, multi-dimensional scaling, regression models with broken-line relationships, as well as various conventional statistical procedures. As a result, interesting gender- and age-related regularities in lung shape were discovered and documented in the paper.
1 Introduction
Similar to data mining, image mining can be defined as the process of extracting hidden patterns from images. More specifically, image mining deals with the extraction of implicit knowledge, image data relationships or other patterns not explicitly stored in the image database [1], [2], [3], [4]. As more and more image collections are gathered, image mining becomes an increasingly important tool for transforming these visual data into knowledge. Image mining can be applied to visual datasets of virtually any size, and while it can discover hidden patterns, it cannot discover patterns which are not already present in the image collection. Image mining research has borrowed ideas from such well developed areas as computer vision, image analysis, image retrieval, machine learning, and artificial intelligence. The fundamental challenge in image mining is to determine how the low-level pixel representation contained in an image can be effectively and efficiently processed to identify high-level relationships [1], [2], [3]. A typical image mining procedure involves preprocessing, segmentation, feature extraction and discovering significant patterns out of the extracted features, accomplished by
a semantic interpretation and obtaining the final knowledge. Applied to large archives of biomedical images, image mining technology makes it possible to detect natural clusters, to find new diagnostic features [3], [4] and, in some cases, to discover new fundamental biomedical regularities and even laws of nature. For several years now, a nation-wide program of compulsory screening of the adult population, as well as of the diagnosis and treatment of pulmonary diseases, has been under implementation in Belarus. The program is based on a telemedicine system that involves image data such as chest x-ray scans acquired with the help of domestic fully digital scanners of the Pulmoscan-760 family and 3D tomograms obtained by recent multi-spiral CT machines from different world brands (Volume Zoom from Siemens, LightSpeed from General Electric and some others). These pieces of imaging hardware were installed in a number of general public clinics located country-wide, in tubercular prophylactic centers playing the role of regional diagnostic and management centers, as well as in regional cancer dispensaries, which mostly deal with CT tomography. The supporting computer facilities include networked PCs, dedicated servers and recent supercomputers of the SKIF family installed in the National Supercomputer Center. The development of the screening programme has already resulted in the accumulation of databases containing hundreds of thousands of chest x-ray images of different subjects, which are expected to grow up to a size well comparable with the whole adult population of the country. In this paper, we present an approach for mining the 2D projective shape of human lungs from very large x-ray image archives and report some interesting gender- and age-related regularities discovered in the lung shape and size.
2 Materials and Method

2.1 The Lung Segmentation Problem
Lung segmentation, that is, finding the external lung contour, is a much more difficult problem than it appears at first glance. Up to now, the problem of automatic segmentation from chest x-ray images remains largely unsolved, mostly due to the following two groups of reasons. (a) A wide inter-subject variation in lung images, which is caused by the natural anatomical variability of patients with respect to body constitution, differences caused by age and gender, previous surgery, the presence of a cardiostimulator and many other factors. (b) A large diversity in imaging hardware characteristics and imaging modes, including image size (spatial resolution), intensity resolution, x-ray strength, specific properties of the sensor, etc. These reasons, combined with a number of practical issues such as wrong body location in the frame, possible movements and/or inadequate behavior of a patient during the scanning procedure, lack of properly trained supporting personnel in certain regions of the country and so on, involve additional imaging complications that make the automatic lung segmentation problem even more tangled and truly complicated.
A review of the existing lung segmentation methods reported in the literature [5], [6], [7] allows the following conclusion to be drawn: image intensity and geometrical constraints are the major characteristics exploited by all the methods to determine whether specific image regions belong to the left or to the right lung. Other features and algorithms often used for outlining the lungs include detecting the lung borders based on intensity gradients calculated with the help of a suitably-sized sliding window, conventional morphological filtration, classification of image regions using neural nets or other classifiers, employing Markov random fields and so on.
2.2 The Segmentation Procedure
In the framework of this project we have developed and implemented a novel multi-step lung segmentation procedure. Although some auxiliary steps reported in the above literature are utilized, it mostly capitalizes on the following three basic points: using a bounding box covering the lung area, fitting a scalable mask of the lungs, and employing two bunches of rays drawn from the lung centers for detecting the lung borders. The whole segmentation procedure consists of four stages, which are detailed below.

Stage 1. Image normalization. Input x-ray images are re-scaled to internal standard dimensions of about 1000×1500 pixels using bicubic splines to reduce the effect of image size variability, and intensity normalization is performed by equalization of the image histogram. Note that these transformations are used exclusively for the lung segmentation resulting in a lung template and do not affect the original image intensity used at the later diagnostic phase.

Stage 2. Extracting the region of interest. The segmentation process itself begins with an estimation of the position of the patient in the image frame and the detection of an approximate region where the lungs most likely are. This is done with the help of a minimal bounding box drawn so that it contains the whole patient's body. The bounding box is then used for placing and scaling the mask of lungs, which roughly outlines the borders of the lung search area. In addition, being subdivided into the left and right halves, the bounding box is also used for calculating approximate lung centers, substituted here by the centers of gravity of the dark regions situated in the box halves.

Stage 3. Searching for lung edges. The search for lung edges starts by drawing two bunches of rays originating from the prospective lung centers. Each bunch contains 500 rays which are evenly distributed in all directions within 360 degrees. Every ray stops at the point which presumably belongs to the lung boundary along that line. Thus, the problem essentially becomes how to determine the length of each ray. The length of the rays is estimated using an iterative intensity thresholding scheme. We start with relatively low intensity values, which are typical for the lung region, and gradually increase the intensity threshold. At each iteration step we prolong every ray as long as the extended rays have intensity values at their ends lower than the current threshold and all the rays are kept within the lung mask. If so, the threshold is increased by a certain step and the iterative
procedure is continued. Once at least one ray reaches or (more likely) jumps over the mask edge, the ray extension process stops, the procedure returns to the previous ray positions and the lung border is refined in the same manner using a smaller intensity threshold step. Our experiments showed that using a dichotomy search for the suitable intensity threshold and the length of rays reduces the computational expense by about an order of magnitude. Further lung border improvements are carried out by dividing the rays into separate, relatively independent sub-bunches according to the type of boundary each ray is directed to. In particular, the above initial approximation of the lung outline also results in information about the position of the upper and lower points of the lung, which makes possible an accurate division of the boundary into the internal (mediastinum and diaphragm) and external (ribcage) edges. The location of these segments of the bordering contour is then refined using the same procedure applied to the corresponding sub-bunches of rays. Execution of all the above steps results in an estimated shape of the lung. After the first estimate of the lung contour is calculated, the position of the bounding box can be refined on the basis of the obtained outlines, and all the steps of stage three are repeated in order to improve the segmentation accuracy. In certain cases the repetition of this procedure may correct even significant errors (up to 30% of the box area) in the initial estimation of the bounding box. As a rule, a reliable state of the bounding box can be achieved in 2-3 iterations.

Stage 4. Smoothing the lung contours. In the final, fourth stage of the procedure a kind of filtration, namely smoothing of the ray lengths (a one-dimensional sliding window with an angular size of 7 rays, i.e., about 5 degrees), is performed and a final examination of the correctness of the segmentation results is accomplished. This helps to avoid passing wrong lung templates to the later image analysis phase. Lung segmentation experiments performed on a database containing 142000 chest x-ray images demonstrated that in 96.31% of cases the segmentation quality was practically acceptable (see the examples provided in Fig. 1). In about 4200 cases (2.96%) the output results were qualified as suspicious by an automatic exit-check procedure and were automatically directed to an additional interactive examination. The remaining 0.73% of the results were obviously wrong.
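A much simplified sketch of the ray-based boundary search of stage 3 is shown below; the threshold schedule, the per-ray stopping rule and the smoothing window size are assumptions that replace the bunch-wise dichotomy refinement and sub-bunch handling described above.

```python
# Simplified sketch of stage 3: rays grown from an estimated lung centre under
# an increasing intensity threshold, constrained to stay inside a lung mask.
import numpy as np

def cast_rays(image, mask, centre, n_rays=500, thresholds=range(60, 200, 10)):
    """Return boundary points (y, x), one per ray direction."""
    h, w = image.shape
    angles = np.linspace(0.0, 2 * np.pi, n_rays, endpoint=False)
    lengths = np.ones(n_rays)
    for thr in thresholds:                      # gradually raise the threshold
        for i, a in enumerate(angles):
            r = lengths[i]
            while True:                         # extend ray i while it stays dark
                y = int(round(centre[0] + (r + 1) * np.sin(a)))
                x = int(round(centre[1] + (r + 1) * np.cos(a)))
                if not (0 <= y < h and 0 <= x < w) or not mask[y, x]:
                    break                       # would leave the lung mask
                if image[y, x] >= thr:
                    break                       # end intensity above threshold
                r += 1
            lengths[i] = r
    # Smooth ray lengths with a short circular moving average (about 7 rays).
    kernel = np.ones(7) / 7.0
    padded = np.r_[lengths[-3:], lengths, lengths[:3]]
    lengths = np.convolve(padded, kernel, 'valid')
    return np.c_[centre[0] + lengths * np.sin(angles),
                 centre[1] + lengths * np.cos(angles)]
```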
2.3 Original Images, Shape Representation and Study Groups
A test image database containing the results of pulmonary x-ray examinations of the chest of more than 188 thousand healthy subjects aged 20 to 84 years was used as the input image data repository. The subjects' age was measured in complete years with a precision of one year. All the original images were segmented using the lung segmentation procedure described in the previous section. The projective shape of each lung of every subject was represented by 500 landmark points located at the ends of the corresponding rays. Along with this polar coordinate system, the equivalent representation in Cartesian coordinates was also used for lung shape description where necessary.
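For illustration, a small helper converting the 500 polar ray lengths into Cartesian landmark coordinates might look as follows; the angular convention (equally spaced, counter-clockwise, measured from the lung centre) is an assumption.

```python
# Helper: polar lung-shape signature (500 ray lengths) to Cartesian landmarks.
import numpy as np

def rays_to_landmarks(lengths, centre):
    """lengths: array of ray lengths; centre: (y, x) of the lung centre."""
    angles = np.linspace(0.0, 2 * np.pi, len(lengths), endpoint=False)
    y = centre[0] + np.asarray(lengths) * np.sin(angles)
    x = centre[1] + np.asarray(lengths) * np.cos(angles)
    return np.column_stack([x, y])              # k x 2 landmark matrix
```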
Fig. 1. Lung segmentation examples for subjects from different gender and age groups
The first study group of images, called G1, was formed for mining lung shape distinctions associated with age in different age categories or age "classes". It consisted of three sub-groups conditionally named young (20-30 years), mid-aged (40-50 years) and aged (60-70 years) subjects. Each sub-group included images of 6930 subjects (3465 pairs of male-female subjects of the same age, 315 males and 315 females per age year), 20790 images in total in group G1. The second group of images, named G2, was created for mining both age- and gender-related lung shape regularities. It covered the wide life span between 20 and 80 years for both genders. This age range corresponds to 60 age intervals from 20 to 79 complete years. Similar to the above group G1, in order to achieve the perfect gender and age matching heavily favored by the existing statistical standards, group G2 was formed using pairs of male-female
subjects of the same age. A total of 9000 male-female pairs were collected from the image repository, 150 pairs for each year. Thus, group G2 consisted of 18000 x-ray images of the chest of 18000 different subjects aged 20 to 79 years, 300 images per age year (150 males plus 150 females), 9000 males and 9000 females in total. Finally, since some interesting regularities were discovered in the females of group G2 concerning the dependence of lung shape and lung area on age, an additional image group G3 was created explicitly from female subjects aged 20 to 57 years, 1016 persons per age year, 38608 females in total. It should be noted that this is the first work that makes use of the image data described above.
2.4 Integral Shape Features
A number of commonly recognized shape features were calculated for the left and right lungs of every subject. They include the lung area, the dimensions of the bounding rectangle, the boundary length, the compactness defined in the usual way, i.e., as (boundary length)^2/area, as well as the major axis, minor axis and eccentricity of the ellipse with equivalent area. The ellipses were fitted to the lung contours using a general linear model and the statistical confidence ellipse tools. The ellipse eccentricity feature \varepsilon < 1 was computed from the major and minor half-axes a and b as \varepsilon = \sqrt{a^2 - b^2}/a. In addition, the lung contour itself, represented by the vector of lengths of the 500 rays ordered counter-clockwise, naturally served as a polar signature of the lung shape, allowing the computation of such standard features as statistical moments.
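A sketch of how such integral features can be computed from an ordered lung contour is given below; the shoelace area, the polygonal perimeter and the moment-based ellipse fit are stand-ins for the general linear model and confidence-ellipse tools mentioned above, assumed here for illustration only.

```python
# Sketch of the integral shape features of a closed, ordered lung contour.
import numpy as np

def shape_features(landmarks: np.ndarray):
    """landmarks: k x 2 array of boundary points ordered along the contour."""
    x, y = landmarks[:, 0], landmarks[:, 1]
    # Shoelace formula for the enclosed (projective) area.
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perim = np.sum(np.hypot(np.diff(x, append=x[0]), np.diff(y, append=y[0])))
    compactness = perim ** 2 / area             # (boundary length)^2 / area
    box = (x.max() - x.min(), y.max() - y.min())
    # Half-axes a >= b of an ellipse matching the contour's second moments;
    # the scale factor cancels in the eccentricity.
    evals = np.sort(np.linalg.eigvalsh(np.cov(x, y)))[::-1]
    a, b = 2.0 * np.sqrt(evals)
    eccentricity = np.sqrt(a ** 2 - b ** 2) / a
    return dict(area=area, box=box, boundary=perim,
                compactness=compactness, eccentricity=eccentricity)
```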
2.5 Shape Analysis Methods
Statistical shape analysis is the geometrical analysis of a set of shapes in which statistics are measured to describe the geometrical properties of similar shapes or of different groups, for instance, the difference between male and female gorilla skull shapes, between normal and pathological bone shapes, etc. Statistical shape analysis involves methods for the geometrical study of objects in which location, rotation and scale information can be removed. The key tasks of shape analysis are to obtain a measure of distance between two shapes, to estimate average shapes from a sample and to estimate shape variability in a sample [8]. In this work we have chosen the 2D version of procrustes shape analysis described in [8] and implemented in the form of the shapes software package within R [9], a language and environment for statistical computing. Procrustes analysis is well known as one of the most advanced shape analysis methods; it considers objects made up of a finite number k of points in N dimensions, which are called landmark points. The shape of an object is considered as a member of an equivalence class formed by removing the translational, rotational and scaling components. Specifically, the following basic algorithms of procrustes analysis were used: calculating the Riemannian distance between two shapes, Bookstein's baseline shape registration [10], and testing for mean shape differences between groups of lung shapes with the help of Hotelling's T^2 and Goodall's F tests.
These tests were developed for examining differences in mean shape between two independent populations and involve complex eigenanalysis and iterative generalized procrustes analysis for two dimensions. In addition, when studying the age-related lung shape changes, a regression model with broken-line relationships suggested by Muggeo [11] was used. The method is aimed at estimating linear and generalized linear models having one or more segmented relationships in the linear predictor. Estimates of the slopes and of the possibly multiple breakpoints are provided. In this work the method was used for detecting critical age points where the trend changes significantly. For the visualization of a set of M shapes in feature space, an M × M matrix of mutual Riemannian distances was computed and a 2D shape scatterplot was created and displayed. To accomplish this, the multidimensional scaling (MDS) method [12] was utilized for reducing the feature space dimensionality down to two dimensions. Note that multidimensional scaling provides an approximate solution, which is suitable for visual examination of object scattering only.
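The following sketch illustrates the core of this pipeline for planar shapes: the Riemannian (procrustes) distance in its complex-number formulation and a 2D multidimensional scaling embedding of the resulting distance matrix. It is only an illustration of the idea under that formulation; the authors' analysis relied on the shapes package in R.

```python
# Sketch: Riemannian shape distance for planar landmark configurations and
# a 2-D MDS embedding of the pairwise distance matrix.
import numpy as np
from sklearn.manifold import MDS

def preshape(landmarks: np.ndarray) -> np.ndarray:
    """Centre and scale a k x 2 configuration, returned as a complex k-vector."""
    z = landmarks[:, 0] + 1j * landmarks[:, 1]
    z = z - z.mean()
    return z / np.linalg.norm(z)

def riemannian_distance(l1: np.ndarray, l2: np.ndarray) -> float:
    """Distance invariant to translation, scale and rotation (Kendall shape space)."""
    z1, z2 = preshape(l1), preshape(l2)
    return float(np.arccos(np.clip(abs(np.vdot(z1, z2)), 0.0, 1.0)))

def shape_scatterplot(shapes):
    """2-D MDS embedding of the M x M matrix of mutual Riemannian distances."""
    m = len(shapes)
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            d[i, j] = d[j, i] = riemannian_distance(shapes[i], shapes[j])
    return MDS(n_components=2, dissimilarity='precomputed',
               random_state=0).fit_transform(d)
```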
3 Results

3.1 Lung Size in Different Age Categories
The lung shape distinctions caused by age were studied with the help of group G1, consisting of three sub-groups whose members were conditionally categorized as young (20-30), mid-aged (40-50) and aged (60-70) subjects. We start with a very simple but, from the biomedical point of view, very important lung feature: the lung size. The lung size is measured by its projective x-ray area and by the axes of the fitted ellipse. Changes of lung area were assessed by way of a pair-wise comparison of the three age sub-groups using a regular two-tailed t-test. The results are summarized in Fig. 2. From the data presented in the figure, the following two important facts become immediately evident: (a) The lung area declines with age in a non-linear manner. In particular, the reduction is steeper when moving from the young to the mid-aged group compared to the considerably less prominent reduction over the same 10 years from mid-aged to aged individuals. These observations are confirmed by the comparable significance values of the group differences reported by the t-test with the same degree of freedom df=6928 in all occasions (see the table on the right of Fig. 2). The significance scores of lung reduction between the moderate 40-50 and elderly 60-70 years were nearly twice as low as between the young 20-30 and mature 40-50 periods of life (t=16.2 and t=15.7 against t=27.4 and t=31.1 for males, and t=26.1 and t=25.7 against t=44.1 and t=49.4 for females). (b) Despite the fact that the age-related decline takes place for both lungs and both genders, male and female persons are affected differently. Namely, the rate of lung area reduction is always greater in females compared to male subjects. Quantitatively, the joint projective area of both lungs changes from a mean value of 544 cm2 computed over the group of young males aged 20-30 years down to 472 cm2 in males aged 40-50 years (a reduction rate of 13.2%). In
Fig. 2. Change of mean lung area for subjects from the young (20-30 years), mid-aged (40-50 years) and aged (60-70 years) groups (left plot) and its statistical significance (table on the right). Each of the three age groups consists of 6930 subjects, including 3465 males and 3465 females, 630 subjects per age year.
female subjects it reduces from 469 cm2 down to 373 cm2, with a noticeably greater rate of 20.5% over the same 10 years. When comparing the groups of mid-aged subjects of 40-50 years with those aged 60-70 years, the corresponding lung reduction values are 472 versus 431 cm2 for males (8.7%) and 372 versus 320 cm2 (13.9%) for females. The observed regularity of a more rapid lung area reduction in females is further confirmed by the corresponding statistical significance scores provided in Fig. 2.
3.2 Shape of Lung Ellipses in Different Periods of Life
Statistical assessment of the differences between ellipses fitted to the lungs of subjects belonging to different age groups revealed a somewhat more complicated pattern of age-related changes compared to the lung areas. Although the sizes of the major and minor ellipse axes generally behave in a way similar to the lung area, i.e., they decrease with age, the reduction rate varied significantly, reflecting corresponding variations in the global shape of the lungs. Since the eccentricity feature captures the mutual relationship between the two axes and describes the global elongated shape of the lungs in relative units, it is worth considering here the eccentricity instead of the raw axis values. As can easily be seen from Fig. 3, the eccentricity exhibits the non-linear character of age-related changes even more sharply than the lung area. This is especially true for the left lung, whose eccentricity drops dramatically from the young (20-30) to the mid-aged (40-50) period of life and remains nearly unchanged over the second gap, from 40-50 to 60-70 years. A similar trend can be observed for the right lung too, but with considerably lower confidence. In fact, the mean eccentricity values even grow slightly after 40-50 years, but the growth rate is close to the
Fig. 3. Significance of the eccentricity (oblongness) differences of lungs for young (20-30 years), mid-aged (40-50 years) and aged (60-70 years) subjects (top two panels) and examples of mean lung shapes and their dissimilarity (bottom two panels)
border of statistical significance (at this point it is worth remembering that the degree of freedom here is as high as df=6928 and the commonly accepted minimal threshold for statistical significance is p < 0.05, which approximately corresponds to t > 2.0). The significance rates supplied with the table depicted in the top-right quarter of Fig. 3, as well as the pictures of mean group shapes accompanied by their dissimilarity values (see the bottom two panels of Fig. 3), provide further quantitative evidence for the discovered regularity. In everyday words, all these numbers support the conclusion that during ageing the lung shape tends to "round up" and that this process is largely complete by the age of about 50 years. Such behavior is more prominent in the shape of the left lung. Investigation of other global shape features of the lung images of group G1 did not add anything interesting to the results reported above. In particular, the shape compactness feature, which admittedly might be quite useful for distinguishing image objects with sharp and rounded edges in a number of computer vision problems, demonstrated some inconsistent behavior here. It was therefore found to be useless for capturing any distinct trends in the rather homogeneous sets of lung shapes we are dealing with.
3.3 Lung Shape Distinctions in Different Age Categories
Now we arrive at the lung shape mining stage, which exploits "feature-free" methods of comparing whole shapes. As described
Fig. 4. Significance of lung shape differences for subjects from young (20-30 years), mid-aged (40-50 years) and aged (60-70 years) groups assessed using Hotelling’s T 2 test (above line) and Goodall’s F test (under line) for mean shape differences
in the previous section, we capitalize on procrustes shape analysis, which makes use of efficient shape comparison via the Riemannian distance, utilizes 2D shape registration algorithms and employs specific statistical tests for examining shape differences. To this end, we performed, in a pair-wise manner, both Hotelling's T^2 test and Goodall's F test to assess the significance of lung shape differences between pairs of age sub-groups of group G1. The results are represented in Fig. 4 in the form of a plot of Hotelling's T^2 statistics (left panel) and in the less vivid but more precise form of a table of statistical significance expressed by both Hotelling's T^2 and Goodall's F values. In all cases the observed shape differences were found to be highly significant, that is, the p-value was much less than 0.05. Also, similar to the previous experiments, the results summarized in Fig. 4 suggest that the largest portion of lung shape changes occurred between the young (20-30) and mid-aged (40-50) periods, and that the magnitude of these changes in female subjects is always greater than in males (see the specific values for more details). The greater age dependence of lung shape in females can also be noticed in the example of the scattering of young and aged subjects given in Fig. 5.
3.4 Gender-Related Differences in Lung Shape
In the previous sections we concentrated mostly on assessing the influence of the age factor on lung shape. Nevertheless, when studying the effect of age we performed all the evaluations for male and female subjects separately and therefore already contributed towards disclosing some of the gender-related regularities. In this section the gender-related differences are investigated further, based on the well-balanced sample of 9000 males and 9000 females of group G2 densely and evenly covering a good portion of the life span from 20 to 79 years inclusive. It is obvious that male and female lungs differ in size. When comparing lung shapes visually or using quantitative features sensitive to size,
Fig. 5. Example of scattering of young and aged subjects in lung shape space. Shape differences were measured using Riemannian distances. Resultant scatterplot was reduced to 2D by multi-dimensional scaling method therefore the axes are conditional.
these differences may confound the results and cause various misinterpretations. Thus, it is worth starting with an estimation of the size-induced differences. This may be done, for instance, by looking at the male/female lung shape differences with and without scaling. Fig. 6 presents mean lung shapes for male and female subjects of group G2 computed by straightforward averaging (top left panel) and with the help of Bookstein's baseline registration of all 18000 shapes (top right panel). As can be seen from the figure, the baseline registration makes the existing gender-related shape differences very explicit. In particular, it can be noticed that the greatest cross section of the lower lung aperture is wider in males than in females. This is more prominent in the right lung, which also appears to be slightly shorter and wider in males, reflecting the known fact of the more brachymorphic structure of the male chest. In order to reliably estimate the significance of lung shape differences between male and female subjects at different ages, the whole study period of 20-79 years was subdivided into tetrads and Hotelling's test for shape differences was performed on every subset of 1200 lung images (600 males plus 600 females) for each of the 15 tetrads. The resultant significance scores T^2 plotted at the bottom of Fig. 6 suggest that the gender-related shape differences remain highly significant (p-values again much less than 0.05) for every tetrad of the examined period of life. It can be noticed that the shape of the right lung appears to be more distinct between male and female subjects than the shape of the left one. However, this regularity holds true until the age of 60-64 years and switches to the opposite afterwards. Looking at the shape of the significance curve depicted in Fig. 6, one can also note that the significance of the gender-related differences tends to remain relatively low in the young and, conversely, in the elderly periods of life, with more
Fig. 6. Gender-related differences in lung size and lung shape as revealed by comparison of mean (top left) and registered (top right) lung contours of 18000 subjects along with the Hotelling’s T 2 significance score of gender-related lung shape differences by life tetrads (plot underneath)
high values in between and a confident bump of yet unknown nature within 52-64 years.
3.5 Lung Area Changes during Ageing
Our previous experience (e.g., [13]) suggests that new knowledge on the ageing process of the human body and its functional systems always attracts a lot of attention from both the scientific and the general public domains. On many occasions, changes in an organ's size prove to be a very important sign of pathology and/or age-related decline. Thus, in the context of this image mining study it is very interesting to find out how exactly the lung size changes during normal ageing and whether these changes proceed synchronously for both genders or not. A simple correlation analysis shows that the lung size significantly correlates with age over the 18000 subjects of group G2, with correlation coefficients k_L = -0.43 for the left lung and k_R = -0.45 for the right one. When considering the genders separately, the correlations were k_L^m = -0.39 and k_R^m = -0.40 for
Fig. 7. Changes of lung area with normal ageing in 9000 male (left plot) and 9000 female (right plot) subjects. Statistical data are presented by age tetrads for reliability.
the left and right lungs of males versus k_L^f = -0.56 and k_R^f = -0.58 in females. It is easy to see from these numbers that the correlation of lung projective area with age is greater in females, which confirms once more the fact of more significant age-related changes characteristic of female subjects, discovered earlier in this work with the help of study group G1. After these general observations, let us take a closer look at the age-related dependence of lung area and its diversity by gender. For this purpose, the whole study period from 20 to 79 complete years was first subdivided into tetrads and statistical box-and-whiskers graphs were plotted separately for the areas of the left and right lungs over the resultant 15 tetrads (see the right lung areas of males and females illustrated in Fig. 7 as an example). As can be qualitatively concluded from the figure, the female right lung clearly demonstrates a specific, non-linear behavior during ageing, while the male lung area stays approximately linear. (It should be noted that corresponding measurements confirmed that both the left and right lungs behave in much the same way, which is why only the right lung plots are presented in Fig. 7. Such left-right synchronism is not surprising given that the correlation of left/right lung area is as high as 0.93 in males and 0.94 in females of group G2.) Finally, for detecting "critical" age points where the trend in lung size reduction (i.e., the slope of the regression line) changes significantly, we employed the regression model with broken-line relationships recently suggested by Muggeo [11]. Three series of experiments on the 9000 female subjects were subsequently performed using the projective area of the left lung, the right lung and the total lung area (i.e., the sum of both lungs) as the regression response and age as the predictor. Two age points, 34 and 50 years, inspired by the plot in Fig. 7, were set as the initial guess of the broken-line locations required by the method. As a result, the following values of estimated breakpoints and significance scores were obtained.
(a) Left lung: 33.3 years (t = -11.8) and 47.8 years (t = 8.6). The "improved" location of these points computed under the condition of a null left slope gave 33.2 and 47.8 years, respectively.
(b) Right lung: 33.1 years (t = -11.2) and 47.5 years (t = 8.9). With the hypothesized null left slope: 33.1 and 47.5 years, respectively.
(c) Both lungs: 33.2 years (t = -11.7) and 47.6 years (t = 9.0). With the hypothesized null left slope: the same 33.2 and 47.6 years.
Additional experiments performed with initial guess values varying in a reasonably wide age range of about 6-7 years demonstrated good reliability of the output breakpoint estimates to which the method always converged. Detailed analysis of the linear slopes, including the bordering slope values of the 95% confidence intervals, suggests that the left shoulder of the piece-wise linear regression, before 33 years, may be considered a plateau (no confident positive or negative slope), whereas the right one (after 48 years) goes slightly upwards. No significant breakpoints were found in the male lung size regressions. Since the above regularities discovered on study group G2 were found to be very interesting, they were examined further on group G3, consisting of 38608 females. The resultant age points obtained with the help of G3 were similar, namely: 34.2 and 49.1 years (left lung), 35.4 and 51.4 years (right lung) and 34.9 and 49.6 years (both together). Contrary to G2, the right shoulder beginning around 50 years was found to be rather flat. Thus, summarizing the results, we may conclude that the normal ageing process of adult female subjects is accompanied by a decline of the lung projective area which is uneven across the adult life span. The temporal pattern of lung size reduction can be roughly described as "plateau-slope-plateau". The accelerated lung size reduction starts around 33-35 years and lasts until approximately 48-50 years, where the decline starts to slow down. This regularity was not confirmed for male subjects.
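For illustration, a two-breakpoint piecewise-linear fit in the spirit of Muggeo's segmented regression can be sketched as follows; this is not the R implementation used in the study, and the least-squares formulation and starting values are assumptions. The initial guesses of 34 and 50 years follow the text.

```python
# Sketch: two-breakpoint "broken-line" regression of lung area on age,
# fitted by non-linear least squares.
import numpy as np
from scipy.optimize import curve_fit

def broken_line(age, b0, s1, s2, s3, bp1, bp2):
    """Continuous piecewise-linear model with breakpoints bp1 < bp2."""
    return (b0 + s1 * age
            + (s2 - s1) * np.maximum(age - bp1, 0.0)
            + (s3 - s2) * np.maximum(age - bp2, 0.0))

def fit_breakpoints(age, area, guesses=(34.0, 50.0)):
    """Return the two estimated breakpoint ages; initial guesses matter."""
    p0 = [area.mean(), 0.0, -1.0, 0.0, *guesses]
    params, _ = curve_fit(broken_line, age, area, p0=p0, maxfev=20000)
    return params[-2:]

# Usage (hypothetical arrays of per-subject age and right-lung area):
# bp_left, bp_right = fit_breakpoints(age_years, right_lung_area_cm2)
```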
4 Conclusions
1. The image mining approach reported in this study makes it possible to manage large collections of x-ray data, to reliably extract the projective lung shape, and to run 2D shape mining procedures for discovering new regularities from large image databases of national scale.
2. It was found that the lung projective area declines with age in a non-linear way. The significance scores of lung reduction from the moderate 40-50 to the elderly 60-70 years were nearly twice as low as from the young 20-30 to the mature 40-50 periods of life. The rate of lung area reduction is always greater in females than in male subjects. The temporal pattern of lung size reduction in females can be roughly described as "plateau-slope-plateau". The accelerated decline starts around 33-35 years and lasts until 48-50 years, where the process begins to slow down.
3. The procrustes analysis suggests that, similar to the size, the largest portion of lung shape changes occurs between the young (20-30) and mid-aged (40-50) periods, and that the magnitude of these changes in female subjects is always greater than in males. During ageing, the lung shape tends to "round up" (the eccentricity
of the fitted ellipses decreases). This process is largely complete by the age of about 50 years. Such behavior is more prominent in the shape of the left lung. It is anticipated that future work will be concerned with discovering reliable markers of biological age using both shape and intensity information, as well as with searching for new diagnostic features based on lung texture mining.

Acknowledgements. This work was funded by the research programme on distributed GRID computing SKIF-GRID under contract No 4C − Γ /07-225, the European FP-7 project BalticGrid-II and partly by the ISTC grant B-1489.
References
1. Hsu, W., Lee, M., Zhang, J.: Image mining: Trends and developments. Journal of Intelligent Information Systems 19(1), 7–23 (2002)
2. Malik, H.H.: Efficient Algorithms for Clustering and Classifying High Dimensional Text and Discretized Data using Interesting Patterns. PhD Thesis, School of Arts and Sciences, Columbia University, 176 p. (2008)
3. Perner, P.: Image mining: Issues, framework, a generic tool and its application to medical-image diagnosis. Engineering Applications of Artificial Intelligence 15(2), 205–216 (2002)
4. Perner, P., Perner, H., Müller, B.: Mining knowledge for HEp-2 cell image classification. Artificial Intelligence in Medicine 26(1-2), 161–173 (2002)
5. Gados, D., Horvath, G.: Using heuristics for the lung fields segmentation in chest radiographs. IFMBE Proceedings 16(2), 802–805 (2007)
6. Chen, S., Cao, L., Liu, J., Tang, X.: Automatic segmentation of lung fields from radiographic images of SARS patients using a new graph cuts algorithm. In: Proceedings of the 18th International Conference on Pattern Recognition, ICPR 2006, Hong Kong, vol. 1, pp. 271–274. IEEE, Los Alamitos (2006)
7. Ginneken, B., Romeny, B.: Automatic segmentation of lung fields in chest radiographs. Medical Physics 27, 2445–2455 (2000)
8. Dryden, I., Mardia, K.: Statistical Shape Analysis, 1st edn., 376 p. John Wiley & Sons, New York (1998)
9. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2008) ISBN 3-900051-07-0
10. Bookstein, F.: Size and shape spaces for landmark data in two dimensions (with discussion). Statistical Science 1, 181–242 (1986)
11. Muggeo, V.: Estimating regression models with unknown break-points. Statistics in Medicine 22, 3055–3071 (2003)
12. Cox, T., Cox, M.: Multidimensional scaling, 2nd edn., 328 p. Chapman and Hall, Boca Raton (2000)
13. Kovalev, V.A., Kruggel, F., von Cramon, D.Y.: Gender and age effects in structural brain asymmetry as measured by MRI texture analysis. NeuroImage 19, 895–905 (2003)
A Wavelet-Based Method for Detecting Seismic Anomalies in Remote Sensing Satellite Data

Pan Xiong, Yaxin Bi, and Xuhui Shen

Institute of Earthquake Science, China Earthquake Administration, Beijing, 100036, China
[email protected], [email protected]
School of Computing and Mathematics, University of Ulster, Co. Antrim, BT37 0QB, United Kingdom
[email protected]
Abstract. In this paper we present a comparative analysis of two types of remote sensing satellite data using wavelet-based data mining techniques. The results of the analysis reveal that anomalous variations related to the earthquakes exist. The methods studied in this work include wavelet transformations and spatial/temporal continuity analysis of wavelet maxima. These methods have been used to analyze the singularities of seismic anomalies in remote sensing satellite data associated with the two earthquakes of Wenchuan and Puer, which recently occurred in China.
1 Introduction

By studying remote sensing satellite data, researchers have found various abnormal activities in the earth, atmosphere and ionosphere prior to large earthquakes, which are reflected in anomalous thermal infrared (TIR) signals [1], outgoing longwave radiation (OLR) [5], surface latent heat flux (SLHF) [2,3,8] and anomalous variations of the total electron content (TEC) [6,7] prior to the earthquake events. The latest advancements in lithosphere-atmosphere-ionosphere models provide a possible explanation of the origin of these phenomena [6,7], and also permit us to explore possible new studies on the spatial and temporal variability of remote sensing data before and during major earthquakes. Several studies have recently been carried out to analyze thermal infrared anomalies and ionospheric anomalies appearing in the area of earthquake preparation a few days before the seismic shock [9,10,19,20,21]. These studies analytically compare a single image of pre (vs. post) earthquake satellite TIR imagery [10]; analyze the pixel temperature variance from long term scene threshold temperatures to identify "hot" areas [11]; perform a multispectral thermal infrared component analysis on the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Terra and Aqua satellites by using Land Surface Temperature (LST) [4]; assess the anomalous SLHF peaks a few days prior to the main earthquake event in the case of coastal earthquakes [8]; study OLR data and discover anomalous variations prior to a number of medium to large earthquakes [5]; and analyze electromagnetic emission, the variations of plasma parameters and the precipitation of energetic particles
[22] along with ionospheric perturbations before the Puer earthquake by using DEMETER (Detection of Electromagnetic Emissions Transmitted from Earthquake Regions) satellite data [19]. Moreover, a wavelet-based method for studying the ionospheric perturbations associated with earthquakes recorded by the DEMETER satellite was carried out by Benghanem et al.; their results give indications of perturbations of the electron density and of external phenomena such as solar activity associated with earthquakes over active fault zones [24]. OLR is the thermal radiation flux emerging from the top of the atmosphere and is connected with the earth-atmosphere system. It is often affected by clouds and surface temperature. Because OLR results from infrared-band telemetry, OLR data are not only continuous, stable and commensurable, but also sensitive to the sea surface layer and near-ground temperature changes. OLR can therefore be regarded as an ideal means of monitoring the symptoms of natural disasters linked to phenomena of "hot" origin, such as earthquakes. DEMETER was designed to observe various ionospheric disturbances produced by the electric field, the magnetic field, etc. for studying earthquakes, volcanic eruptions or tsunamis. The DEMETER satellite is the first in a series of microsatellites developed by CNES (Centre National d'Etudes Spatiales, the French space agency). The scientific payload is made up of five instruments, including an ICE (Electric Field Instrument) for the measurement of the electric field; an IMSC (Magnetometer Search-Coil Instrument) for the measurement of the magnetic field; an IAP (Plasma Analyzer Instrument) plasma analyser; and the ISL (Langmuir Probe Instrument), a set of two Langmuir probes [28]. Precisely detecting seismic anomalies related to earthquakes within remote sensing satellite data is vitally important in order to make sufficient use of remote sensing satellite resources to monitor the stability of active faults beneath the earth and to identify potential earthquake zones. A possible solution to these problems is to employ advanced data mining methods to detect abnormal events embedded in remote sensing satellite data. The key challenge facing data mining research is to properly and rapidly digest massive volumes of remote sensing satellite data in order to detect abnormal events. More recently, Cervone et al. have developed a new data mining method based on wavelet analyses to detect anomalous SLHF maxima peaks associated with four coastal earthquakes [2, 3]. In this paper we propose to use wavelet transformations as a data mining tool to detect seismic anomalies within remote sensing satellite data. We have performed an assessment of a number of wavelet methods and selected two real continuous wavelets: the Daubechies wavelet and the Gaussian derivative wavelet. The distinguishing feature of our method is that we calculate the wavelet maxima that propagate from coarser to finer scales over the data and then identify strong anomalies in the maxima lines distributed over the data by accounting only for those phenomena that show continuity in both time and space. The identified anomalies are regarded as potential anomalies related to the earthquakes. In this context, time continuity means that the detected anomalies occur at the same time or with a short delay of each other, while space continuity means that the detected anomalies are distributed in space according to a precise geometry conforming to the geological settings of the region.
The proposed method can be applied to different types of spatial and temporal data and is not restricted to a particular resolution or time sampling.
The proposed method has been applied to analyze the remote sensing satellite data associated with the two earthquakes that recently occurred in Wenchuan and Puer, China. Combined with the tectonic explanation of the spatial and temporal continuity of the abnormal phenomena, the analyzed results indicate a number of singularities associated with possible seismic anomalies of the earthquakes.
2 Earthquakes and Data In this study, two earthquakes are selected for evaluating the proposed method. The first one is the Wenchuan earthquake of magnitude 8.0 that is the largest earthquake in China in the past thirty years. It occurred on 12th May 2008, the location of the epicenter is at 30.986°N, 103.364°E, and the depth is 19 km. The main earthquake was followed by a series of smaller aftershocks. The second is the Puer earthquake of magnitude 6.4 that occurred on 3rd June 2007. The earthquake's epicenter is at 23°N, 101.1°E. Two types of satellite remote sensing data have been used in this paper for studying the seismic anomalies related to the earthquakes. One is OLR data. The OLR energy flux is characterized by a number of parameters, such as the emission from the ground, atmosphere and clouds formation, which have been being observed on the top of the atmosphere by National Oceanic and Atmosphere Administration (NOAA) satellites [15]. These OLR data have been recorded twice-daily by the several polarorbiting satellites for more than eight years, forming time series data across the different periods of time along with the spatial coverage of the entire earth. The original OLR data are processed by the interpolation technique to minimize the distance in space or time over which a value is interpolated. The detail of the interpolation technique has been given by Liebmann and Smith [13]. The data used for this study are twice-daily means from the NOAA-18 satellite. Their spatial coverage is 1×1 degree of latitude by longitude covering the area of 90°N – 90°S and 0°E – 357.5°E, and the time range is from 3rd September 2006 to 28th September 2008, forming time series data over the specified region. The other type of data is DEMETER satellite data, which are electron density and ion density variations obtained using the Langmuir Probe Instrument (ISL) and the Plasma Analysis Instrument (IAP), which are two sensors onboard DEMETER to survey the ionosphere. ISL is designed to measure the electron density of plasma, electron temperature and the potential of the satellite. The data analysis method for this instrument has been described in Lebreton et al. [25]. IAP is used to measure the main parameters of the thermal population that is the densities of the major ionospheric ions H+, He+ and O+, their temperatures and the ion flow velocity in the Earth’s frame of reference. The description of the analysis algorithm and its application to DEMETER data is given in Berthelier et al. [26]. All data files and plots are organized by half-orbits. The data used for this study are the electron density (Ne), electron temperature (Te) and ion density (Ni) that are measured by ISL, and ion density (Ni) measured by IAP. All the data were taken from the data base in DEMETER mission center, Orleans France [28].
3 Methodology
There are several wavelets to choose from for the analysis of remote sensing data. The best one for our application depends on the nature of the remote sensing data and on what we require in detecting seismic anomalies. We have undertaken an empirical analysis of several wavelet methods and selected two for our study. The first is one of the Daubechies wavelets, called db1, and the second is one of the Gaussian Derivative wavelets, called gaus3. Both methods employ one-dimensional continuous wavelet transformations. We use these two methods to analyze the continuity of the modulus maxima in time and space and to detect singularities within the remote sensing data covering the two earthquakes.
3.1 Wavelet Transformation
The formalism of the continuous wavelet transform (CWT) was first introduced by Grossmann and Morlet [18]. Formally it is written as:
$$\gamma(s,\tau) = \int f(t)\,\psi^{*}_{s,\tau}(t)\,dt \qquad (1)$$
where * denotes complex conjugation, s is the scale factor and τ is the translation factor which are the new dimensions after the wavelet transform. This equation shows how a function f (t) is decomposed into a set of basis functions, called the wavelets. For the sake of completeness, formula (2) gives the inverse wavelet transform.
$$f(t) = \iint \gamma(s,\tau)\,\psi_{s,\tau}(t)\,d\tau\,ds \qquad (2)$$
The wavelets are generated from a single basic wavelet $\psi(t)$, the so-called mother wavelet, by scaling and translation:
$$\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-\tau}{s}\right) \qquad (3)$$
where the factor $1/\sqrt{s}$ is for energy normalization across the different scales. It is important to note that the theory of wavelet transforms not only deals with the general properties of wavelets and wavelet transforms, but also defines a framework for designing other types of wavelets. The methods used in this study are the Daubechies wavelets and the Gaussian Derivative wavelet. Daubechies wavelets are compactly supported, with extremal phase and the highest number of vanishing moments for a given support width; the associated scaling filters are minimum-phase filters [16]. The Gaussian wavelets are derived from the Gaussian probability density function. Gaussian functions are optimal in terms of their time-frequency localisation, and this property makes it possible to design filters with very narrow frequency bands. Mallat and Hwang [14] introduced a method for processing and detecting singularities using wavelets, in which singularities are initially detected by
calculating the local maxima of the wavelet transform modulus. It has been proved that the modulus maxima detect all singularities. In this work, we have experimented with several one-dimensional wavelet transformations and selected the better-performing wavelet functions db1 (Daubechies wavelet) and gaus3 (Gaussian wavelet) to calculate maxima lines. The basic idea is that, using the wavelet functions db1 and gaus3, we calculate the wavelet coefficients of the time series remote sensing data and then perform a further calculation on these wavelet coefficients, resulting in a set of numeric values called wavelet maxima.

3.2 Calculating Singularities
For any given real-valued function $\varphi$ with zero average, $\int_{-\infty}^{\infty}\varphi(t)\,dt = 0$, let

$$Wf(u,s) = \int f(t)\,\frac{1}{\sqrt{s}}\,\varphi\!\left(\frac{t-u}{s}\right)dt$$

be the real continuous wavelet transform of a function $f$. Since $\varphi$ has zero mean, this integral measures the variation of $f$ in a neighborhood of time $u$ whose size is proportional to the so-called scale factor $s > 0$. A point $(u_0, s_0)$ is defined to be a modulus maximum if $Wf(u_0, s_0)$ is a local maximum, i.e. if

$$\frac{\partial Wf(u_0, s_0)}{\partial u} = 0$$

and if $Wf(u_0, s_0)$ is strictly increasing to the left of $u_0$ or strictly decreasing to the right of $u_0$. In other words, the goal is to identify the isolated local maxima of the wavelet transform $Wf(u, s)$ for each $s > 0$. A connected curve $\gamma$ in the scale-time plane is called a maxima line if $(u,s) \in \gamma$ implies that $(u,s)$ is a modulus maximum. Modulus maxima carry a significant amount of information about the position of singularities; in particular, it is possible to prove that for every singularity $t_0$ of $f$ one finds a sequence of modulus maxima $(u_i, s_i)$ such that $s_i \to 0$ and $u_i \to t_0$.

3.3 Experimental Procedure
3.3.1 OLR Data
By considering the tectonic background, continental boundaries and fault lines, we define the study area and divide it into a set of grids. The analyzed results on the different grids are combined into an $n \times m$ matrix, in which the rows $n$ correspond to the selected scales at each of the grids in which the wavelet analysis has been performed, the columns $m$ correspond to time, and the entry values of the matrix are either the propagation lengths of significant maxima lines or zero if none has been
detected at this particular point in space/time. As such, the time-series measurements of wavelet maxima are continuous in space and time along the grid path. To effectively visualize maxima lines, we use different colors to represent the degrees of the maxima magnitudes, in which the dark color indicates the most significant singularity. By plotting the sequences of modulus maxima lines, we obtain continuous curves of maxima both in space and time. The third part of Fig. 2 shows an example of a set of maxima lines, Fig. 3 shows the curves of the sequences of modulus maxima, and Fig. 4 indicates the identified singularities. The following describes the experimental procedure and analysis method through an example of the Wenchuan earthquake using OLR data. First, we define an experimental area. The Wenchuan earthquake is postulated to be the result of motion on a northeast-striking reverse fault, or thrust fault, on the northwestern margin of the Sichuan Basin. By taking into account the tectonic background, continental boundaries and active faults, we define an experimental area and divide it into a set of grids as shown in Fig. 1 [17]. Secondly, based on the defined grids, OLR daily data from 28th September 2007 to 28th September 2008 are downloaded from the NOAA Climate Prediction Center. After pre-processing, we employ the wavelet methods db1 and gaus3 to analyze the data and generate wavelet maxima values. The singularities detected from these values on each of the grids are then visualized. For example, Fig. 2 shows the resulting maxima curves for the first grid. The figure consists of three components: a) the original time series OLR data, b) a pseudo three-dimensional representation of the wavelet coefficients, and c) significant wavelet maxima detected over time. The colors in the legend indicate the degrees of maxima magnitudes from the largest degree to the smallest one.
Fig. 1. Grids defined for the Wenchuan earthquake (the map is adapted from [17])
Fig. 2. An example of analysis results and corresponding maxima curves
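The wavelet-maxima computation just described can be approximated with off-the-shelf tools. The sketch below is only an illustration under our own assumptions, not the authors' implementation: PyWavelets' cwt is used with the gaus3 continuous wavelet (db1 is a discrete wavelet and is not available in that routine), and modulus_maxima, the toy series and the crude per-day "propagation" count are hypothetical names and simplifications.

```python
import numpy as np
import pywt

def modulus_maxima(signal, scales=np.arange(1, 33), wavelet="gaus3"):
    """Continuous wavelet transform followed by detection of modulus maxima.

    Returns the coefficient magnitudes and a (scales x time) boolean mask
    that is True where |Wf(u, s)| is a strict local maximum along time.
    """
    coeffs, _ = pywt.cwt(signal, scales, wavelet)   # rows: scales, cols: time
    mag = np.abs(coeffs)
    maxima = np.zeros_like(mag, dtype=bool)
    # a point is a modulus maximum if it exceeds both temporal neighbours
    maxima[:, 1:-1] = (mag[:, 1:-1] > mag[:, :-2]) & (mag[:, 1:-1] > mag[:, 2:])
    return mag, maxima

# toy usage: a daily OLR-like series with an injected transient
t = np.arange(366)
series = np.sin(2 * np.pi * t / 30) + 0.1 * np.random.randn(t.size)
series[180:183] += 3.0                              # artificial anomaly
mag, maxima = modulus_maxima(series)
# count, per day, how many scales carry a maximum (a crude propagation length)
propagation = maxima.sum(axis=0)
print("day with the longest maxima line:", int(propagation.argmax()))
```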
Thirdly, we take every maxima curve in each of the grids and rearrange them onto one diagram as shown in Fig. 3. In the figure the x-axis represents time in day units, and the y-axis represents the grids in a sequential order. The magnitudes of the maxima represent the degrees of seismic anomalies: the larger the magnitude, the higher the degree of seismic anomaly. The figure heading lists the earthquake name, the period of the selected data, the data type, the region of the data, the grid path and the wavelet method used. The red line indicates the day when the earthquake occurred.

Fig. 3. The curves of wavelet maxima computed by wavelet transformation

Fig. 4. Identified singularities from the curves of the modulus maxima

The final stage is to identify singularities from the maxima curves. The key feature of singularities is that they form a continuous maxima line with a large magnitude. In Fig. 3 we can find several singularities, which can be grouped into three types: 1) singularities prior to the earthquake, which may be caused by the large energy flux before the earthquake; 2) singularities at the time when the earthquake occurred, which may be caused by the release of a large amount of energy; and 3) singularities after the earthquake, perhaps caused by the many aftershocks. In Fig. 4, two obvious singularities prior to the earthquake and one after the earthquake are highlighted with dashed red ovals.

3.3.2 DEMETER Satellite Data
Previous studies indicated that ionospheric precursors often appear over the region near the epicenter within a short time prior to earthquakes [21, 27, 28]. For this study, data were selected from DEMETER over 11 days (10 days before the earthquake and the day of the earthquake occurrence). Using the epicenter of an earthquake as the central point of a circle and 1888 km as a diameter, we selected all the orbits in the 11 days for an earthquake. Taking the Puer earthquake as an example, among the 284 selected orbits of DEMETER during this 11-day period, only 29 half orbits meet the requirement of this study. The experimental procedure and analysis method for the selected DEMETER data are similar to the ones in Section 3.3.1. First, we define abnormal seismic events. The abnormal events are obvious synchronous perturbations of several physical parameters in the same time interval along the same orbit. Those perturbations can be visualized in the figures of Level 2 data provided by the DEMETER mission centre. Secondly, based on the figures of Level 2 data, the abnormal events were found in the dataset of the selected half orbits of the two earthquakes. Taking the Puer earthquake as an example, after analyzing the figures of Level 2 data of the Puer earthquake, anomalous events were found in half orbits 15440_1 and 15572_1. We take only orbit 15440_1 to explain the experimental procedure.
Thirdly, we analyze the Level 1 data that correspond to the half orbits using the wavelet maxima method. Fig. 5 visualizes the Level 2 data of half orbit 15440_1. In this figure the x-axis gives time, longitude and latitude, the y-axis represents the values of the physical parameters ISL Ne, ISL Ni, ISL Te and IAP Ti, and the heading lists the date and orbit number. The blue lines cover the region that is 1888 km distant from the epicenter. The red lines indicate the synchronous perturbations of the physical parameters in the same time interval. In Fig. 5, the synchronous perturbations of the physical parameters are fairly obvious. We consider such perturbations as anomalous events found in the Level 2 data of half orbit 15440_1. In the next step, we carried out the analysis on the Level 1 data that corresponds to half orbit 15440_1.
Fig. 5. Anomalous events found in the figures of Level 2 data of half orbit 15440_1
Fig. 6. Plot of the physical parameters of the Level 1 data of half orbit 15440_1
Fig. 7. Wavelet maxima of the parameters ISL Ne, ISL Ni, ISL Te and IAP Ti
Fig. 6 shows the plot of the physical parameters (ISL Ne, ISL Ni, ISL Te and IAP Ti) of the Level 1 original data of half orbit 15440_1. In the figure, the x-axis indicates universal time and longitude, the y-axis represents the values of these physical parameters, and the figure heading lists the date and orbit number. The blue lines cover the region that is 1888 km distant from the epicenter. The red lines indicate the time or latitude of the earthquake occurrence. The obvious perturbations around the earthquake are highlighted with the dashed red rectangle. Fig. 7 shows a number of wavelet maxima lines detected for the physical parameters. The colors in the legend indicate the different degrees of maxima magnitudes from the smallest degree 1 to the largest one 8, where the larger the degree of the maxima magnitude, the more obvious the singularity, corresponding to a seismic anomaly. To make it easier to identify singularities from the maxima, Fig. 8 and Fig. 9 show the wavelet maxima lines of the physical parameters with only magnitudes 5 to 8, obtained by the different wavelet methods.
Fig. 8. Maxima lines of physical parameters of orbit 15440-1 with magnitudes 5 to 8 obtained using db1
Fig. 9. Maxima lines of physical parameters of orbit 15440-1 with magnitudes 5 to 8 obtained using gaus3
4 Results and Discussion
4.1 Wenchuan Earthquake
Based on historical seismic activities and tectonic characteristics, we select three areas, namely Regions 1 and 2 and the Wenchuan earthquake region, called the main region, to carry out comparative studies of the three regions in time and space. The main region covers the earthquake. Region 1 is adjacent to the active fault line, while Region 2 is far from the fault zones and has no recorded historical earthquakes. The duration of the OLR data used for the three regions is from 28th September 2007 to 28th September 2008. The location of the main region is from 30°N, 103°E to 33°N, 106°E, Region 1 is from 28°N, 105°E to 31°N, 108°E, and Region 2 is from 44°N, 113°E to 47°N, 116°E. The following figures (Fig. 10, Fig. 11 and Fig. 12) show the wavelet maxima curves of the three regions produced using gaus3. The red line indicates the day when the Wenchuan earthquake occurred.
Fig. 10. Maxima curves of the Wenchuan region
Fig. 11. Maxima curves of the Region 1
In Fig. 10 several continuous singularities are identified, some of them around the Wenchuan earthquake. These singularities may be caused by the large amount of energy generated by the Wenchuan earthquake. In comparison, the maxima curves in Fig. 11 are more disordered, but one continuous singularity can still be clearly observed. Looking at Fig. 12, the maxima lines are in complete disorder. The distribution of the singularities in Fig. 11 is similar to that in Fig. 10. However, in Fig. 10 the maxima lines of the singularities are more continuous, with larger magnitudes, and a clear singularity appears on the day when the earthquake occurred. Although a similar distribution appears in Fig. 12, the maxima lines are disordered and the magnitudes of the maxima are smaller. Considering the geographic region and tectonic background of the earthquake, we conclude that the singularities in the wavelet maxima curves of the Wenchuan region are more informative and regular than those in the other two regions. In particular, the singularities in Region 2 are completely disordered, since the region is stable and there have been almost no earthquakes in this area in the past decades. Singularities can also be found in Region 1 on the day when the Wenchuan earthquake occurred; this could be because Region 1 is close to the active fault line and to the epicenter of the Wenchuan earthquake.
Fig. 12. Wavelet maxima analysis curves of Region 2
Fig. 13. Studied physical parameters of the Level 1 data of half orbit 15572_1
4.2 Puer Earthquake
Two half orbits, 15440_1 and 15572_1, within the DEMETER satellite data were selected for studying the Puer earthquake. As shown in Fig. 8 and Fig. 9, the maxima lines of the studied physical parameters of half orbit 15440_1 are plotted by the two different wavelet methods, in which db1 is used for Fig. 8 and gaus3 for Fig. 9. In Fig. 8 an obvious singularity of the physical parameter ISL Ni can be observed around the epicenter; however, most singularities are distributed outside the blue lines. Based on their time and latitude, these singularities may be caused by the effect of the magnetic fields of the two Polar Regions and of the conjugate points over the epicenter. However, no corresponding singularities can be observed in Fig. 9, which is produced by gaus3. Since the density of the data of these physical parameters is very high, singularities related to the earthquake cannot be detected well. By contrast, for the parameter IAP Ti the amount of data is smaller than for the other three parameters, so the singularities can be detected more clearly. Fig. 13 shows the original data of another half orbit, 15572_1, of the Puer earthquake. From the figure, two obvious singularities are detected and highlighted in the original signal with the dashed red rectangle. Singularities outside the earthquake region can also be observed in the two Polar Regions and at the conjugate points over the epicenter.
Fig. 14. Maxima lines of physical parameters of orbit 15572-1 with magnitudes 5 to 8 obtained by db1
Fig. 15. Maxima lines of parameters of orbit 15572-1 with magnitudes 5 to 8 obtained by gaus3
In Fig. 14, singularities around the epicenter can also be identified, corresponding to the variation of the original signal, and several singularities of the two physical parameters ISL Ne and ISL Ni can be found in the Polar Regions. For the physical parameter ISL Te in Fig. 14, singularities can also be found in the Polar Regions, but no obvious singularities can be found around the epicenter. In Fig. 15, obvious singularities of the physical parameter ISL Ne appear near the epicenter, but for the physical parameters ISL Ni and ISL Te only a few singularities can be found near the Polar Regions. For IAP Ti, the singularities are very obvious and correspond to the variation of the original signal.
5 Conclusion
This paper presents a comparative analysis of the selected remote sensing satellite data associated with the Wenchuan and Puer earthquakes and explains how the remote sensing singularities discovered could be related to the earthquakes. The numerous experiments and analyses in time and space of the Wenchuan earthquake show that prominent OLR singularities could be found prior to the earthquake in the wavelet maxima curves, and that these singularities exhibit continuity both in space and time. Our studies therefore conclude that singularities discovered within OLR data could be regarded as an effective indicator for detecting seismic anomalies. Compared with the OLR data, the proposed maxima method appears not to be effective in detecting seismic anomalies in the DEMETER satellite data. There are two possible reasons for this: one is that the density of the DEMETER satellite data used is too high to detect anomalies from the original data; the other is that singularities outside the earthquake region can also be caused in the Polar Regions and at the conjugate points over the epicenter, which are not related to the earthquake event. Singularities caused by these natural phenomena disturb the result of detecting seismic anomalies. Therefore the conclusion is that the proposed method is suitable for OLR data rather than DEMETER satellite data. With further study of the proposed method, this finding will be further validated for detecting seismic anomalies using remote sensing satellite data in the future.
Acknowledgements
This work is supported by the project "Data Mining with Multiple Parameters Constraint for Earthquake Prediction" (funded by the Ministry of Science and Technology of China, Grant No. 2008BAC35B05). The authors would like to acknowledge the School of Computing and Mathematics at the University of Ulster for hosting the first author's visit, NOAA for making the OLR data available, and the DEMETER mission center for making their data available to various research communities.
References 1. Carreno, E., Capote, R., Yague, A., et al.: Observations of thermal anomaly associated to seismic activity from remote sensing. In: General Assembly of European Seismology Commission, Portugal, September 10-15, pp. 265–269 (2001) 2. Cervone, G., Kafatos, M., Napoletani, D., et al.: Wavelet Maxima Curves Associated with Two Recent Greek Earthquakes. Nat. Hazards Earth Syst. Sci. 4, 359–374 (2004) 3. Cervone, G., Singh, R.P., et al.: Wavelet maxima curves of surface latent heat flux anomalies associated with Indian earthquakes. Nat. Hazards Earth Syst. Sci. 5, 87–99 (2005) 4. Ouzounov, D., Freund, F.: Mid-infrared emission prior to strong earthquakes analyzed by remote sensing data. Advances in Space Research 33(3), 268–273 (2004) 5. Ouzounov, D., Liu, D., et al.: Outgoing long wave radiation variability from IR satellite data prior to major earthquakes. Tectonophysics 431, 211–220 (2007) 6. Pulinets, S.A., et al.: Ionospheric Precursors of Earthquakes, p. 316. Springer, Berlin (2004)
7. Hayakawa, M., Molchanov, O.A.: Seismo Electromagnetics: Lithosphere-AtmosphereIonosphere Coupling. In: TERRAPUB, Tokyo, p. 477 (2002) 8. Dey, S., Singh, R.P.: Surface latent heat flux as an earthquake precursor. Nat. Haz. Earth Sys. Sci. 3, 749–755 (2003) 9. Tronin, A.A.: Satellite thermal survey application for earthquake prediction. In: Hayakawa, M. (ed.) Atmospheric and Ionospheric Phenomena Associated with Earthquakes, pp. 717–746. TERRAPUB, Tokyo (1999) 10. Tronin, A.A., Hayakawa, M., Molchanov, O.A.: Thermal IR satellite data application for earthquake research in Japan and China. J. Geodyn. 33, 519–534 (2004) 11. Tramutoli, G., Di Bello, N., Pergola, S.: Robust satellite techniques for remote sensing of seismically active areas. Annals of Geophysics 44(2), 295–312 (2001) 12. Qiang, Z.: Thermal Infrared Anomoly Precursor of Impending Earthquakes. Pure Pur. A. Geoph. 149, 159–171 (1997) 13. Liebmann, B., et al.: Description of a Complete (Interpolated) Outgoing Longwave Radiation Dataset. Bulletin of the American Meteorological Society 77, 1275–1277 (1996) 14. Mallat, S., Hwang, W.L.: Singularity Detection And Processing With Wavelets. IEEE Transactions on Information Theory 38, 617–643 (1992) 15. NCAR and NOAA, 2008: ftp ftp.cpc.ncep.noaa.gov; cd precip/noaa* for OLR directories (2008) 16. Misiti, M., Misiti, Y., et al.: Wavelet ToolboxTM 4 User’s Guide (2008) 17. The Science Behind China’s Sichuan Earthquake (2008), http://www.tectonics.caltech.edu 18. Grossmann, A., Morlet, J.: Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J. Math. 15, 723–736 (1984) 19. Zhu, R., et al.: Ionospheric perturbations before Pu’er earthquake observed on DEMETER. Acta Seismologica Sinica 21(1), 77–81 (2008) 20. Sarkar, S., et al.: Ionospheric variations observed by the DEMETER satellite in the midlatitude region during strong earthquakes. Journal of Atmospheric and Solar-Terrestrial Physics 69, 1524–1540 (2007) 21. Parrot, M., et al.: Examples of unusual ionospheric observations made by the DEMETER satellite over seismic regions. Physics and Chemistry of the Earth 31, 486–495 (2006) 22. Pulinets, S., Boyarchuk, K.: Ionospheric Precursors of Earthquakes, pp. 75–83. Springer, Berlin (2003) 23. Bhattacharya, S., et al.: Electric and magnetic field perturbations recorded by DEMETER satellite before seismic events of the 17th July 2006 M 7.7 earthquake in Indonesia. Journal of Asian Earth Sciences (2008) 24. Benghanem, K., et al.: Study of the ionospheric perturbations associated with earthquakes using wavelet transform for DEMETER satellite data. Geophysical Research Abstracts 10, EGU2008-A-01776 (2008) 25. Lebreton, J.P., Stverak, S., Travnicek, P., Maksimovic, M., Klinge, D., Merikallio, S., Lagoutte, D., Poirier, B., Kozacek, Z., Salaquarda, M.: The ISL Langmuir Probe experiment and its data processing onboard DEMETER: scientific objectives, description and first results. Planetary Space Science 54, 472–486 (2006) 26. Berthelier, J.J., Godefroy, M., Leblanc, F., Seran, E., Peschard, D., Gilbert, P., Artru, J.: IAP, the thermal plasma analyzer on DEMETER. Planetary Space Science 54, 487–501 (2006) 27. Chmyrev, V.M., Isaev, N.V., Serebryakova, O.N., et al.: Small-scale plasma inhomogeneities and correlated ELF emissions in the ionosphere over an earthquakes region. Journal of Atomspheric and Solar-Terrestrial Physics 59(9), 967–974 (1996) 28. Lagoutte, D., et al.: The DEMETER Science Mission Centre. Planetary and Space Science 54, 428–440 (2006)
Spectrum Steganalysis of WAV Audio Streams

Qingzhong Liu 1,2, Andrew H. Sung 1,2,*, and Mengyu Qiao 1

1 Computer Science and Engineering Department
2 Institute for Complex Additive Systems Analysis
New Mexico Tech, Socorro, NM 87801, USA
{liu,sung,myuqiao}@cs.nmt.edu
Abstract. In this paper, we propose an audio steganalysis method called reference based Fourier Spectrum Steganalysis. The mean values and the standard deviations of the high frequency spectrum of the second and high order derivatives are extracted from the testing signals and the reference versions. A Support Vector Machine (SVM) is employed to discriminate the unadulterated carrier signals and the steganograms wherein covert messages were embedded. Experimental results show that our method delivers very good performance and holds great promise for effective detection of steganograms produced by Hide4PGP, Invisible Secrets, S-tools4 and Steghide. Keywords: steganalysis, spectrum, audio, derivative, steganography, SVM.
1 Introduction
Steganography is the art and science of hiding data in digital media including images, audio signals, and videos. In recent years, many different algorithms for information-hiding have been designed and implemented [30-34]. Conversely, steganalysis refers to the detection of information-hiding behaviors in digital media. In the past few years, many researchers have presented steganalysis methods for various steganography systems. Most of these methods are focused on detecting information-hiding in digital images. For example, one of the well-known detectors, the Histogram Characteristic Function Center Of Mass (HCFCOM), was successful in detecting noise-adding steganography [1]. Another well-known method is to construct a high-order moment statistical model in the multi-scale decomposition using a wavelet-like transform and then apply a learning classifier to the high-order feature set [2]. Shi et al. [3] proposed a Markov process based approach to detect information-hiding behaviors in JPEG images. Based on the Markov approach, Liu et al. [4] expanded the Markov features to the inter-bands of the DCT domains, combined the expanded features with the polynomial fitting of the histogram of the DCT coefficients, and successfully improved the steganalysis performance on multiple JPEG images. Other works in image steganalysis can be found in references [5-10].

* Corresponding author.
Research on digital audio steganalysis falls behind the activity in digital image steganalysis. Since digital audio signals and digital images have different characteristics, most of the effective steganalysis methods for images don’t work for digital audio signals. Although there are relatively few audio steganalysis methods, the following schemes provided good exploration for detecting the informationhiding: Ru et al. presented a detection method by measuring the features between the signal and a self-generated reference signal via linear predictive coding [11, 12]; Avcibas designed the content-independent distortion measures as features for classifier design [13]; Ozer et al. constructed the detector based on the characteristics of the denoised residuals of the audio file [14]; to detect the information-hiding in audio signals, Johnson et al. set up a statistical model by building a linear basis that captures certain statistical properties of audio signals [15]; Kraetzer and Dittmann [16, 17] proposed a Mel-cepstrum based analysis to perform a detection of embedded hidden messages; recently, by utilizing and expanding the Markov approach that was proposed by Shi et al. [3] for detecting the information-hiding in digital images, Liu et al. designed the expanding Markov features for steganalysis of digital audio signals [27]. Qiao et al. [35] applied neighboring joint density and Markov approach to steganalysis for MP3 compressed audio. Zeng et al. designed new algorithms to detect phase coding steganography based on analysis of the phase discontinuities [28] and to detect echo steganography based on statistical moments of peak frequency [29]. In this article, we propose an audio steganalysis method, named Fourier Spectrum Steganalysis (FSS). The mean values and the standard deviations of the high frequency spectrum of the second and high order derivatives are extracted from the testing audio signals and the reference versions, as the detector. A Support Vector Machine (SVM) with RBF kernel is employed to differentiate the unadulterated carrier signals and the steganograms. Experimental results show that our method is highly successful for audio steganalysis. In comparison with the steganalysis method based on the high order statistics derived from linear prediction error [11, 12], and the latest method based on improved Markov approach and the expanding features [27], the advantage of our proposed FSS is remarkable. The rest of the paper is organized as follows: section 2 presents the derivative based additive noise model for audio steganalysis, section 3 describes the generation of the signal reference and the calculation of the difference features of the spectrum. Experiments are presented in section 4, followed by discussions in section 5 and conclusion in section 6.
2 Derivative Based Additive Noise Model for Audio Steganalysis
In image processing, the second-order derivative is widely employed for detecting isolated points, edges, etc. [18]. Bearing this approach in mind, we developed a scheme based on the joint distribution and the conditional distribution of the second-order derivative for audio steganalysis. Based on our previous work, we expand the second-order derivative to higher-order derivatives. Specifically, we integrate the second, third, and fourth derivatives of the audio signals, described as follows.
An audio signal is denoted as $f(t)$, where $t$ is the sample time and $t = 0,1,2,\dots,N-1$. The second, third, and fourth derivatives, denoted by $D^2_f(\cdot)$, $D^3_f(\cdot)$, and $D^4_f(\cdot)$, respectively, are given as

$$D^2_f(t) \equiv \frac{d^2 f}{dt^2} = f(t+2) - 2f(t+1) + f(t), \quad t = 0,1,2,\dots,N-3 \qquad (1)$$

$$D^3_f(t) \equiv \frac{d^3 f}{dt^3} = f(t+3) - 3f(t+2) + 3f(t+1) - f(t), \quad t = 0,1,2,\dots,N-4 \qquad (2)$$

$$D^4_f(t) \equiv \frac{d^4 f}{dt^4} = f(t+4) - 4f(t+3) + 6f(t+2) - 4f(t+1) + f(t), \quad t = 0,1,2,\dots,N-5 \qquad (3)$$

The embedded message is denoted as $h(t)$ and the stego-signal as $s(t)$. The approximation of $s(t)$ is $s(t) \approx f(t) + h(t)$, but in general it is not exact. To represent $s(t)$ exactly, we assume the error between $s(t)$ and $f(t)$ is $e(t)$, that is,

$$s(t) = f(t) + e(t) \qquad (4)$$

The derivatives of the error $e(t)$ and of $s(t)$ are denoted by $D^n_e(\cdot)$ and $D^n_s(\cdot)$ ($n = 2,3,4$), respectively. We obtain

$$D^n_s(\cdot) = D^n_f(\cdot) + D^n_e(\cdot), \quad n = 2,3,4 \qquad (5)$$

The Discrete Fourier Transforms (DFTs) of $D^n_s(\cdot)$, $D^n_f(\cdot)$, and $D^n_e(\cdot)$ are denoted as $F^s_k$, $F^f_k$, and $F^e_k$, respectively:

$$F^s_k = \sum_{t=0}^{M-1} D^n_s(t)\, e^{-j 2\pi k t / M} \qquad (6)$$

$$F^f_k = \sum_{t=0}^{M-1} D^n_f(t)\, e^{-j 2\pi k t / M} \qquad (7)$$

$$F^e_k = \sum_{t=0}^{M-1} D^n_e(t)\, e^{-j 2\pi k t / M} \qquad (8)$$

where $k = 0,1,2,\dots,M-1$ and $M$ is the number of samples of the derivatives. We have
$$F^s_k = F^f_k + F^e_k \qquad (9)$$

Assume that $\theta$ is the angle between the vectors $F^f_k$ and $F^e_k$; then

$$|F^s_k|^2 = |F^f_k|^2 + |F^e_k|^2 + 2\,|F^f_k|\,|F^e_k|\cos\theta \qquad (10)$$

The expected value of $|F^s_k|^2$ is

$$E\left(|F^s_k|^2\right) = |F^f_k|^2 + |F^e_k|^2 \qquad (11)$$

and we also have

$$\frac{E\left(|F^s_k|^2\right)}{|F^f_k|^2} = 1 + \frac{|F^e_k|^2}{|F^f_k|^2} \qquad (12)$$
Since the expected values of all the derivatives are 0, the spectra at the lowest frequency are zero. The error $e(t)$ can be treated as a random error with an expected value of 0. Referring to [25, 26], the spectrum $|F^e_k|$ is approximately described by a Gaussian or Gaussian-like distribution. The power is zero at the lowest frequency; as the frequency increases, the spectrum increases. That is, the spectrum at high frequency is higher than that at low frequency. Fig. 1 shows the spectrum distribution of the second to fourth order derivatives of a random error taking the values +1, -1, and 0. It demonstrates that the high-frequency spectrum of the derivatives (the central part) is bigger than that of the other parts.
Fig. 1. Random error signals with 500 (a) and 10000 (b) samples, respectively, and the spectrum distributions (before shifting) of their derivatives. The red dashed rectangles indicate the areas of the high-frequency spectrum.
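The effect illustrated in Fig. 1 is easy to reproduce numerically. The following sketch is our own simulation under stated assumptions (a random ±1/0 error signal, its second derivative as in equation (1), and a comparison of low- versus high-frequency magnitudes), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
# random error signal taking values +1, -1 and 0, as in Fig. 1
e = rng.choice([-1, 0, 1], size=N)

# second-order derivative, cf. equation (1)
d2 = e[2:] - 2 * e[1:-1] + e[:-2]

# magnitude spectrum of the derivative (before fftshift, so the high
# frequencies sit near the middle of the two-sided array)
spectrum = np.abs(np.fft.fft(d2))
low = spectrum[: len(spectrum) // 8].mean()                          # lowest band
high = spectrum[3 * len(spectrum) // 8 : len(spectrum) // 2].mean()  # near Nyquist
print(f"mean |F| low band: {low:.1f}, high band: {high:.1f}")        # high >> low
```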
Normally, digital audio signals are band-limited; that is, the magnitudes of the high-frequency components are limited, although the high-frequency spectra differ from one audio signal to another. Based on equation (12), at low and middle frequencies the spectrum of the audio signal is much greater than the spectrum of the error signal, so the modification of stego-audio signals at low and middle frequencies is negligible. However, the modification of the high-frequency components may be the clue that allows us to detect the information-hiding behavior, since the magnitude of the high-frequency components of the audio signal is limited and the energy of the Fourier transform of the derivative of the error signal is concentrated in the high-frequency components. This is the key point of our steganalysis: information-hiding in audio signals generally increases the high-frequency spectrum of the derivatives. We can therefore measure the statistics of the high-frequency spectrum to recognize whether or not a signal carries a covert message. Fig. 2 shows the spectrum distribution of the derivatives of an innocent signal and the distribution of the stego-signal that is generated by hiding a message in the innocent signal. It clearly shows that the high-frequency spectrum of the second derivative of the stego-signal has higher mean values than that of the cover.
Fig. 2. The comparison of the spectrum (the first row: whole frequency; and the second row: high-frequency) of the second derivatives of a cover signal and the stego-signal
At this point, we present the following procedure to extract the statistical characteristics of the spectrum.
1. Obtain the Fourier spectrum of the derivatives of the testing signal.
2. Calculate the mean values and the standard deviations of the different frequency zones of the spectrum from step 1. In our approach, we equally divide the whole frequency zone into Z (Z is set to 20 ~ 80) zones or parts, from the lowest to the highest frequency. The mean value and the standard deviation of the i-th zone are denoted as $m_i$ and $\sigma_i$, respectively.
3. Choose the $m_i$ and $\sigma_i$ values from the high-frequency spectrum as the features. In our approach, if Z = 80, i is set from 66 to 80.
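The three-step procedure above can be sketched in a few lines. The sketch below is our own illustration, not the authors' code: the derivative is taken as a finite difference (np.diff), the one-sided magnitude spectrum is split into Z zones, and the mean/standard deviation of the top zones are kept. With a one-sided spectrum the highest-frequency zones sit at the end of the array (they correspond to the "central part" of the two-sided spectra shown in Fig. 1).

```python
import numpy as np

def spectrum_zone_features(x, order=2, Z=80, high_zones=range(65, 80)):
    """Steps 1-3: derivative -> Fourier spectrum -> per-zone mean/std,
    keeping only the high-frequency zones (zones 66..80, 0-based 65..79)."""
    d = np.diff(x, n=order)                      # n-th order finite difference
    spec = np.abs(np.fft.rfft(d))                # one-sided magnitude spectrum
    zones = np.array_split(spec, Z)              # Z roughly equal frequency zones
    means = np.array([z.mean() for z in zones])
    stds = np.array([z.std() for z in zones])
    idx = list(high_zones)
    return np.concatenate([means[idx], stds[idx]])   # 15 means + 15 stds

# usage on dummy 16-bit PCM samples (a real system would read a WAV file)
x = np.random.default_rng(1).integers(-2**15, 2**15, size=48000).astype(float)
features = spectrum_zone_features(x, order=2)
print(features.shape)   # (30,)
```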
3 Reference Based Solution
Information-hiding does modify the statistics of the spectrum of the derivatives, as described in Section 2 and shown in Fig. 2. However, different audio signals have different statistical characteristics of the spectra; in other words, the spectrum statistics vary from one signal to another. Without any reference, it is still difficult to accurately detect some audio stego-systems, and we may even reach incorrect conclusions, especially when detecting audio steganograms in which the bit-depth modification is limited to the least significant bits and hence results in very small modifications to the original audio signals. Considering this point, we generate the signal reference as follows:
1. Randomly modify the least significant bit of the testing signal $g$; the modified version is denoted $r$. According to (1)-(3) and (6)-(8), we obtain $F^g_k$ and $F^r_k$.
2. Obtain the mean values and the standard deviations of the high-frequency spectra, denoted $m^g_i$ and $\sigma^g_i$, and $m^r_i$ and $\sigma^r_i$, associated with $F^g_k$ and $F^r_k$, respectively.
3. Calculate the differences $m^d_i$ and $\sigma^d_i$ in the following way:

$$m^d_i = m^r_i - m^g_i \qquad (13)$$

$$\sigma^d_i = \sigma^r_i - \sigma^g_i \qquad (14)$$

The values $m^d_i$ and $\sigma^d_i$, extracted from the high-frequency spectrum, are the final features. We may also combine these features with the $m_i$ and $\sigma_i$ from the high-frequency spectrum, described in Section 2, to constitute the final feature set.
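A possible reading of this reference-based scheme is sketched below, reusing the spectrum_zone_features helper from the previous sketch; lsb_randomize and diff_features are our own names and only approximate step 1 and equations (13)-(14), not the authors' implementation.

```python
import numpy as np

def lsb_randomize(samples, rng=None):
    """Generate the reference r by clearing and randomly re-setting the LSB
    of integer PCM samples (our reading of step 1, not the authors' exact code)."""
    if rng is None:
        rng = np.random.default_rng()
    s = samples.astype(np.int64)
    return ((s >> 1) << 1) | rng.integers(0, 2, size=s.shape)

def diff_features(samples, feature_fn):
    """Equations (13)-(14): feature differences between the reference r
    and the testing signal g, for any high-frequency feature extractor."""
    g_feat = feature_fn(samples.astype(float))
    r_feat = feature_fn(lsb_randomize(samples).astype(float))
    return r_feat - g_feat          # [m_i^d ..., sigma_i^d ...]

# usage, with dummy PCM samples standing in for a WAV file
pcm = np.random.default_rng(2).integers(-2**15, 2**15, size=48000)
d = diff_features(pcm, spectrum_zone_features)
print(d.shape)   # (30,)
```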
4 Experiments
4.1 Set Up and Comparison of Features
We have 1000 WAV audio files covering different types such as digital speech, on-line broadcast, and music. We produced the same
amount of stego-audio signals by hiding different messages in these audio signals. The hiding tools include Hide4PGP V4.0 [20], Invisible Secrets [21], S-tools4 [22], and Steghide [23]. The hidden data include different text messages, audio signals, and random signals, etc. The messages embedded in any two audio files are different. We set Z to 80 and extract 80 mean values and 80 standard deviations, 160 features in total, of the spectrum of the derivatives. Fig. 3 shows the F-statistics of the features $m_i$ and $\sigma_i$, and Fig. 4 shows the F-statistics of $m^d_i$ and $\sigma^d_i$, extracted from 215 stego-audio signals and compared with the statistics from 215 covers. Figures 3 and 4 clearly demonstrate that, regarding statistical significance, the values $m_i$ and $\sigma_i$, and $m^d_i$ and $\sigma^d_i$, of the high frequencies are much better than the values of the low and middle frequencies; the standard deviations are a little better than the mean values; and the features associated with Hide4PGP have higher significance scores than those associated with the other three information-hiding systems, which implies that the steganalysis performance for Hide4PGP will be the best. Comparing Fig. 3 to Fig. 4, except for the F-statistics of the features in the steganalysis of Hide4PGP, the features $m^d_i$ and $\sigma^d_i$ are better than $m_i$ and $\sigma_i$. This implies that the generation of the signal reference is good for Invisible Secrets, S-tools4, and Steghide, but not for the steganalysis of Hide4PGP.
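The per-feature F-statistics reported in Fig. 3 and Fig. 4 can be computed, for example, as one-way ANOVA F values between the cover and stego groups. The sketch below assumes SciPy and dummy data; the authors' exact statistic may differ.

```python
import numpy as np
from scipy.stats import f_oneway

def feature_f_statistics(cover_feats, stego_feats):
    """Per-feature one-way ANOVA F statistic between cover and stego groups.
    cover_feats, stego_feats: arrays of shape (n_samples, n_features)."""
    return np.array([
        f_oneway(cover_feats[:, j], stego_feats[:, j]).statistic
        for j in range(cover_feats.shape[1])
    ])

# usage with dummy data standing in for the 215 covers and 215 steganograms
rng = np.random.default_rng(3)
cover = rng.normal(0.0, 1.0, size=(215, 160))
stego = rng.normal(0.2, 1.0, size=(215, 160))    # slightly shifted means
print(feature_f_statistics(cover, stego)[:5])
```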
Fig. 3. F-statistics of the features $m_i$ and $\sigma_i$ (a) of the spectra of the second derivatives
Fig. 4. The F-statistics of the features $m^d_i$ and $\sigma^d_i$ (b) of the spectra of the second derivatives
4.2 Experimental Results
Based on the analysis in Section 4.1, we formed two types of feature sets. The first is called the comb-set, given as

$$\mathrm{COMB\text{-}SET}: \ \{x \mid x \in \{m_i\} \cup \{\sigma_i\} \cup \{m^d_i\} \cup \{\sigma^d_i\},\ i = 66, 67, \dots, 80\} \qquad (15)$$

The second is called the diff-set, given by

$$\mathrm{DIFF\text{-}SET}: \ \{x \mid x \in \{m^d_i\} \cup \{\sigma^d_i\},\ i = 66, 67, \dots, 80\} \qquad (16)$$
We employ an SVM with an RBF kernel [24] on the training and testing feature sets. 75% of the total feature sets are used for training; the other 25% are used for testing. The training sets and testing sets are randomly chosen in each experiment. We repeat the experiment 30 times for detecting each type of stego-audio signal. The testing results consist of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Since the testing numbers of each type (cover or stego-signal) of audio signal in each experiment are equal, the testing accuracy is calculated as (TP+TN)/(TP+TN+FP+FN). Table 1 lists the
average testing accuracy (%) of the experiments. In the table, COMB-SET (2D) means that the feature set is of the first type and that the features are extracted from the spectrum of the second-order derivatives, and so on.

Table 1. The average testing accuracy (%) of the two types of feature sets of the high-frequency spectrum of the second, third, and fourth order derivatives
Feature set          | Hide4PGP | Invisible Secrets | S-tool4 | Steghide
COMB-SET (2D)        | 99.0     | 90.5              | 87.1    | 81.0
DIFF-SET (2D)        | 98.7     | 90.6              | 86.7    | 82.8
COMB-SET (3D)        | 99.2     | 91.4              | 88.3    | 83.8
DIFF-SET (3D)        | 99.6     | 95.2              | 88.8    | 78.8
COMB-SET (4D)        | 99.0     | 90.9              | 84.8    | 85.7
DIFF-SET (4D)        | 98.5     | 91.1              | 87.1    | 82.0
COMB-SET (2,3,&4D)   | 99.1     | 95.9              | 88.1    | 85.4
DIFF-SET (2,3,&4D)   | 99.3     | 98.7              | 91.6    | 81.7
Table 1 indicates that the steganalysis of Hide4PGP audio signals achieves the best performance, followed by Invisible Secrets, S-tools, and Steghide. The results are consistent with the analysis in Section 4.1. The best average testing accuracy is 99.6% for Hide4PGP, 98.7% for Invisible Secrets, 91.6% for S-tool4, and 85.7% for Steghide. This shows that our steganalysis is highly successful. Table 2 compares the testing results of FSS, the high-order statistics based on linear prediction coding [11, 12], here called LPC-HOS, and the Expanding Markov Features [27], here abbreviated EMF. Table 2 shows that the advantage of FSS over LPC-HOS and EMF is dramatic; especially in the steganalysis of Invisible Secrets, S-tool4 and Steghide, FSS gains improvements of about 17% to 37%, 19% to 35%, and 19% to 29%, respectively.

Table 2. The testing accuracy (%) of FSS, LPC-HOS [11, 12], and EMF [27]
Steganalysis Method | Hide4PGP | Invisible Secrets | S-tool4 | Steghide
FSS (average)       | 99.1     | 93.0              | 91.6    | 81.7
LPC-HOS             | 80.0     | 56.9              | 57.6    | 53.2
EMF                 | 99.1     | 76.3              | 72.7    | 62.9
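For completeness, the evaluation protocol of Section 4.2 (random 75/25 split, RBF-kernel SVM, 30 repetitions, accuracy = (TP+TN)/total) can be sketched as follows. This assumes scikit-learn and toy placeholder features; it is not the authors' experimental code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_trial(X, y, seed):
    """One repetition: random 75/25 split, RBF-kernel SVM, overall accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# usage with placeholder features (label 0 = cover, 1 = steganogram)
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 30))
y = np.repeat([0, 1], 1000)
X[y == 1] += 0.3                     # crude separation for the toy data
accs = [run_trial(X, y, s) for s in range(30)]
print(f"average testing accuracy: {100 * np.mean(accs):.1f}%")
```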
5 Discussion
To generate the reference signal, we simply randomly modified the least significant bits of the testing signal. Since the embedding performed by Hide4PGP does not modify the signal in the same way, this explains why, regarding the F-statistics, $m^d_i$ and $\sigma^d_i$, shown in Fig. 3(b), are not better than $m_i$ and $\sigma_i$, shown in Fig. 3(a). If we exactly simulated the embedding of the hiding methods/tools by embedding a random signal to generate the signal reference, and calculated the difference between the features from the testing signals and those from the reference, an improvement of the steganalysis performance could be expected. The steganalysis performance in detecting Hide4PGP audio steganograms is much better than the detection of the other steganograms. We analyzed the embedding procedures of these hiding tools. Hide4PGP has a bigger embedding capacity, and the modified bits of the signals are not restricted to the least significant bit but extend to the last few least significant bits; hence it makes more modifications and results in more significant changes to the derivatives, which makes it highly detectable in the high-frequency spectrum. Here we want to mention the poor performance of the compared method, LPC-HOS, in detecting Invisible Secrets, S-tools4, and Steghide. In our opinion, in these hiding systems the modification caused by the data hiding is very small; however, the error resulting from the linear prediction may be much higher than the hiding modification, so the detection performance is not good. Our FSS method overcomes this drawback and obtains good detection results. We did not study the issue of feature selection. By employing feature selection methods and choosing an optimal feature set, a further improvement of the steganalysis performance can reasonably be expected.
6 Conclusions
In this paper, we proposed a Fourier Spectrum Steganalysis (FSS) method for digital audio signals. We first introduced the second and higher order derivatives of the signals and employed the Fourier transform to obtain the spectra of the derivatives. By randomly modifying the least significant bits, the signal reference is generated and the spectra of its derivatives are produced. By extracting the statistics of the high-frequency spectrum of the derivatives of the signal and of the reference, we employ an SVM to discriminate the features from carrier signals and those from stego-signals. Experimental results indicate that our proposed FSS is highly promising and gains remarkable improvements, in comparison with the high-order statistics based on linear prediction coding [11, 12] and the latest modified Markov approach with expanded features [27], for detecting information-hiding in digital audio signals.
Acknowledgement
The authors gratefully acknowledge the support for this research from ICASA, a research division of New Mexico Tech.
References 1. Harmsen, J., Pearlman, W.: Steganalysis of Additive Noise Modelable Information Hiding. In: Proc. of SPIE Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents, vol. 5020, pp. 131–142 (2003) 2. Lyu, S., Farid, H.: How Realistic is Photorealistic. IEEE Trans. on Signal Processing 53(2), 845–850 (2005) 3. Shi, Y., Chen, C., Chen, W.: A Markov process based approach to effective attacking JPEG steganography. LNCS, vol. 437, pp. 249–264. Springer, Heidelberg (2007) 4. Liu, Q., Sung, A., Ribeiro, B., Ferreira, R.: Steganalysis of Multi-class JPEG Images Based on Expanded Markov Features and Polynomial Fitting. In: Proc. of 21st International Joint Conference on Neural Networks, pp. 3351–3356 (2008) 5. Liu, Q., Sung, A.: Feature Mining and Nuero-Fuzzy Inference System for Steganalysis of LSB Matching Steganography in Grayscale Images. In: Proc. of 20th International Joint Conference on Artificial Intelligence, pp. 2808–2813 (2007) 6. Liu, Q., Sung, A., Xu, J., Ribeiro, B.: Image Complexity and Feature Extraction for Steganalysis of LSB Matching Steganography. In: Proc. of 18th International Conference on Pattern Recognition, ICPR, vol. (1), pp. 1208–1211 (2006) 7. Liu, Q., Sung, A., Chen, Z., Xu, J.: Feature Mining and Pattern Classification for Steganalysis of LSB Matching Steganography in Grayscale Images. Pattern Recognition 41(1), 56–66 (2008) 8. Liu, Q., Sung, A., Ribeiro, B., Wei, M., Chen, Z., Xu, J.: Image Complexity and Feature Mining for Steganalysis of Least Significant Bit Matching Steganography. Information Sciences 178(1), 21–36 (2008) 9. Fridrich, J.: Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganographic Schemes. In: Fridrich, J. (ed.) IH 2004. LNCS, vol. 3200, pp. 67–81. Springer, Heidelberg (2004) 10. Pevny, T., Fridrich, J.: Merging Markov and DCT Features for Multi-Class JPEG Steganalysis. In: Proc. SPIE Electronic Imaging, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents IX, vol. 6505 (2007) 11. Ru, X., Zhang, H., Huang, X.: Steganalysis of Audio: Attaching the Steghide. In: Proc. of the Fourth International Conference on Machine Learning and Cybernetics, pp. 3937–3942 (2005) 12. Ru, X., Zhang, Y., Wu, F.: Audio Steganalysis Based on “Negative Resonance Phenomenon” Caused by Steganographic Tools. Journal of Zhejiang University Science A 7(4), 577–583 (2006) 13. Avcibas, I.: Audio Steganalysis with Content-independent Distortion Measures. IEEE Signal Processing Letters 13(2), 92–95 (2006) 14. Ozer, H., Sankur, B., Memon, N., Avcibas, I.: Detection of Audio Covert Channels Using Statstical Footprints of Hidden Messages. Digital Signal Processing 16(4), 389–401 (2006) 15. Johnson, M., Lyu, S., Farid, H.: Steganalysis of Recorded Speech. In: Proc. SPIE, vol. 5681, pp. 664–672 (2005) 16. Kraetzer, C., Dittmann, J.: Pros and Cons of Mel-cepstrum Based Audio Steganalysis Using SVM Classification. In: Furon, T., Cayre, F., Doërr, G., Bas, P. (eds.) IH 2007. LNCS, vol. 4567, pp. 359–377. Springer, Heidelberg (2008) 17. Kraetzer, C., Dittmann, J.: Mel-cepstrum based steganalysis for voip-steganography. In: Proc. SPIE, San Jose, CA, USA, vol. 6505 (2007) 18. Gonzalez, R., Woods, R.: Digital Image Processing, 3rd edn. Prentice Hall, Englewood Cliffs (2008)
19. Hill, T., Lewicki, P.: Statistics: Methods and Applications. StatSoft, Inc. (2005) ISBN: 1884233597 20. Hide4PGP, http://www.heinz-repp.onlinehome.de/Hide4PGP.htm 21. Invisiblesecrets, http://www.invisiblesecrets.com/ 22. S-tools4, http://digitalforensics.champlain.edu/download/ s-tools4.zip 23. Steghide, http://steghide.sourceforge.net/ 24. Vapnik, V.: Statistical Learning Theory. John Wiley, Chichester (1998) 25. Oppenheim, A., Schafer, R., Buck, J.: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs (1999) 26. http://mathworld.wolfram.com/FourierTransformGaussian.html 27. Liu, Q., Sung, A., Qiao, M.: Detecting Information-Hiding in WAV Audio signals. In: Proc. of 19th International Conference on Pattern Recognition, Tampa, FL, USA (2008) 28. Zeng, W., Ai, H., Hu, R.: A Novel Steganalysis Algorithm of Phase coding in Audio Signal. In: Proc. the Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT), pp. 261–264 (2007) 29. Zeng, W., Ai, H., Hu, R.: An Algorithm of Echo Steganalysis based on Power Cepstrum and Pattern Classification. In: Proc. International Conference on Information and Automation (ICIA), pp. 1667–1670 (2008) 30. Zhang, F., Pan, Z., Cao, K., Zheng, F., Wu, F.: The upper and lower bounds of the information-hiding capacity of digital images. Information Sciences 178(14), 2950–2959 (2008) 31. Chang, C., Lin, C., Tseng, C., Tai, W.: Reversible hiding in DCT-based compressed images. Information Sciences 177(13), 2768–2786 (2007) 32. Chang, C., Lin, C.: Reversible steganographic method using SMVQ approach based on declustering. Information Sciences 177(8), 1796–1805 (2007) 33. Lin, C., Chen, S., Hsueh, N.: Adaptive embedding techniques for VQ-compressed images. Information Sciences, doi:10.1016/j.ins.2008.09.001 34. Liu, C., Liao, S.: High-performance JPEG steganography using complementary embedding strategy. Pattern Recognition 41(9), 2945–2955 (2008) 35. Qiao, M., Sung, A., Liu, Q.: Steganalysis of MP3Stego. In: Proc. of 22nd International Joint Conference on Neural Networks, Atlanta, GA, USA (2009)
Audio-Based Emotion Recognition in Judicial Domain: A Multilayer Support Vector Machines Approach

E. Fersini 1, E. Messina 1, G. Arosio 1, and F. Archetti 1,2

1 DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy
{fersini,messina,arosio,archetti}@disco.unimib.it
2 Consorzio Milano Ricerche, Via Cicognara 7, 20129 Milano, Italy
[email protected]
Abstract. Thanks to recent progress in judicial proceedings management, especially related to the introduction of audio/video recording systems, semantic retrieval has become a key challenge. In this context an emotion recognition engine, through the analysis of the vocal signatures of the actors involved in judicial proceedings, could provide useful annotations for the semantic retrieval of multimedia clips. With respect to the generation of semantic emotional tags in the judicial domain, two main contributions are given: (1) the construction of an Italian emotional database for the annotation of Italian proceedings; (2) the investigation of a hierarchical classification system, based on a risk minimization method, able to recognize emotional states from vocal signatures. In order to estimate the degree of affection we compared the proposed classification method with SVM, K-Nearest Neighbors and Naive Bayes, highlighting, in terms of classification accuracy, the improvements given by a hierarchical learning approach.
1 Introduction
The IT infrastructure introduced into judicial environments, with particular attention to audio/video recording systems in courtrooms, has had a great impact on the work of legal actors. All the recorded events that occur during a trial are available for subsequent consultation. However, despite the huge quantity of information expressed in multimedia form that is captured during trials, the current content retrieval process is based on manual consultation of the entire multimedia tracks or, in the best case, on an automatic retrieval service based on textual user queries, with no possibility of searching for specific semantic concepts. Emotional states associated with the actors involved in courtroom debates represent one of the semantic concepts that can be extracted from multimedia sources, indexed and subsequently retrieved for consultation purposes. Despite the progress in understanding the mechanisms of emotions in human speech from a psychological point of view, progress in the design and development of automatic emotion recognition systems for practical applications is still in its
infancy, especially in judicial contexts. This limited progress is due to several reasons: (1) the representation of the vocal signal with a set of numerical features able to achieve reliable recognition; (2) the identification of those emotional states that derive from a composition of other emotions (for example, the "remorse" emotion is a combination of "sadness" and "disgust"); (3) the presence of inter-speaker differences such as variations in language and culture; (4) noisy environments; (5) interaction among speakers; (6) the quality of the emotional database used for learning, and its likelihood with respect to real-world uttered emotions. A general emotion recognition process can be described by four main phases: dataset construction, attribute extraction, feature selection/generation and inference model learning. The first phase deals with the collection of a corpus of voice signals uttered by different speakers and representative of several emotional states. Once the database is created, the feature extraction step is performed in order to map the vocal signals into descriptive attributes collected in a series of numerical vectors. From these attributes, a feature set able to better discriminate emotional states is derived through a feature selection/construction phase. These features are used in the final step to create a classification model able to infer the emotional states of unlabelled speakers. The literature can be classified according to these four main phases. Concerning the dataset construction step, several benchmarks in different languages have been collected; among others we can find the Serbian [9], German [2] and Polish [5] emotional corpora. Considering the attribute extraction phase, two of the most comprehensive studies ([1] and [16]) were aimed at discovering the attribute sets that correlate best with a given collection of emotional states. Their results highlighted that pitch and spectral information have a high impact in automatic emotion recognition systems. With respect to the feature selection step, there exists a great number of approaches aimed at identifying the most discriminative characteristics for a set of emotional states. We can distinguish between filter and generation methods. Filters evaluate the "contribution" of a given feature with respect to intrinsic properties of the data through, for instance, Linear Discriminant Analysis [10] or a Mahalanobis class separability measure [4]. Generation methods, as presented in [11], [12] and [13], try to create a feature representation derived from the features initially extracted. Concerning the final step, related to the induction of inference models able to recognize the emotional states of unlabelled speakers, classification algorithms have been extensively investigated. In [15] and [16] a comparison between several classification algorithms is performed. Other popular classification algorithm investigations, limited to a subset of emotional states, include the Naive Bayes classifier [3] [8], Kernel Regression [3], K-Nearest Neighbor [3] [14] and Hidden Markov Models [6]. In this paper, we address the problem of finding the model that, with respect to the characteristics of courtroom debates, is able to produce the optimal recognition performance. The outline of the paper is the following. In Section 2 we present two emotional corpora: a well-known benchmark for the German language is introduced, while a new benchmark is proposed for the Italian language.
In Section 3 the extraction of vocal signature from uttered emotional sentences is
described. In Section 4 traditional inference models and the proposed Multilayer Support Vector Machines approach, with their respective experimental results, are presented. Finally, in Section 5 conclusions are drawn.
2 Emotion Corpus
The performance of an automatic emotion recognition system strictly depends on the quality of the database used for inducing the inference model. There are three ways of recording the sentences:
1. recording by professional actors: the actors identify themselves with a specific situation before acting a given "emotional" sentence;
2. Wizard-of-Oz (WOZ): a system interacts with the actors and guides them into a specific emotional state that is subsequently recorded;
3. recording of real-world human emotions: the "emotional" sentences are gathered by recording real-life situations.
In order to compare the performance of learning algorithms with the state of the art, we chose from the literature one of the most widely used emotional corpora, known as the Berlin Database of Emotional Speech, or Emo-DB. This emotional corpus is composed of a set of wave files (531 samples) that represent different emotional states: neutral, anger, fear, joy, sadness, disgust and boredom. The emotion distribution for the German corpus is depicted in Fig. 1(a) (for a more detailed description refer to [2]). A further benchmark, built at the University of Milano-Bicocca, is presented in the next subsection.
Fig. 1. Emotional states distributions: (a) German corpus (Emo-DB); (b) Italian corpus (ITA-DB)
2.1 Italian Emotional DB
As pointed out in section 1, emotion recognition can be strongly influenced by several factors, and in particular by language and culture. For this reason, we decided that it would be useful to adopt an Italian corpus in order to investigate
Italian emotional behaviors. Since at the time of writing there was no Italian benchmark, we decided to manually collect a set of audio files¹. Due to the difficulty of finding actors available to record acted sentences, and the even more complicated task of obtaining recordings of real-world situations, we collected audio files from movies and TV series dubbed by Italian professional actors. Differently from other databases used in emotion recognition, in which the number of speakers varies from 5 to 10 as in [16] and [7], our database construction is aimed at creating a generic corpus: 40 movies and TV series are taken into account and, for each of them, sentences acted by different actors are collected. Thus the number of speakers is relatively high, making the system as independent of the speaker as possible. The Italian Emotional Corpus, named ITA-DB, is composed of 391 balanced samples of different emotional states relevant to Italian judicial proceedings: anger, fear, joy, sadness and neutral. This subset of emotions is chosen in order to model the most interesting emotional states, from the judicial actors' point of view, that could occur during Italian courtroom debates. All samples are recorded with Pulse Code Modulation, a sampling frequency of 16 kHz, 8 bits per sample, a bitrate of 256 kbps and a single channel (mono format). The emotion distribution for the Italian corpus is depicted in Figure 1(b).
3 Extraction of Vocal Signatures
Although there is not yet general agreement on which features are the most representative, the most widely used are prosodic features, like the fundamental frequency (also known as pitch or F0) and the formant frequencies (F1, F2, F3), energy-related features and Mel Frequency Cepstral Coefficients (MFCC). The fundamental and formant frequencies refer to the frequency of vocal cord vibration, labelling the human vocal tone in a quite unambiguous way; energy refers to the intensity of the vocal signal, and Mel Frequency Cepstral Coefficients concern the spectrum of the audio signal. Duration, rate and pause related features are also used, as well as different types of voice quality features. In our work, for each audio file, an attribute extraction process was performed. Initially, the audio signal was sampled and split into 10 ms frames, and for each of these frames 8 basic features were extracted. We calculated prosodic features such as F0 (pitch), F1, F2, F3, intensity-related features like the energy and its high- and low-passed versions, and a spectral analysis made up of the first 10 MFCC coefficients normalized by the Euclidean norm. After this first step an 8-feature vector for each frame was obtained. In order to extract the necessary features from this information, we considered for each attribute three time series, i.e. the series itself, the series of its maxima and the series of its minima, and we computed a set of statistical indices. In particular, for each series that describes one of the attributes over the N frames, we computed 10 statistics: minimum, maximum, range (difference between min and max), mean, median, first quartile, third quartile, interquartile range, variance and mean of the absolute value of the local derivative.
1 Italian Emotional Database available at http://www.mind.disco.unimib.it/downloads
Fig. 2. Feature Extraction Process
At the end of this feature extraction process, each vocal signal is represented in a feature space characterized by 240 components (8 × 3 × 10). The entire feature extraction process is depicted in Figure 2.
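To make the mapping concrete, the following sketch reproduces the 8 × 3 × 10 construction in Python/NumPy. It assumes that the eight per-frame basic features (F0, F1, F2, F3, energy, its high- and low-passed versions, and the MFCC norm) have already been computed by an external extractor; the simple neighbour comparison used to detect local maxima and minima is our own assumption, since the paper does not specify how the extrema series are built.

```python
import numpy as np

def statistics(series):
    """Ten statistics computed on a 1-D series, as described in the text."""
    q1, med, q3 = np.percentile(series, [25, 50, 75])
    return [series.min(), series.max(), series.max() - series.min(),
            series.mean(), med, q1, q3, q3 - q1, series.var(),
            np.abs(np.diff(series)).mean()]  # mean absolute local derivative

def vocal_signature(frame_features):
    """frame_features: array of shape (N_frames, 8) holding the per-frame
    basic features. Returns a 240-dimensional vector (8 x 3 x 10)."""
    signature = []
    for j in range(frame_features.shape[1]):
        series = frame_features[:, j]
        inner = series[1:-1]
        # series itself, series of local maxima, series of local minima
        maxima = inner[(inner > series[:-2]) & (inner > series[2:])]
        minima = inner[(inner < series[:-2]) & (inner < series[2:])]
        for s in (series, maxima, minima):
            s = s if len(s) > 1 else series  # fall back if too few extrema
            signature.extend(statistics(s))
    return np.array(signature)  # length 240
```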
4 Emotional State Inference Models
The feature extraction phase, which creates a feature vector for each audio file, allows us to consider emotion recognition as a generic machine learning problem. The learning algorithms investigated, presented in the following subsections, can be distinguished into Flat and Multi-Layer classification.
4.1 Flat Classification
Among the different machine learning techniques we experimented with, we report the results of the most promising ones: Naïve Bayes, K-Nearest Neighbor and Support Vector Machines. Their optimal parameters, i.e. the parameter K of the Nearest Neighbor approach and the kind of kernel function and the complexity parameter C of the Support Vector Machines, were investigated. A 10-fold Cross Validation is used as the testing method. In Figure 3(a) the results, based on optimal parameters, are reported. The optimal performance is obtained by setting K = 20 for the K-Nearest Neighbor, for both emotional corpora. The optimal performance for the Support Vector Machines is obtained with a polynomial kernel with exponent E = 1 and complexity parameter C = 2 for the Italian corpus, and with a polynomial kernel with exponent E = 1 and complexity parameter C = 3 for the German corpus. The results show that the machine learning algorithm that performs best is the one based on Support Vector Machines. It is interesting to note that some similar emotions (similar in terms of vocal parameters), like anger/joy, neutral/boredom and neutral/sadness, do not allow the classifier to distinguish between them (see Emo-DB in Figure 3(c) and
Fig. 3. Flat Classification Experimental Results
ITA-DB in Figure 3(d)). Another interesting remark, highlighted in Figure 3(b), is related to the investigation of male and female emotion classification performed by two distinct SVMs: learning gender-dependent models produces better performance than a unique model. This is because some features used to discriminate emotional states are gender-dependent; the fundamental frequency F0 is one of them: women usually have higher F0 values than men because of the different size of the vocal tract, in particular the larynx. Starting from these conclusions, we defined a multi-layer model based on the optimal learner, i.e. Support Vector Machines.
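For readers who want to reproduce this flat baseline, an equivalent setup can be written with scikit-learn as follows. The snippet is a re-implementation sketch, not the authors' code: it assumes the 240-dimensional vectors and the emotion labels are already loaded into X and y, and it plugs in the reported optimal parameters (K = 20; polynomial kernel with E = 1 and C = 2 for ITA-DB or C = 3 for Emo-DB).

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# X: (n_samples, 240) vocal signatures, y: emotion labels (assumed available)
classifiers = {
    "Naive Bayes": GaussianNB(),
    "K-NN (K=20)": KNeighborsClassifier(n_neighbors=20),
    # polynomial kernel with exponent E=1; C=2 (ITA-DB) or C=3 (Emo-DB)
    "SVM (poly, E=1, C=2)": SVC(kernel="poly", degree=1, C=2),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```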
4.2 Hierarchical Classification
As highlighted in the previous sections, inference models are influenced by language, gender and "similar" emotional states. For this reason we propose a Multi-Layer Support Vector Machine approach that tries to overcome the mentioned limitations. At the first layer a Gender Recognizer model is trained to determine the gender of the speaker, distinguishing "male" speakers from "female" ones. In order to avoid overlapping with other emotional states, at the second layer gender-dependent models are trained. In particular, a Male Emotion Detector and a Female Emotion Detector are induced to produce a binary classification that discriminates the "excited" emotional states from the "not excited" ones (i.e. the neutral emotion). The last layer of the hierarchical classification process is aimed at recognizing the different emotional states using Male Emotion Recognizer and Female Emotion Recognizer models, where only "excited" sentences are used to train the models for discriminating the remaining emotional states.
Fig. 4. Multi-Layer Support Vector Machines
A synthetic representation of the Multi-Layer Support Vector Machines is depicted in Figure 4. Since in this case too all the models embedded into the hierarchy are based on Support Vector Machines, we experimentally estimated the optimal parameter combinations. The performance obtained by the Multi-Layer Support Vector Machines is then compared with that provided by the traditional "Flat" Support Vector Machines for both Emo-DB and ITA-DB. The comparison reported in Figure 5 highlights the improvement, in terms of the number of instances correctly classified, obtained by the Multi-Layer Support Vector Machines with respect to the traditional model. Figure 6 shows the classification performance of each intermediate layer of the hierarchy. This has been done in order to understand how the error rate is distributed over the different classifiers of the hierarchy. As we go down the hierarchy layers the performance gets worse, and in the last layer it suffers a remarkable reduction. This is because the classifiers have different targets: in the root and in the first level, learning is simplified by using only two classes.
Fig. 5. Multi-Layer Experimental Results
Fig. 6. Multi-Layer Experimental Results
These are "male" and "female" for the root and "excited" and "not excited" for the first-layer classifiers; in the last layer a more complex discrimination is required: 6 emotions for Emo-DB and 4 for the ITA Emotional DB. A further motivation, related to the decreasing number of instances used to estimate the models in the lower layers, could explain the performance reduction. In fact, while the Gender Recognizer can learn on the entire dataset, learning for the Male and Female Emotion Detectors is performed on two subsets of the whole dataset: the first model is trained by using only male instances and the second one by considering only female samples. The same happens for the last layer, i.e. the Male Emotion Recognizer and the Female Emotion Recognizer, which are induced by using "excited" male and "excited" female samples, respectively.
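A schematic re-implementation of the three-layer cascade (gender recognizer, gender-dependent excited/neutral detector, gender-dependent emotion recognizer) is sketched below. It is not the authors' code; the kernel settings and label names are placeholders, and X is assumed to be a NumPy array of the 240-dimensional vocal signatures with parallel lists of gender and emotion labels.

```python
from sklearn.svm import SVC

class MultiLayerSVM:
    """Gender Recognizer -> per-gender Emotion Detector (excited/neutral)
    -> per-gender Emotion Recognizer, each layer being an SVM."""
    def __init__(self):
        self.gender = SVC(kernel="poly", degree=1)
        self.detector = {g: SVC(kernel="poly", degree=1) for g in ("male", "female")}
        self.recognizer = {g: SVC(kernel="poly", degree=1) for g in ("male", "female")}

    def fit(self, X, gender, emotion):
        self.gender.fit(X, gender)
        for g in ("male", "female"):
            idx = [i for i, gi in enumerate(gender) if gi == g]
            excited = ["neutral" if emotion[i] == "neutral" else "excited" for i in idx]
            self.detector[g].fit(X[idx], excited)
            exc_idx = [i for i in idx if emotion[i] != "neutral"]
            self.recognizer[g].fit(X[exc_idx], [emotion[i] for i in exc_idx])
        return self

    def predict_one(self, x):
        g = self.gender.predict([x])[0]
        if self.detector[g].predict([x])[0] == "neutral":
            return "neutral"
        return self.recognizer[g].predict([x])[0]
```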
5 Conclusion and Future Work
In this paper the problem of producing semantic annotations for multimedia recordings of judicial proceedings is addressed. In particular, two main contributions are given: the construction of an Italian emotional database for the annotation of Italian proceedings, and the investigation of a multi-layer classification system able to recognize emotional states from vocal signals. The proposed model outperforms traditional classification algorithms in terms of instances correctly classified. In our investigation the evolution of the speakers' emotions is not considered. We believe that taking into account the dynamics of the emotional process could improve recognition performance. A further development will regard the fusion of different information sources in order to produce a more accurate prediction.
Acknowledgment This work has been supported by the European Community FP-7 under the JUMAS Project (ref.: 214306).
References
1. Batliner, A., Fischer, K., Huber, R., Spilker, J., Nöth, E.: How to find trouble in communication. Speech Commun. 40(1-2), 117–143 (2003)
2. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Interspeech 2005, pp. 1517–1520 (2005)
3. Dellaert, F., Polzin, T., Waibel, A.: Recognizing emotion in speech. In: Proc. of the CMC, pp. 1970–1973 (1996)
4. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
5. Slot, K., Cichosz, J.: Application of selected speech-signal characteristics to emotion recognition in the Polish language. In: Proc. of the 5th International Conf. on Signals and Electronic Systems, pp. 409–412 (2004)
6. Nwe, T.L., Foo, S.W., De Silva, L.C.: Speech emotion recognition using hidden Markov models. Speech Communication 41(4), 603–623 (2003)
7. Petrushin, V.: Emotion recognition in speech signal: Experimental study, development, and application. In: Proc. Sixth International Conf. on Spoken Language Processing (ICSLP 2000), pp. 222–225 (2000)
8. Polzin, T., Waibel, A.: Emotion-sensitive human-computer interface. In: Proc. of the ISCA Workshop on Speech and Emotion (2000)
9. Dordevic, M., Rajkovic, M., Jovicic, S., Kasic, Z.: Serbian emotional speech database: design, processing and evaluation. In: Proc. of the 9th Conf. on Speech and Computer
10. Schuller, B., Arsic, D., Wallhoff, F., Rigoll, G.: Emotion recognition in the noise applying large acoustic feature sets. In: Speech Prosody (2006)
11. Schuller, B., Reiter, S., Rigoll, G.: Evolutionary feature generation in speech emotion recognition. In: Proc. of the 2005 IEEE International Conf. on Multimedia and Expo, pp. 5–8 (2005)
12. Sedaaghi, M.H., Kotropoulos, C., Ververidis, D.: Using adaptive genetic algorithms to improve speech emotion recognition. In: Proc. of the 9th Multimedia Signal Processing Workshop, pp. 461–464 (2007)
13. Sedaaghi, M.H., Kotropoulos, C., Ververidis, D.: Using adaptive genetic algorithms to improve speech emotion recognition. In: Proc. XV European Signal Processing Conf., pp. 2209–2213 (2007)
14. Väyrynen, E., Seppänen, T., Toivanen, J.: Automatic discrimination of emotion from spoken Finnish. Language and Speech 47(4), 383–412 (2004)
15. Oudeyer, P.-Y.: Novel useful features and algorithms for the recognition of emotions in speech. In: Proc. of the 1st International Conf. on Speech Prosody, pp. 547–550 (2002)
16. Oudeyer, P.-Y.: The production and recognition of emotions in speech: features and algorithms. Int. J. Hum.-Comput. Stud. 59(1-2), 157–183 (2003)
Learning with a Quadruped Chopstick Robot Wei-Chung Lee, Jong-Chen Chen, Shou-zhe Wu, and Kuo-Ming Lin National Yunlin University of Science and Technology, Taiwan, R.O.C. {g9523708,jcchen,g9523743,g9623715}@yuntech.edu.tw
Abstract. Organisms exhibit a close structure-function relationship and a slight change in structure may in turn change their outputs accordingly [1]. This feature is important as it is the main reason why organisms have better malleability than computers in dealing with environmental changes. A quadruped chopstick robot controlled by a biologically motivated neuromolecular model, named Miky, has been developed. Miky's skeleton and its four feet are comprised of 16 disposable chopsticks, with each foot being controlled by an actuator (motor). The neuromolecular model is a multilevel neural network which captures the biological structure-function relationship and serves to transform signals sent from its sensors into a sequence of signals in space and time for controlling Miky's feet (through actuators). The task is to teach Miky to walk, jump, pace, gallop, or make a turn. Our experimental results show that Miky exhibits a close structure-function relationship that allows it to learn to accomplish these tasks in a continuous manner. Keywords: Evolutionary learning, Robot, Neural networks, Sensors.
1 Introduction In recent years, the advancement of computer hardware and software (including sensors) has brought our world gradually towards a so-called intelligent environment. To achieve this goal, system designers have to clearly define, in a predetermined manner, all possible cases that might occur in the real world and make sure everything works as expected when the system starts operating. All these efforts are made to minimize all kinds of possible errors. Thus, the overall system is intentionally made to be overly rigid. This approach can, where sufficient time and money are invested, produce excellent results, but there are major restrictions with regard to cost and flexibility. By contrast, organisms possess a mapping structure with fitness landscapes holding some degree of smoothness, such that a slight change in an organism's gene structure generally will not significantly alter its functions (or behaviors). Finding feasible solutions within a reasonable time may become much easier in a smooth landscape than in a rugged landscape [2]. In biological systems, the smoothness (gradualism) property is naturally represented in the close structure-function relationship. de Garis [3] developed an artificial brain that assembled a group of cellular automata-based neural net modules to control a robot. Nam et al. [4] propose an evolutionary learning circuit for controlling robots. Higuchi and his colleagues [5] worked on the
development of an evolvable hardware chip to control prosthetic hands in physical body movement and to navigate a robot in a nondeterministic and nonstationary environment. Thompson [6] presents an analog chip to control a mobile robot adaptively. However, as pointed out by Miller and Downing [7], a digital machine is lacking in flexibility as it is built by human beings with extremely stringent doping requirements; thus what is more crucial is how to enrich the doping. Vassilev et al. [8] propose the so-called neutrality principle, according to which increasing the number of functionally equivalent searches may facilitate the search, as it may provide more solutions for a problem. Thompson and Layzell [9] point out that the chance of finding better solutions increases if fewer constraints are put on the prejudices of a digital machine. Our aim in this study was not to construct a strictly rigid system; rather, given the same input, slightly different outputs might be possible. To achieve this, it was crucial to develop a structure made of substances holding some degree of flexibility. In Asia, tons of wood and bamboo are used every year just for one-time-use (disposable) chopsticks. These chopsticks are generally not very expensive, and thus anything built from them should be quite cost-effective. Another important feature is that these chopsticks are completely recyclable, and thus can always be used for different purposes. More importantly, as we know, a chopstick made from bamboo holds some degree of 'softness' in its structure. Chopsticks are very flexible and can bend or be bent easily. Also, we can easily change a robot's structure by adding components to it or removing them from it. With this feature, we can create specific shapes for robots and then reshape them later, if necessary. Thus, unlike traditional robots with rigid structures, a robot made from one-time-use chopsticks possesses a certain degree of flexibility and can change or be changed easily to suit a different situation. With the above-mentioned malleable features, to some extent, a robot comprised of disposable chopsticks can be thought of as an artificial organism. In the above, we have proposed a comparatively 'soft' robot. The proposed neuromolecular architecture [10] is a plastic architecture with rich dynamics that captures the biological structure-function relation in the system, in particular at the subneuronal level of processing. Our goal is to provide the digital machine with a representation of the internal structure-function relations of biological systems, to capture some of the dynamic modes of processing of these systems, and to incorporate learning algorithms of the type used in natural systems. Section 2 introduces the architecture of the proposed model and the evolutionary learning mechanisms. Section 3 shows our application domain and experimental results. Section 4 discusses the concept of double representations that facilitate evolutionary learning. The final section presents the concluding remarks.
2 Architecture In the following, we first explain Miky's structure. Then, we introduce the architecture of the neuromolecular model, which is used to train Miky to perform some coherent functions.
2.1 Miky's Structure - A Chopstick Robot Miky is a quadruped chopstick robot which has two front feet and two rear feet. Each front foot is comprised of one chopstick whereas each rear foot is comprised of three chopsticks. More chopsticks are used for a rear foot than for a front foot, as more weight has to be supported when the robot makes a jump motion. These four feet are linked to an "I"-shaped skeleton that consists of 8 chopsticks. In total, 16 disposable chopsticks are used. All of the above components are fastened together with fuse and rubber bands. The length, width, and height of Miky are shown in Figure 1. Each of these four feet is controlled by a specific actuator (motor MICRO MG 2BB). A servo controller (U5201 Parallax Servo Controller, PSC) mounted at the rear top of the skeleton serves to relay signals sent from the computer (through a USB wire) to a series of outputs that activate these four actuators in space and time.
Fig. 1. Physical size of Miky
2.2 The ANM System Motivated by information processing in the brain, the software system that controls Miky is a multilevel neural architecture. It consists of two types of neurons. The first type of neuron possesses a cytoskeletal style of information processing. Neurons of this type are motivated by the working hypothesis that information processing occurring on the cytoskeleton might directly or indirectly control neuronal activity [11-14]. Specifically, the cytoskeletal dynamics (motions) influence ion channels. Each of the neurons with intraneuronal dynamics serves to transduce specific combinations of input signals in space and time into an output signal (or a sequence of temporal output signals), which in turn controls the behaviors of the system. The timing of this pulse relative to the presynaptic inputs is controlled by the transduction dynamics that take place on the cytoskeleton. The intraneuronal model has different types of cytoskeletal fibers and each is responsible for transmitting a specific type of
cytoskeletal signal. These fibers might interact with each other through the proteins (microtubule associated proteins, MAPs) that connect them together. Specific combinations of these signal interactions may in turn fire a neuron. The strength of the interactions depends on the types of fibers involved. The second type of neurons is motivated by the hypothesis that some neurons can affect the firing of other neurons directly or indirectly [15]. Neurons of this type serve to assemble (or group) some other neurons to perform specific tasks. The mechanism involves synaptic facilitation, as in Hebbian models. Physiological correlates of this type of neurons can be found in [16]. The integration of these two types of neurons into a system constitutes an open evolutionary architecture with rich dynamics. Evolutionary learning is implemented by varying and selecting the patterns of input signals to a neuron, the patterns of signal flows on the cytoskeleton, and the patterns of signal interactions to which a neuron will respond. In this section, we introduce the intraneuronal architecture that plays the role of integrating spatiotemporal signals inside a neuron and the interneuronal architecture that orchestrates groups of neurons for performing coherent tasks. We then explain the evolutionary learning algorithm used in this model. A. Intraneuronal Architecture It has been firmly established by now that information processing inside a neuron is significant. The objective of the present study is not to identify the precise nature of these mechanisms, but rather to capture the working hypothesis that the cytoskeleton serves as a signal integration system. Our model is restricted to the membrane components. In the present implementation, the membrane of the cytoskeleton is abstracted as a macromolecular network (a cytoskeletal network) comprising a number of components capable of initiating, transmitting, and integrating cytoskeletal signals. Our assumption is that an interneuronal signal impinging on the membrane of a neuron is converted to an intraneuronal signal (a cytoskeletal signal) transmitting on the cytoskeleton. Different types of components have the characteristics of transmitting signals at different speeds and interacting with each other with different strengths. Cytoskeletal signals transmitting in space and time are integrated by the components in the cytoskeletal network. A neuron fires if the signals combine to activate a protein that serves as a readout molecule. (Note that, to fire a neuron, these signals do not have to make a rendezvous at the same time.) Different combinations of signals in space and time that fire a neuron are grouped as the set of spatiotemporal patterns that it recognizes. The initial repertoire of cytoskeletal networks is generated at random. Variation-selection algorithms act on the cytoskeletal networks to modify the patterns of signal initiation, transmission, and integration inside each neuron. In this way the repertoire of special purpose neuronal transducers is created and subsequently tuned for a desired task. We note that three important features friendly to evolution are captured in the cytoskeletal networks: weak interaction, redundancy, and compartmentalization [1]. By redundancy, we mean that some cytoskeletal signals are superfluous so that they can be added or removed without altering the outputs (or the firing behaviors) of a neuron. 
By weak interaction, we mean that the interactions among the constituent components of a neuron (through cytoskeletal signals) are so weak that they allow us to modify its input/output behaviors gradually.
Fig. 2. Cytoskeleton. Signals recognized by readin enzymes are integrated by the neuronal cytoskeleton. Signals propagate along microtubules, neurofilaments, and microfilaments at different speeds and affect each other with different strengths. Each of the components C1, C2, and C3 is an individual site on these filaments. Interactions between microtubules and neurofilaments, between neurofilaments and microfilaments, and between microtubules and microfilaments are mediated by MAPs. A neuron fires when cytoskeletal signals at a readout enzyme site exceed a threshold level.
Compartmentalization serves to block off signal interference occurring in different regions of a neuron so that each neuron can be trained as a specific input/output transducer. B. Interneuronal Architecture - Orchestral Learning Orchestration is an adaptive process, mediated by varying the neurons in the assembly, for selecting appropriate combinations of neurons to complete specific tasks. Currently, cytoskeletal neurons are divided into a number of comparable subnets. By comparable subnets, we mean that neurons in these subnets are similar in terms of their interneuronal connections and intraneuronal structures. Neurons in different subnets that have similar interneuronal connections and intraneuronal structures are grouped into a bundle. Two levels of reference neurons are used to manipulate these bundles of neurons. The two levels form a hierarchical control architecture (Fig. 3). The first is referred to as the low-level reference neurons, which directly control the bundles of cytoskeletal neurons; each of these controls a specific bundle. The second level is referred to as the high-level reference neurons, which play the role of grouping the low-level reference neurons. The activation of a high-level reference neuron will fire all of the low-level reference neurons that it controls, which in turn will activate some of these bundles of cytoskeletal neurons (i.e., neurons in different subnets that have similar intraneuronal structures). We note that only the bundles activated by the reference neurons are allowed to perform information processing. The connections between low-level reference neurons and cytoskeletal neurons are fixed. However, the connections between the high-level and low-level reference neuron layers are subject to change during evolutionary learning. The above process is called orchestral learning.
Fig. 3. Hierarchical interneuronal control architecture
C. Evolutionary Learning Four levels of intraneuronal evolutionary learning are allowed in this system. Evolution can occur at the level of initiating signal flows (controlled by readin enzymes), responding to signal flows (controlled by readout enzymes), transmitting signal flows (controlled by cytoskeletal PUs), and controlling PU interactions (controlled by MAPs). A readin enzyme connects a PU with one or more external inputs (or sensors). A cytoskeletal signal is initiated when a readin enzyme receives any of these input signals. The connections between readin enzymes and external inputs are subject to change during the course of learning. Adding a readin enzyme will expand the receptive fields of a neuron whereas removing it narrows the fields. Evolution at the level of readin enzymes is implemented by copying (with mutation) the readin enzyme pattern of each neuron in the best-performing subnets to all comparable neurons in lesser-performing subnets. Variation is realized by randomly adding or deleting some readin enzymes during the copy procedure. The evolutionary procedure at the level of readout enzymes is similar to that at the level of readin enzymes. Adding a readout enzyme will expand the set of cytoskeletal signal patterns recognized by a neuron, whereas removing it reduces the set. Processing units (PUs) are responsible for transmitting and integrating cytoskeletal signals. Evolution at the level of PU configurations is implemented by copying (with mutation) the PU configurations of neurons in the best-performing subnets to those of comparable neurons in the lesser-performing subnets. Variation is implemented by varying the PU configurations during the copy procedure. We note that different PU configurations exhibit different patterns of signal flows. The evolutionary procedure at the level of MAPs is similar to that at the level of cytoskeletal PUs. Adding an MAP will allow a cytoskeletal signal to interact with its neighboring PUs, which in turn increases the chance of initiating a new signal flow. Deleting an MAP will shut down the interaction between two PUs. In a nutshell, evolution at any of these four levels can alter the set of input patterns recognized by a neuron. That is, we can train each neuron to be a specific input/output transducer by turning on any of these four levels. In addition, except at the level of readout enzymes, evolution at the other three levels can also alter the patterns of
signal flows inside a neuron. This gives the ANM system great flexibility in changing the input/output behavior of a neuron. In the present implementation, the ANM system has 256 cytoskeletal neurons, which are divided into eight comparable subnets. As mentioned earlier, comparable subnets are similar in terms of their interneuronal connections and intraneuronal structures. Thus, their neurons can also be grouped into 32 bundles. The copy process occurs among neurons in the same bundle. The initial patterns of readin enzymes, readout enzymes, MAPs, and PU-types of the reproduction subnet are randomly decided. The evolutionary learning algorithm is shown in Fig. 4. Evolution of reference neurons is implemented by copying (with mutation) the patterns of low-level reference neuron activities loaded by the most fit high-level reference neurons to less fit high-level reference neurons (details can be found in [10]). The copying process is implemented by activating a most fit high-level reference neuron, which in turn reactivates the pattern of low-level reference neuron firing. This pattern is then loaded by a less fit high-level reference neuron. Variation is implemented by introducing noise into the copying process. Some low-level reference neurons activated by a most fit high-level reference neuron may fail to be loaded by a less fit high-level reference neuron. Or some low-level reference neurons that are not activated may fire and be "mistakenly" loaded by a less fit high-level reference neuron. In the present implementation, evolutionary learning at the reference neuron level is turned off, as we have not yet implemented it on digital circuits.
1. Generate at random the initial MAP, PU-type, readin enzyme, and readout enzyme patterns of each neuron in the comparable subnets. Each neuron is denoted by neuron (i, j), where i is the subnet number and j is the bundle number.
2. Evaluate the performance of each subnet and select the three best-performing subnets.
3. Copy the MAP, PU-type, readin enzyme, and readout enzyme patterns of each neuron in the best-performing subnets to those of comparable neurons in the lesser-performing subnets.
4. Vary the MAP, PU-type, readin enzyme, or readout enzyme patterns, depending on which level of evolutionary learning is operative.
5. Go to Step 2 unless the stopping criteria are satisfied.

Fig. 4. Evolutionary learning algorithm
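In outline, the variation-selection loop of Fig. 4 can be coded as follows. The subnet data structure (here a `.neurons` list), the fitness function and the mutation operator are placeholders, since the paper does not fix them at this level of detail; the sketch only mirrors steps 1-5 above.

```python
import copy
import random

def evolve(subnets, fitness, mutate, n_best=3, cycles=1000):
    """subnets: list of subnet objects, each holding per-neuron MAP, PU-type,
    readin- and readout-enzyme patterns (step 1 assumes they were randomly
    initialized elsewhere). fitness: subnet -> score (lower is better).
    mutate: neuron pattern -> slightly varied copy."""
    for _ in range(cycles):
        # Step 2: evaluate and select the best-performing subnets
        ranked = sorted(subnets, key=fitness)
        best, rest = ranked[:n_best], ranked[n_best:]
        # Steps 3-4: copy (with variation) the patterns of the best subnets
        # to the comparable neurons of the lesser-performing subnets
        for loser in rest:
            donor = random.choice(best)
            for bundle, neuron in enumerate(donor.neurons):
                loser.neurons[bundle] = mutate(copy.deepcopy(neuron))
        # Step 5: here the stopping criterion is simply a fixed cycle budget
    return min(subnets, key=fitness)
```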
3 Input/Output Interface and Application Domain As described above, Miky has seven sensors and four actuators (motors). Cytoskeletal neurons serve as the major components to transform signals sent from the sensors into a series of outputs for controlling Miky's four legs (through its four actuators). We first explain how to set up the connections between these sensors and the cytoskeletal neurons and how to transform the data collected by the sensors into
signals (external stimuli) for cytoskeletal neurons. Then, we describe the application domain. Finally, we explain how to evaluate the fitness of the system for each learning mode. A. Input/Output Interface We used three binary bits to encode the data of each sensor, representing the three different states of each sensor. That is, each sensor was encoded with a 3-bit pattern. In total, 21 bits were required to encode all of these seven sensors. All cytoskeletal neurons that had connections with a specific bit would receive the same stimulus simultaneously. When a readin enzyme received an external stimulus, a cytoskeletal signal was initiated. Which readin enzymes of a neuron would receive the stimuli from a sensor was randomly decided in the beginning but subject to change during the course of learning. All stimuli were sent to the cytoskeletal neurons simultaneously, and thus all cytoskeletal signals were initiated at the same time. The cytoskeletal neurons integrated these signals in space and time into a series of output signals for controlling effector neurons. There were four effector neurons, one for controlling the movement of each of Miky's four legs. Correspondingly, the cytoskeletal neurons were divided into four groups, so that each of these groups controls a specific effector neuron. An effector neuron fired when any of the cytoskeletal neurons controlling it fired, which in turn initiated the movement of a leg. We assumed that a firing effector neuron would enter a refractory state and then return to the quiescent state after the refractory period was over. We note that an effector neuron in the refractory state would ignore any stimuli from any cytoskeletal neurons during the refractory period. The input-output interface of Miky is shown in Figure 5.
Fig. 5. Input-output interface
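The 21-bit sensor encoding and the refractory behaviour of the effector neurons can be illustrated by the sketch below. The one-hot interpretation of the 3-bit pattern and the length of the refractory period are our assumptions; the paper only states that each sensor has three states and that a firing effector neuron ignores stimuli until its refractory period is over.

```python
def encode_sensors(sensor_states):
    """Map each of the 7 sensor states (0, 1 or 2) to a 3-bit one-hot pattern,
    giving the 21-bit external stimulus vector described in the text."""
    bits = []
    for state in sensor_states:           # e.g. [0, 2, 1, 0, 0, 1, 2]
        pattern = [0, 0, 0]
        pattern[state] = 1
        bits.extend(pattern)
    return bits                            # length 21

class EffectorNeuron:
    """Fires when any controlling cytoskeletal neuron fires, then stays
    refractory (ignoring further stimuli) for `refractory` time steps."""
    def __init__(self, refractory=5):      # refractory length is an assumption
        self.refractory = refractory
        self.timer = 0

    def step(self, cytoskeletal_firings):
        if self.timer > 0:                 # refractory: ignore all stimuli
            self.timer -= 1
            return False
        if any(cytoskeletal_firings):
            self.timer = self.refractory
            return True                    # initiates a leg movement
        return False
```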
B. Application Domain In the present study, Miky has been applied to the navigation problem, including walking, trotting, pacing, galloping, and making a turn. Navigating robots have been around since the seventies. However, how to coordinate the robot's feet in a coherent way so as to accomplish assigned tasks is still a challenging problem. Three types of experiments were performed with Miky. The first type of experiment was to teach Miky how to walk with a 4-time gait: LF (left-front), RB (right-back), RF (right-front), LB (left-back). Notice that, in this action mode, balance and support were maintained by the LB+RF "diagonal" while the LF and RB legs were suspended, and by the opposite diagonal for the other 2 legs. The second type of experiment was to teach Miky how to trot, pace, and gallop. Each of these experiments was performed separately and started from the same initial setup. The trotting mode was a 2-time gait, LB+RF alternating with RB+LF, like the walking mode described above. The third type of experiment was to teach Miky how to make a turn: left or right. Turning involves several phases and complicated angular movements of bone and muscle groups throughout the entire body. We assumed that the motion of each of Miky's legs was the same, including phase, angle, and speed. We also did not take into account how to balance its four legs, but only how to move them in space and time so as to accomplish the above tasks in a coherent fashion. Table 1 gives a brief description of the leg movements of each learning mode.

Table 1. Sequence of leg movements for each learning mode

Mode           Brief description
Walking        LF, RB, RF, LB, and repeat
Trotting       LF+RB alternating with LB+RF
Pacing         LB+LF alternating with RB+RF
Galloping      LB+RB alternating with LF+RF
Turning left   LB+RF+RB alternating with LF
Turning right  LB+LF+RB alternating with RF
C. Fitness Evaluation Notice that in the present implementation we trained Miky to correctly perform the first two steps of each leg. Here we assumed that the repetition of these two steps comprised a specific action mode, such as walking, galloping, and so on. We also assumed that it took less than 1 time unit for Miky to initiate a leg movement, including moving the leg forward and then backward to its original position. Ideally, any two subsequent leg movements should be separated by 1.0 time unit. Table 2 shows the expected times to initiate each leg movement for each mode of learning. The following illustrates the expected times to initiate each leg movement when we taught Miky to walk by following the sequence LF, RB, RF, LB. If Miky moved its LF leg at time 0.0, then it should move its RB leg at time 1.0, its RF leg at time 2.0, its LB leg at time 3.0, and its LF leg again at time 4.0. Another example is that when we taught Miky to trot, it should move both the LF and RB legs at time 0.0, then both the RF and LB legs at time 1.0, next both the LF and RB legs again at time 2.0, and finally both the RF and LB legs again at time 3.0.
Table 2. Expected times to move each of these four legs for each learning mode

Learning mode   e^1_LF  e^1_RF  e^1_LB  e^1_RB  e^2_LF  e^2_RF  e^2_LB  e^2_RB
Walking            0       2       3       1       4       6       7       5
Trotting           0       1       1       0       2       3       3       2
Pacing             0       1       0       1       2       3       2       3
Galloping          0       0       1       1       2       2       3       3
Turning left       0       1       1       1       2       3       3       3
Turning right      1       0       1       1       3       2       3       3
For each learning mode, the fitness was determined by how closely Miky followed its predetermined sequence. This means that not only the order but also the actual times of each leg movement had to match the above-mentioned expected times as closely as possible. Thus, the fitness of each leg movement was determined by the time difference (TD) between the actual time and the expected time to initiate its movement. For example, in the walking mode, the expected time to move the RF leg is 2.0. Thus, if Miky actually moved it at time 2.0, this leg was moving with perfect timing; however, if it was moved at time 2.3, then TD was set to 0.3. In total, there are four TD values, one for each of these four leg movements. If we allow each of these four legs to move twice, then there are eight TDs. The overall fitness, called the total time difference (TTD), is the sum of all these eight TD values, as shown in Eq. (1). The lower the TTD value, the better the fitness of the system.

$$
TTD = \sum_{i=1}^{2} \left( TD^{i}_{LF} + TD^{i}_{RF} + TD^{i}_{LB} + TD^{i}_{RB} \right)
    = \sum_{i=1}^{2} \left( |a^{i}_{LF} - e^{i}_{LF}| + |a^{i}_{RF} - e^{i}_{RF}| + |a^{i}_{LB} - e^{i}_{LB}| + |a^{i}_{RB} - e^{i}_{RB}| \right) \qquad (1)
$$

where $a^{i}_{X}$ and $e^{i}_{X}$ denote the actual and the expected time to move leg $X \in \{LF, RF, LB, RB\}$, and the superscript $i$ denotes the $i$-th movement of that leg.
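Eq. (1) amounts to a sum of absolute timing errors against the expected times of Table 2; a direct transcription, with the dictionary layout of the recorded times being merely an implementation choice, might look as follows.

```python
EXPECTED = {  # Table 2: (step, leg) -> expected time, here for the walking mode
    (1, "LF"): 0, (1, "RB"): 1, (1, "RF"): 2, (1, "LB"): 3,
    (2, "LF"): 4, (2, "RB"): 5, (2, "RF"): 6, (2, "LB"): 7,
}

def total_time_difference(actual, expected=EXPECTED):
    """actual: dict mapping (step, leg) to the measured initiation time.
    Returns TTD as in Eq. (1); lower values mean better fitness."""
    return sum(abs(actual[key] - expected[key]) for key in expected)

# Example with the walking-mode times recorded before learning (Table 3, cycle 1)
actual = {(1, "LF"): 0.00, (1, "RF"): 0.13, (1, "RB"): 0.38, (1, "LB"): 0.89,
          (2, "LF"): 2.07, (2, "RF"): 2.07, (2, "RB"): 1.52, (2, "LB"): 3.37}
print(total_time_difference(actual))  # 17.57
```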
4 Experimental Results For each learning mode, the TTD decreased as learning proceeded. Roughly speaking, it decreased significantly in the early stages of learning, but more slowly in the later stages. Importantly, however, learning did not stagnate completely before we terminated the simulation. For example, in the pacing mode, it seemed that no progress was made from cycle 194 to cycle 997, but at cycle 998 a slight improvement in learning performance appeared. Even though the improvement was too small to have any noticeable effect, it still showed that learning continued.
Fig. 6. Learning performance of each learning mode
Another example is that, in the walking mode, the experimental results showed almost no change over long periods of time, followed by rapid evolutionary changes occurring in short time frames. This process was repeated several times in the course of evolutionary learning. The phenomenon, in some sense, is analogous to the so-called punctuated equilibrium proposed by Eldredge and Gould [17]. In the present implementation, the above results are readily understood: because only one leg was allowed to move at a time, any change in the spatiotemporal control of one of the four legs could open up the opportunity for changes in the other three legs, which combined to find solutions with better fitness. It was thus the fitness value that changed in a continued manner as learning proceeded.
Table 3. Times to initiate the movements of each leg for each mode of learning

                        walk          pace          trot          gallop        left turn     right turn
cycle  step  leg      left   right  left   right  left   right  left   right  left   right  left   right
1      1     front    0.00   0.13   0.25   0.00   0.00   0.13   0.00   0.39   0.00   0.25   0.12   0.25
             rear     0.89   0.38   0.01   0.51   0.13   0.38   0.48   0.39   0.00   0.50   0.00   1.02
       2     front    2.07   2.07   2.20   1.10   1.62   2.07   1.57   1.57   0.00   1.37   1.65   2.20
             rear     3.37   1.52   1.11   1.65   1.38   1.66   3.20   3.22   1.27   1.64   1.10   3.05
20     1     front    0.00   0.37   0.00   0.25   0.00   0.38   0.00   0.00   0.00   0.01   0.00   0.00
             rear     0.98   0.00   0.01   0.26   0.27   0.00   0.90   0.90   0.27   0.13   0.12   0.97
       2     front    1.91   3.18   1.11   2.21   1.66   2.76   2.00   2.00   1.53   1.54   1.91   1.91
             rear     4.28   2.26   1.11   2.21   2.82   1.66   3.08   3.07   2.50   1.53   2.07   2.93
1000   1     front    0.00   1.10   0.00   0.26   0.00   0.77   0.00   0.00   0.00   0.00   0.00   0.00
             rear     2.05   0.13   0.00   0.26   0.77   0.00   0.90   0.90   0.56   0.00   0.00   0.98
       2     front    3.06   4.88   1.25   2.21   1.78   2.75   2.00   2.00   1.53   1.54   1.92   1.92
             rear     5.87   3.86   1.25   2.21   2.75   1.79   3.01   3.00   2.55   1.53   1.92   2.94
However, the task became more difficult when two feet had to change at the same time, as the change of one foot's movement had to wait for the change of the other so as to make a coherent change in Miky's outputs. As shown in Figure 6, learning appeared to be more difficult in the later stages when we taught Miky to pace, trot, or gallop than when we taught it to walk. However, it should be noted that, as above, learning did not stagnate. For example, there was an increase in fitness at cycle 998 in the pacing mode, at cycle 712 in the trotting mode, and at cycle 822 in the galloping mode. As above, the task was most difficult when Miky had to learn to make a turn, as it had to learn to move three legs at the same time. We set 0 to be the initiation time of the first movement of any of these four legs, and then recorded the elapsed time between the first movement and any of the subsequent movements of these four legs. Table 3 shows, for each learning mode, the times to initiate the movements of each of these four legs in space and time before and after evolutionary learning was applied. Before learning to walk, in the temporal order of leg movement, Miky moved its LF leg at time 0.0, its RF leg at 0.13, its RB leg at 0.38, its LB leg at 0.89, and then again moved its RB leg at 1.52, both its LF and RF legs at 2.07, and its LB leg at 3.37. Obviously, this order was quite different from what was expected. After 20 learning cycles, the order of leg movements was improved but the timing control was still far from what was expected. After 1000 learning cycles, the timing control of these four legs was improved significantly and a comparatively better solution was obtained. Even though the system was able to show continued learning as long as it was running in the learning phase, it would be very costly to go for the perfect solution (perfect timing control). For the present implementation it was unreasonable for the system to do that, given the resources available to it and an acceptable amount of time.
5 Conclusions Malleability is a highly desirable feature in artificial intelligence. In this study, we have developed a chopstick robot made of a number of disposable chopsticks and controlled by a biologically motivated neuromolecular model. Our strategy, as stated at the outset, has been to construct a substrate that is rich in evolutionary potentiality. The malleability of the chopstick robot presented in this paper is two-fold. First, we can elaborate the system in the direction of greater realism by continually adding extra features to it. Second, the robot is controlled by a malleable architecture: its intraneuronal processing allows the neurons to perform input-output transformations in a much more specific manner than simple averaging devices (neurons with simple threshold dynamics), and allows them to produce specific temporal patterns of outputs in response to different classes of input patterns. These transducing neurons are controlled by the neurons with control features to perform specific input-output tasks. This is an adaptive process, called the orchestration process, mediated by varying neurons in the assembly and selecting well performing combinations from the repertoire of neurons with internal dynamics; but it cannot create or modify the repertoire by itself. It should be noted that the above setup for cytoskeletal neuron dynamics is rather complicated. This complexity is an important feature of the model, since it allows for a high degree of evolutionary flexibility (or potentiality) as well as plastic (evolution-friendly) structure-function relations. The high degree of redundancy in the system, the use of signals with different strengths, and the presence of modulating components (MAPs) all contribute to this richness and plasticity. Embracing these features provides the proposed chopstick robot with great malleability for dealing with problems in the real world. The results in this study show that the chopstick robot can be successfully applied to a series of functional modes. The results also show that the amount of effort (learning time) required to solve a problem increases as we increase the task complexity (in the present implementation, moving several feet simultaneously). The above feature is important for addressing real-world problems, as it shows the potential for continuing to learn indefinitely. We have shown that these computational adaptability characteristics derive from the plasticity of the structure-function relations. The principle of double representation is central to these relations. The internal representation of the system's input-output behavior can be gradually molded, due to the malleability of its structure-function relations. These graceful (or soft) internal changes can nevertheless yield significant changes in the system's interactions with the external world. The computational adaptability properties achieved derive from the soft relationship between the two representations. Acknowledgment. This research was supported in part by the R.O.C. National Science Council (NSC97-2410-H-224-012).
References
1. Conrad, M.: Bootstrapping on the adaptive landscape. BioSystems 11, 167–182 (1979)
2. Conrad, M.: The geometry of evolution. BioSystems 24, 61–81 (1990)
3. de Garis, H.: An artificial brain: ATR's cam-brain project aims to build/evolve an artificial brain with a million neural net modules inside a trillion cell cellular automata machine. New Generation Computing Journal 12, 2 (1994)
4. Nam, D., Seo, Y.D., Park, L.-J., Park, C.H., Kim, B.: Parameter optimization of an on-robot voltage reference circuit using evolutionary programming. IEEE Trans. Evol. Comput. 5(4), 414–421 (2001)
5. Higuchi, T., Iwata, M., Keymeulen, D., Sakanashi, H., Murakawa, M., Kajitani, I., Takahashi, E., Toda, K., Salami, M., Kajihara, N., Otsu, N.: Real-world applications of analog and digital evolvable hardware. IEEE Trans. Evol. Comput. 3(3), 220–235 (1999)
6. Thompson, A.: Evolving electronic robot controllers that exploit hardware resources. In: Proc. 3rd European Conf. Artificial Life, Granada, Spain, pp. 640–656 (1995)
7. Miller, J.F., Downing, K.: Evolution in materio: looking beyond the silicon box. In: Proc. NASA/DoD Conf. Evolvable Hardware, pp. 167–176 (2002)
8. Vassilev, V.K., Job, D., Miller, J.F.: Towards the automatic design of more efficient digital circuits. In: Proc. 2nd NASA/DoD Workshop on Evolvable Hardware, Palo Alto, CA, pp. 151–160 (2000)
9. Thompson, A., Layzell, P.: Analysis of unconventional evolved electronics. Comm. ACM 42(4), 71–79 (1999)
10. Chen, J.-C., Conrad, M.: Learning synergy in a multilevel neuronal architecture. BioSystems 32(2), 111–142 (1994)
11. Liberman, E.A., Minina, S.V., Shklovsky-Kordy, N.E., Conrad, M.: Microinjection of cyclic nucleotides provides evidence for a diffusional mechanism of intraneuronal control. BioSystems 15, 127–132 (1982)
12. Hameroff, S.R., Watt, R.C.: Information processing in microtubules. J. Theoretical Biology 98, 549–561 (1982)
13. Matsumoto, G., Tsukita, S., Arai, T.: Organization of the axonal cytoskeleton: differentiation of the microtubule and actin filament arrays. In: Warner, F.D., McIntosh, J.R. (eds.) Cell Movement, vol. 2: Kinesin, Dynein, and Microtubule Dynamics, pp. 335–356. Alan R. Liss, New York (1989)
14. Werbos, P.: The cytoskeleton: why it may be crucial to human learning and to neurocontrol. Nanobiology 1, 75–95 (1992)
15. Conrad, M.: Molecular information processing in the central nervous system. In: Conrad, M., Güttinger, W., Dal Cin, M. (eds.) Physics and Mathematics of the Nervous System, pp. 82–127. Springer, Heidelberg (1974)
16. Conrad, M.: Molecular information structures in the brain. J. Neurosci. Res. 2, 233–254 (1976)
17. Eldredge, N., Gould, S.J.: Punctuated equilibria: an alternative to phyletic gradualism. In: Schopf, T.J.M. (ed.) Models in Paleobiology, pp. 82–115. Freeman, Cooper and Company, San Francisco (1972)
Dissimilarity Based Vector Space Embedding of Graphs Using Prototype Reduction Schemes Kaspar Riesen and Horst Bunke Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland {riesen,bunke}@iam.unibe.ch Abstract. Graphs provide us with a powerful and flexible representation formalism for object classification. The vast majority of classification algorithms, however, rely on vectorial data descriptions and cannot directly be applied to graphs. In the present paper a dissimilarity representation for graphs is used in order to explicitly transform graphs into n-dimensional vectors. This embedding aims at bridging the gap between the high representational power of graphs and the large amount of classification algorithms available for feature vectors. The basic idea is to regard the dissimilarities to n predefined prototype graphs as features. In contrast to previous works, the prototypes and in particular their number are defined by prototype reduction schemes originally developed for nearest neighbor classifiers. These reduction schemes enable us to omit the cumbersome validation of the embedding space dimensionality. With several experimental results we prove the robustness and flexibility of our new method and show the advantages of graph embedding based on prototypes gained by these reduction strategies.
1 Introduction
Classification refers to the process of assigning an unknown input object to one of a given set of classes. It is a common task in the areas of machine learning and data mining [1, 2]. Usually, a classifier is built on the basis of a training set of objects on which the classification rule is learned, based on some underlying mathematical model. Nearest neighbor classifiers, Bayes classifiers, neural networks, and support vector machines are widely used classifiers [3, 4]. Applications of classification can be found in biometric person identification, optical character recognition, automatic protein prediction, medical diagnosis, and many other domains. A crucial question is how one describes the objects under consideration adequately, such that a classification method can be applied. In statistical pattern recognition objects are represented by feature vectors. That is, an object x is formally represented as a vector of n measurements, or features. Hence, an object can be understood as a point in the n-dimensional real space, i.e. x = (x1, ..., xn) ∈ Rn. Representing objects or patterns by feature vectors x ∈ Rn offers a number of useful properties. In particular, due to the mathematical wealth of operations available in a vector space, a huge number of algorithms
for classification and analysis of objects given in terms of feature vectors have been developed in recent years [3, 4]. Yet, the use of feature vectors implicates two limitations. First, as vectors always represent a predefined set of features, all vectors in a particular application have to preserve the same length regardless of the size or complexity of the corresponding objects. Furthermore, there is no direct possibility to describe binary relationships among different parts of an object. It is well known that both constraints can be overcome by graph based representations [5]. As a matter of fact, graphs are not only able to describe properties of an object but also binary relationships among different parts. Furthermore, graphs are not constrained to a fixed size, i.e. the number of nodes and edges is not limited a priori and can be adapted to the size or the complexity of each individual object under consideration. Thus, a growing interest in graph-based object representation in machine learning and data mining can be observed [6–8]. One of the major drawbacks of graphs is that there is little mathematical structure in the graph domain. That is, most of the basic mathematical operations available for vectors do not exist for graphs. Nevertheless, a number of graph matching methods have been proposed, which can be employed to measure the dissimilarity, or distance, of graphs [5]. In this paper the edit distance of graphs is used as a dissimilarity measure [9]. However, although graph edit distance and related similarity measures allow us to compute distances between graphs, this is not sufficient for most standard pattern recognition algorithms. A promising direction to overcome the lack of algorithmic tools for graph classification is graph embedding. Basically, an embedding of graphs into a vector space establishes access to the rich repository of algorithmic tools for pattern analysis. Examples of graph embeddings can be found in [10–12]. The present paper considers a new class of graph embedding procedures which are based on prototype selection and graph edit distance computation. Originally the idea was proposed in [13] in order to map feature vectors into dissimilarity spaces. This idea was first generalized to string based object representation [14] and eventually to the domain of graphs [15]. The key idea of this approach is to use the distances of an input graph to a number of training graphs, termed prototype graphs, as a vectorial description of the input graph. That is, we use a dissimilarity representation rather than the original graph representation. Note that our graph embedding approach can be applied to both directed and undirected graphs, as well as to graphs without and with labels on their nodes and/or edges. In case there are labels on the nodes and/or edges, these labels can be of any nature (discrete symbols, the set of integer or real numbers, or whole attribute vectors). Even hypergraphs can be embedded with the method described in this paper. Hence, the proposed embedding approach is more general than other graph embedding techniques where (sometimes quite severe) restrictions on the type of underlying graph are imposed. In contrast to the work presented in [15] where the strategy of choosing the prototypes from a dataset is merely heuristic, we adopt well known prototype reduction schemes originally proposed for nearest neighbor classification [16].
In [17] the same idea was successfully used in order to apply dissimilarity-based classifiers to feature vectors. The strategies utilized in the present paper have in common that they are selective and that the number of prototypes is uncontrollable [18]. Selective prototype selection means that prototypes are chosen from an existing set and no new prototypes can be generated. This kind of prototype selection is necessary in the context of this paper because it is rather hard to create new prototype graphs based on a given graph set. Uncontrollable prototype selection implies that the number of prototypes is determined automatically and cannot be influenced by the user. This is a crucial advantage over previous methods where the prototypes and in particular their number are determined by means of the target classifier on a validation set. Consequently, we claim the following contribution in this paper. We show that well-known prototype reduction schemes, which have been explored for various purposes over the years [16], can be used as a prototype selection tool for graph embedding. In the experimental evaluation we empirically confirm that these schemes lead to both significant speed-ups of the validation process and improvements of the classification accuracy on four different graph data sets of quite diverse nature.
2
Basic Concepts and Notation
Generally, a graph g is given by a finite set of nodes V , a finite set of edges E, and their corresponding labeling functions. Let LV and LE be finite or infinite sets of labels for nodes and edges, respectively. Definition 1 (Graph). A graph g is defined by the four-tuple g = (V, E, μ, ν), where V is the finite set of nodes, E ⊆ V × V is the set of edges, μ : V → LV is the node labeling function, and ν : E → LE is the edge labeling function. The definition given above allows us to handle arbitrary graphs with unconstrained labeling functions. For example, the label alphabet can be given by the set of integers, the vector space Rn , or a set of symbolic labels. Moreover, unlabeled graphs are obtained by assigning the same label l to all nodes and edges. Edges are defined by pairs of nodes (u, v), where u ∈ V denotes the source node and v ∈ V the target node of a directed edge. Undirected graphs can be modeled by inserting a reverse edge (v, u) ∈ E for each edge (u, v) ∈ E with ν(u, v) = ν(v, u). Graph matching refers to the task of measuring the dissimilarity of graphs. One of the most flexible graph matching methods is the edit distance [9]. The key idea of graph edit distance is to define the dissimilarity, or distance, of graphs by the minimum amount of distortion that is needed to transform one graph into another. Compared to other approaches, graph edit distance is very flexible since it can handle arbitrary graphs and any type of node and edge labels. Furthermore, by defining costs for edit operations, the concept of edit distance can be tailored to specific applications.
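As an illustration of Definition 1, the following minimal sketch shows one possible in-memory representation of such a labeled graph; the Python layout (dictionaries standing in for the labeling functions μ and ν) is an assumption for illustration, not a data structure used by the authors.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Hashable

# Minimal sketch of the four-tuple g = (V, E, mu, nu) from Definition 1.
@dataclass
class Graph:
    nodes: Dict[Hashable, object] = field(default_factory=dict)                    # mu: V -> L_V
    edges: Dict[Tuple[Hashable, Hashable], object] = field(default_factory=dict)   # nu: E -> L_E

    def add_node(self, u, label=None):
        self.nodes[u] = label

    def add_edge(self, u, v, label=None, undirected=False):
        # Undirected graphs are modeled by inserting the reverse edge with the same label.
        self.edges[(u, v)] = label
        if undirected:
            self.edges[(v, u)] = label
```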
A standard set of distortion operations is given by insertions, deletions, and substitutions of both nodes and edges. Given two graphs, the source graph g1 and the target graph g2 , the idea of graph edit distance is to delete some nodes and edges from g1 , relabel (substitute) some of the remaining nodes and edges, and insert some nodes and edges in g2 , such that g1 is finally transformed into g2 . A sequence of edit operations e1 , . . . , ek that transform g1 into g2 is called an edit path between g1 and g2 . Obviously, for every pair of graphs (g1 , g2 ), there exist a number of different edit paths transforming g1 into g2 . Let Υ (g1 , g2 ) denote the set of all such edit paths. To find the most suitable edit path out of Υ (g1 , g2 ), one introduces a cost for each edit operation, measuring the strength of the corresponding operation. The idea of such cost functions is to define whether or not an edit operation represents a strong modification of the graph. Obviously, the cost function is usually defined with respect to the underlying node and edge label sets. Clearly, between two similar graphs (in terms of structure and labeling), there should exist an inexpensive edit path, representing low cost operations, while for graphs with a high degree of dissimilarity an edit path with high costs is needed. Consequently, the edit distance of two graphs is defined by the minimum cost edit path between two graphs.
Definition 2 (Graph Edit Distance). Assume that a graph domain G is given. Let g1 = (V1 , E1 , μ1 , ν1 ) ∈ G be the source graph and g2 = (V2 , E2 , μ2 , ν2 ) ∈ G be the target graph. The graph edit distance between g1 and g2 is defined by

    d(g1 , g2 ) = min_{(e1 ,...,ek ) ∈ Υ (g1 ,g2 )} Σ_{i=1}^{k} c(ei ),
where Υ (g1 , g2 ) denotes the set of edit paths transforming g1 into g2 , and c denotes the edit cost function measuring the strength c(ei ) of edit operation ei . Optimal algorithms for computing the edit distance of graphs are typically based on combinatorial search procedures that explore the space of all possible mappings of the nodes and edges of the first graph to the nodes and edges of the second graph [9]. A major drawback of those procedures is their computational complexity, which is exponential in the number of nodes of the involved graphs. However, a number of efficient suboptimal methods for graph edit distance computation have been proposed [19, 20].
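To make the edit distance concrete, here is a hedged sketch using networkx's generic graph_edit_distance; this is not the suboptimal bipartite algorithm of [20], and the unit-cost matching functions are placeholder assumptions rather than the cost models used in the paper.

```python
import networkx as nx

# Illustrative only: networkx's exact graph edit distance is exponential in the
# graph size; for the large graphs used later, a suboptimal method would be needed.
def edit_distance(g1: nx.Graph, g2: nx.Graph) -> float:
    return nx.graph_edit_distance(
        g1, g2,
        node_match=lambda a, b: a.get("label") == b.get("label"),
        edge_match=lambda a, b: a.get("label") == b.get("label"),
    )

g1 = nx.Graph(); g1.add_node(0, label="C"); g1.add_node(1, label="O"); g1.add_edge(0, 1, label=1)
g2 = nx.Graph(); g2.add_node(0, label="C"); g2.add_node(1, label="N"); g2.add_edge(0, 1, label=1)
print(edit_distance(g1, g2))  # one node substitution under unit costs -> 1.0
```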
3 Dissimilarity Space Embeddings Applied on Graphs
3.1 General Embedding Procedure
The idea underlying our method for graph embedding was originally developed for the problem of embedding sets of feature vectors in a dissimilarity space [13]. In [13] the authors claim that the concept of proximity is more fundamental than that of a feature or a class. Furthermore, it is pointed out that in the case of
structural data (like graphs) the extraction of numerical features may be difficult or even intractable, while proximity can directly be derived from the data using an adequate dissimilarity model. Assume we have a labeled set of sample graphs, G = {g1 , . . . , gN }, and a graph dissimilarity measure d(gi , gj ). Note that G can be any kind of graph set and d(gi , gj ) can be any kind of dissimilarity measure. After having selected a set P = {p1 , . . . , pn } ⊆ G, we compute the dissimilarity of a given input graph g to each prototype p ∈ P. Note that g can be an element of G or any other graph. This leads to n dissimilarities, d1 = d(g, p1 ), . . . , dn = d(g, pn ), which can be arranged in an n-dimensional vector (d1 , . . . , dn ). In this way we can transform any graph from the training set, as well as from any other graph set (for instance a validation or a test set of a classification problem), into a vector of real numbers.
Definition 3 (Graph Embedding). Let G be a finite or infinite set of graphs and P = {p1 , . . . , pn } ⊆ G be a set of prototypes. Then, the mapping ϕ_n^P : G → Rn is defined as the function ϕ_n^P(g) = (d(g, p1 ), . . . , d(g, pn )), where d(g, pi ) is any graph dissimilarity measure between graph g and the i-th prototype.
The embedding procedure proposed in this paper makes use of graph edit distance. Since the exact computation of graph edit distance is exponential in the number of nodes for general graphs, the complexity of the graph embedding procedure introduced in Def. 3 is exponential as well. However, as mentioned above, there exist a number of efficient approximation algorithms for graph edit distance computation (e.g. [20] with cubic time complexity). Consequently, given n predefined prototypes, the embedding of one particular graph is established by means of n distance computations in polynomial time.
3.2 Relationship to Graph Kernel Methods
Another idea to overcome the lack of algorithmic tools for graph classification, which is closely related to graph embedding procedures, is the use of kernel methods [4, 21, 22]. In recent years, kernel methods have become one of the most rapidly emerging sub-fields in intelligent information processing. The vast majority of work on kernel methods is concerned with transforming a given feature space into another one of higher dimensionality without computing the transformation explicitly for each individual feature vector. As a fundamental extension, the existence of kernels for symbolic data structures, especially for graphs, has been shown [7]. By means of suitable kernel functions, graphs can be implicitly mapped into vector spaces. Consequently, a large class of kernel machines for classification, most of them originally developed for feature vectors, becomes applicable to graphs.
Definition 4 (Graph Kernel). Let G be a finite or infinite set of graphs, gi , gj ∈ G, and ϕ : G → Rn a function with n ∈ N. A graph kernel function is a mapping κ : G × G → R such that κ(gi , gj ) = ⟨ϕ(gi ), ϕ(gj )⟩.
According to this definition a graph kernel function takes two graphs g1 and g2 as arguments and returns a real number that is equal to the result achieved by first mapping the two graphs by a function ϕ to a vector space and then computing the dot product ⟨ϕ(g1 ), ϕ(g2 )⟩ in the feature space. The kernel function κ(g1 , g2 ) provides us with a shortcut (kernel trick) that eliminates the need for computing ϕ(·) explicitly. Note that kernel machines make use of scalar products only and do not need the vectors of individual objects. It is well known that many classification algorithms can be kernelized, i.e. formulated in such a way that only scalar products of vectors are needed. Hence, applying a graph kernel provides us access to all these algorithms. Based on the graph embedding ϕ_n^P established above, one can define a valid graph kernel κ by computing the standard dot product of two graph maps in the resulting vector space

    κ(gi , gj ) = ⟨ϕ_n^P(gi ), ϕ_n^P(gj )⟩
Of course, not only the standard dot product can be used but any valid kernel function defined for vectors, e.g. an RBF kernel function

    κRBF (gi , gj ) = exp(−γ ||ϕ_n^P(gi ) − ϕ_n^P(gj )||²), where γ > 0.

In a recent book, graph kernels were proposed that directly use graph edit distances [8]. This approach turns the existing dissimilarity measure (graph edit distance) into a similarity measure by mapping low distance values to high similarity values and vice versa. To this end a simple monotonically decreasing transformation is used. Note the fundamental difference between such an approach and our embedding procedure. While in the former methodology the existing dissimilarity measure is turned into a similarity measure (i.e. a kernel value) and subsequently plugged into a kernel machine, the latter uses the dissimilarities to n prototypes as features for a new description of the underlying object. Therefore not only kernel machines, but also other non-kernelizable algorithms can be applied in conjunction with the proposed graph embedding method.
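A minimal sketch of the embedding ϕ_n^P and the kernels derived from it is given below; it assumes a generic graph dissimilarity function and uses scikit-learn's precomputed-kernel SVM, which is an illustrative choice rather than the toolchain used in the paper, and the parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def embed(graphs, prototypes, dist):
    # phi_n^P: one row per graph, containing the distances to the n prototypes.
    return np.array([[dist(g, p) for p in prototypes] for g in graphs])

def rbf_kernel_matrix(X, Y, gamma=0.1):
    # kappa_RBF(g_i, g_j) = exp(-gamma * ||phi(g_i) - phi(g_j)||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# Usage sketch with a precomputed kernel matrix (hypothetical names):
# X_train = embed(train_graphs, prototypes, edit_distance)
# K_train = rbf_kernel_matrix(X_train, X_train)
# clf = SVC(kernel="precomputed", C=10).fit(K_train, y_train)
# X_test = embed(test_graphs, prototypes, edit_distance)
# predictions = clf.predict(rbf_kernel_matrix(X_test, X_train))
```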
4
Prototype Selectors
One crucial question about the proposed graph embedding is how to define a set P of prototypes that leads to a good performance of the classifier in the feature space. Often, the prototype set P is defined as a subset of the training set of graphs T , i.e. P ⊆ T [15]. In [13–15] different prototype selection algorithms are discussed. These prototype selection strategies use some heuristics based on the underlying dissimilarities in the original graph domain. It was shown that
none of them is globally best, i.e. the quality of the selected prototypes and in particular their number depends on the underlying data set. Thus, both the selection strategy and the dimensionality are determined with the target classifier on a validation set. This procedure is known as the wrapper method [23]. As this validation procedure is time-consuming, a prototype selection method that does not need to be optimized on a validation set is highly desirable. In [16] a survey of prototype selection methods in conjunction with nearest neighbor classifiers is provided. These prototype selection methods aim at overcoming the three major drawbacks of nearest neighbor classifiers, viz. large storage requirement, large computational effort for distance evaluation, and sensitivity to outliers. These reduction schemes determine a subset P ⊆ T such that the elements in T (or at least a considerable part of them) are still correctly classified using a nearest neighbor classifier. That is, these reduction schemes reduce redundancy by not selecting similar graphs from the same class, and they find significant graphs which help to correctly classify the graphs in T . In fact, this particular selection paradigm seems appropriate in order to find a prototype set P for dissimilarity embedding. Remember that we use the distances to the prototypes as features for our vectorial description of the considered graphs. Similar objects yield a similar contribution to the vectorial representation of a graph [13]. Therefore, redundant graphs should be omitted. Moreover, it might be advantageous to select representative graphs related to each of the given classes such that the dissimilarity representation becomes most informative. As stated above, the reduction schemes for nearest neighbor classification account for both of these requirements. According to the taxonomy of prototype reduction schemes described in [18] we use selective prototype selectors where the number of prototypes is uncontrollable. These two constraints are motivated through the following considerations. First, the fact that we are dealing with graphs makes the creation of new prototypes quite difficult. For instance, whereas the creation of a weighted mean of two feature vectors is straightforward, the same operation on graphs is highly complex. Secondly, we want to bypass the time-consuming validation of the dimensionality of the resulting embedding space by means of the target classifier. Hence, we leave the determination of the number of prototypes to the prototype selection algorithm. In the next subsections the six prototype selection strategies used in this paper are reviewed and adapted to the domain of graphs. For all strategies we assume a training set of graphs T = {g1 , . . . , gN } is given. As output a prototype set P = {p1 , . . . , pn } ⊆ T is provided. All of our reduction schemes make use of a classifier of the nearest neighbor type (N N ). Condensing (Cond). The idea of condensing a training set T is to iteratively select graphs gi ∈ T as prototypes until all graphs from T are correctly classified using the respective prototypes [24]. As a disadvantage, this procedure depends on the order in which the graphs are processed.
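The following index-based sketch illustrates the condensing scheme on a precomputed pairwise distance matrix; it is an illustration under simplifying assumptions (distance matrix D and labels y as plain Python lists), not the exact implementation of [24].

```python
def condense(D, y):
    # Hart-style condensing: keep adding misclassified elements as prototypes
    # until all training elements are classified correctly by 1-NN.
    # The result depends on the processing order, as noted in the text.
    prototypes = [0]                      # start with an arbitrary element
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            nearest = min(prototypes, key=lambda p: D[i][p])   # 1-NN over prototypes
            if y[nearest] != y[i] and i not in prototypes:
                prototypes.append(i)      # misclassified -> keep it as a prototype
                changed = True
    return prototypes
```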
Modified Condensing (mCond). In order to overcome the order dependency of the condensing method, a modified condensing procedure can be applied [25]. In this scheme the prototypes are selected iteratively from the training set T . We start with a basic set of prototypes containing one graph per class. To this end, we use the set center graph (centroid ) of a given graph set G. The centroid of G is the graph for which the maximum distance to all other graphs in G is minimal. Formally, the centroid is defined as

    centroid(G) = argmin_{g ∈ G} max_{g′ ∈ G} d(g, g′)
For each class, the centroid is selected as the first prototype. Then the graphs from T are classified by means of this initial set of prototypes. Using only the misclassified graphs, the class centroids are computed and subsequently added to the existing set of prototypes. This procedure is repeated until all graphs from T are correctly classified. Both condensing methods (Cond and mCond) possess the property that the number of prototypes selected from a particular class depends on the variation of the graphs in the respective class. That is, if graphs from a certain class differ only little in their structure and labels, only a few prototypes are necessary to correctly classify all graphs from this class. Yet, the higher the variability of the graphs in a given class, the more prototypes are selected from that class. Editing (Edit). The basic idea of editing a training set T is to delete outliers from T [26]. For this purpose, we classify each graph gi from T with a 3-N N classifier. If gi is misclassified we assume that this particular graph is an outlier and therefore should not be included in the prototype set. Reducing (Red). The idea of reducing builds upon condensing [27]. First, the training set T is condensed to a prototype set P (according to Cond). Next, each prototype pi is iteratively removed from P. The training graphs are then classified using the reduced prototype set P \ {pi }. If all graphs are classified correctly with this reduced prototype set, the respective prototype is useless and can therefore be omitted. Otherwise, the prototype is necessary and therefore kept in P. Merging (Merg). The basic idea of merging [28] a training set is to define two graph sets P and Q, where initially P is empty and Q contains all training graphs from T . First, an arbitrary graph from Q is selected as prototype, i.e. moved from Q to P. Next, we consider the two closest graphs p and q from P and Q, respectively. If the class of p is not the same as that of q, q is moved from Q to P. Otherwise, p and q are merged to p∗ ∈ Q, where p∗ minimizes the sum of distances to p and q. The accuracy of the N N classifier using P ∪ {q} is then compared with the accuracy when P \ {p} ∪ {p∗ } is used as prototype set. Whenever the former outperforms the latter, q is moved from Q to P. Otherwise, p and q are removed from P and Q, respectively, and p∗ is moved from Q to P. This procedure is repeated until no graphs are left in Q.
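As a small illustration of the centroid used by the modified condensing scheme, the sketch below computes the set center of each class from a precomputed distance matrix; the data layout and function names are assumptions for illustration, not the authors' implementation.

```python
def centroid(D, members):
    # Set centroid: the element whose maximum distance to the other members is minimal.
    return min(members, key=lambda g: max(D[g][h] for h in members))

def initial_prototypes(D, y):
    # One centroid per class, as the starting prototype set of mCond.
    classes = sorted(set(y))
    return [centroid(D, [i for i, c in enumerate(y) if c == cls]) for cls in classes]
```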
Selecting (Sel). In [29] another algorithm for reducing the training set T is introduced. This approach is based on the idea of related neighbors. We define gj ∈ T as a related neighbor of gi ∈ T if gi and gj are of the same class and gj is nearer to gi than any sample gk ∈ T from another class. We denote by Ri the set of all related neighbors of gi . Formally,

    Ri = {gj ∈ T | ωi = ωj and d(gi , gj ) < min_{gk ∈ T , ωi ≠ ωk} d(gi , gk )},

where ωi denotes the class of the i-th graph from T . The selection of the prototypes is now stated as finding a small number of graphs such that each graph in T has at least one related neighbor among the selected ones. In [29] a procedure is given that finds the minimal such set. In the present paper a greedy algorithm is employed that seeks a small number of prototypes.
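A possible greedy selection in the spirit of the Sel scheme is sketched below; the exact greedy criterion used by the authors is not specified, so the cover-the-most-elements heuristic shown here is an assumption.

```python
def greedy_select(related):
    # `related` maps each index i to its set R_i of related neighbors.
    # Elements with an empty R_i could never be covered and are skipped.
    uncovered = {i for i, R in related.items() if R}
    candidates = {j for R in related.values() for j in R}
    prototypes = []
    while uncovered:
        # pick the candidate that is a related neighbor of most uncovered elements
        best = max(candidates, key=lambda j: sum(1 for i in uncovered if j in related[i]))
        prototypes.append(best)
        uncovered = {i for i in uncovered if best not in related[i]}
    return prototypes
```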
5
Experimental Evaluation
In this section we provide the results of an experimental evaluation of the proposed embedding procedure. For graph edit distance computation the suboptimal algorithm introduced in [20] has been used. This graph edit distance algorithm shows superior performance in time and accuracy compared to other suboptimal algorithms. The classifier used in the vector space is the support vector machine (SVM) [21]. Of course, any other classifier could be used for this purpose as well. 5.1
Data Sets
The pattern classification tasks considered in this paper involve a total of four different graph data sets. Note that all of these data sets are publicly available [30] (http://www.iam.unibe.ch/fki/databases/iam-graph-database). Each of our graph sets is divided into three disjoint subsets, viz. a training, a validation, and a test set. The first database used in the experiments consists of graphs representing distorted letter line drawings (Letter). In this experiment we consider the 15 capital letters of the Roman alphabet that consist of straight lines only (A, E, F, ...). To obtain large sample sets of drawings, distortion operators are applied to prototype drawings. This results in randomly shifted, removed, and added lines. These drawings are then converted into graphs by representing lines by edges and ending points of lines by nodes. Each node is labeled with a two-dimensional attribute giving its position. The second graph set is constructed from the AIDS Antiviral Screen Database of Active Compounds [31] (AIDS). Graphs from this data set represent molecules from two classes (active, inactive), indicating whether or not a molecule shows activity against HIV. The molecules are converted into graphs in a straightforward manner by representing atoms as nodes and the covalent bonds as edges. Nodes
are labeled with the number of the corresponding chemical symbol and edges by the valence of the linkage. The third data set is given by graphs representing fingerprint images of the NIST-4 database [32] out of the four classes arch, left, right, and whorl (Fingerprint). We construct graphs from fingerprint images by extracting characteristic regions in a fingerprint and converting the result into an attributed graph [8]. Each node is labeled with a two-dimensional attribute giving its position and edges are attributed with their respective angle to the horizontal axis. The last data set consists of graphs representing webpages [33] that originate from 20 different categories (Business, Health, Politics, . . .) (Webpage). For the graphs used in our experiments, all words occurring in the web document (except for stop words) are converted into unique nodes in the resulting web graph. If a word wi immediately precedes word wi+1 , a directed edge between the corresponding nodes is inserted in the graph. Note that the graph datasets used in our experiments are of quite different nature, coming from a variety of applications. In Fig. 1 a sample image of each data set is given. Furthermore, the graph sets differ in their characteristics, such as the number of available graphs (|G|), the number of different classes (|Ω|), and the average and maximum number of nodes and edges per graph (∅|V |, ∅|E|, max|V |, max|E|). In Table 1 a summary of all graph datasets and their corresponding characteristics is given.
Fig. 1. Example images from all data sets: (a) Letter, (b) AIDS, (c) Fingerprint, (d) part of a webgraph (images not reproduced)
Table 1. Graph dataset characteristics
Database       |G|   |Ω|   ∅|V|    ∅|E|   max|V|  max|E|
Letter        2250    15    4.7     4.5       9       9
AIDS          2000     2    9.5    10.0      85     328
Fingerprint   2800     4    5.4     4.4      26      24
Webpage       2340    20  186.1   104.6     834     596
5.2
Reference Systems
Three reference systems are used for comparison with the proposed graph embedding procedure. The first is a similarity kernel directly derived from the edit distances [8] (referred to as GED). That is, for this reference
method no explicit graph embedding is conducted but the dissimilarities are merely turned into kernel values κ(gi , gj ) = −d(gi , gj )². The second reference system interprets all distances to the whole training set T as the vectorial description, i.e. the graphs are explicitly embedded but no attempt is made to reduce the number of prototypes (referred to as All). Finally, the last reference method is the heuristic prototype selector applied in [15] (referred to as k-med). This prototype selector is based on k-medoids clustering. After application of the clustering procedure, the cluster centers are selected as prototypes. (Hence, the parameter k of the clustering is equal to the number of prototypes selected.) Similarly to our novel approach, all reference systems make use of an SVM. Therefore the weighting parameter C has to be validated for all methods. The parameter C controls whether the maximization of the margin or the minimization of the error is more important. For the second and third reference systems, where the graphs are embedded in a real vector space by means of ϕ_n^P, an SVM with RBF kernel is used. Hence, the kernel values are given by κRBF (gi , gj ) = exp(−γ ||ϕ_n^P(gi ) − ϕ_n^P(gj )||²), where γ > 0 denotes the second meta parameter to be tuned. For the third reference system the SVM optimization is performed on a validation set for every possible dimension of the target space. The parameter combination that results in the lowest classification error on the validation set is finally applied to the independent test set.
5.3 Results and Discussion
In Table 2 the classification results of all reference systems and of the proposed approach using all prototype reduction schemes are given. Note that the best result per data set is marked with an asterisk. For the proposed prototype reduction schemes, the fraction of retained training elements is indicated in brackets.
Table 2. Experimental Results
                       Ref. System                      Proposed Method
Data Set      GED      All     k-med   Cond          mCond         Edit          Red           Merg           Sel
Letter       92.27    91.73    92.40   91.73 (0.34)  91.47 (0.27)  92.00 (0.89)  92.00 (0.27)  92.53 (0.40)*  92.00 (0.27)
AIDS         93.60    97.20    97.73   97.13 (0.20)  97.53 (0.18)  97.20 (0.94)  97.20 (0.17)  98.27 (0.23)*  97.20 (0.12)
Fingerprint  79.35    82.10    77.70   81.60 (0.55)  81.75 (0.54)  81.70 (0.45)  82.10 (0.53)  82.80 (0.62)*  82.30 (0.60)
Webpage      84.62*   82.44    81.92   82.44 (0.36)  83.21 (0.35)  76.92 (0.81)  82.95 (0.33)  81.15 (0.42)   81.03 (0.26)
Comparing the results of our novel approach with the results achieved by the first reference system (GED), we observe the following. On the Webpage data the first reference method outperforms all other systems. On this particular data set, the transformation of the edit distances into kernel values seems to be the
best choice. However, on the three remaining data sets it is beneficial to use the embedding approach rather than the direct transformation. On the Letter data the SVM based on the merged prototypes performs better than the first reference system, and on the AIDS and Fingerprint data all reduction schemes even lead to better classification accuracies than the similarity kernel. Note that 12 out of 13 improvements, but only 5 out of 11 deteriorations, compared to the first reference system are statistically significant (using a Z-test with α = 0.05). Regarding the results achieved by the second reference system (All), we observe that our approach using prototype reduction outperforms this reference system on all data sets (at least with one of the proposed reduction schemes). Hence, besides the speed-up in computation, it is beneficial for achieving a higher recognition rate to use prototype reduction for embedding rather than using the whole training set as prototypes. Finally, comparing our approach with the procedure proposed in [15] (k-med), we observe that the former outperforms the latter on all data sets with at least one reduction scheme per data set. Note, furthermore, that compared to k-med our novel approach features the crucial advantage that we do not have to validate the dimensionality of the resulting vector space with the target classifier. Assuming that the dimensionality n is incrementally validated with an interval size of t, the speed-up factor for the validation procedure is roughly N/t, where N is the size of the training set T . Comparing the prototype reduction schemes against each other, one can conclude that the merging approach performs generally best. On three out of four data sets this reduction scheme leads to the overall best classification result (including the reference methods). On the other hand, condensing and modified condensing lead on three data sets to the lowest recognition rate among the prototype reduction approaches. In summary, one can draw the following conclusions. An explicit embedding of the graphs into a vector space using a dissimilarity representation is advantageous compared to turning dissimilarities into similarity values (GED). Rather than using all available training graphs as prototypes (All), the selection of a few representative graphs leads to further improvements of the classification accuracy. Using the proposed prototype reduction schemes rather than a heuristic selection of the prototypes (k-med) is beneficial for two reasons. First, the cumbersome validation of the embedding space dimensionality can be omitted. Secondly, at least one of the proposed reduction schemes outperforms the heuristic selection on all data sets.
6
Conclusions
For objects given in terms of feature vectors a rich repository of algorithmic tools for classification has been developed over the past years. Graphs are a versatile alternative to feature vectors, and are known to be a powerful and flexible representation formalism. The representational power of graphs is due to their ability to represent not only feature values but also relationships among
different parts of an object, while their flexibility comes from the fact that there are no size or labeling constraints that restrict the representation of a given object. However, graph based object representation suffers from the lack of mathematical structure in the graph domain. Hence, in contrast to vectors, most of the basic mathematical operations required for classification do not exist for graphs. Graph embedding offers an elegant solution to this problem. In the present paper a novel approach to graph embedding using prototypes and dissimilarities is proposed. Our embedding procedure explicitly makes use of graph edit distance and can therefore deal with various kinds of graphs (labeled, unlabeled, directed, undirected, etc.). The basic idea of the embedding method is to describe a graph by means of n dissimilarities to a predefined set of graphs termed prototypes. In contrast to previous work, where both the prototypes and their number are determined by using the target classifier on a validation set (wrapper approach), in the present paper six prototype reduction schemes originally developed for nearest neighbor classification are employed for the task of prototype selection. We use selective and uncontrollable prototype selectors, such that no new prototype graphs have to be generated, and the number of prototypes is determined by the prototype selection algorithm. In several experiments a high degree of robustness and flexibility of the proposed approach is empirically verified. On four graph sets the classification accuracy of the proposed embedding kernel is compared to a similarity kernel, to an embedding procedure without prototype selection, and to an embedding method based on heuristic prototype selection. Summarizing the results of the experiments, the following main findings can be stated. First, using dissimilarities to prototypes as features rather than the original graph edit distances (transformed into a similarity kernel) is clearly advantageous. Secondly, reducing the number of prototypes rather than using the whole training set as prototypes leads to improvements in classification accuracy as well. Thirdly, the proposed prototype reduction schemes for graph embedding are attractive compared to heuristic prototype selections as there is no need to validate the number of prototypes by means of a wrapper. Finally, we observe that on three out of four data sets, the merging approach prevails over the other ones. Hence, there is a clear tendency that this prototype reduction scheme is a good choice for graph embedding by means of dissimilarities and prototypes.
Acknowledgements This work has been supported by the Swiss National Science Foundation (Project 200021-113198/1).
References 1. Perner, P. (ed.): MLDM 2007. LNCS (LNAI), vol. 4571. Springer, Heidelberg (2007) 2. Perner, P. (ed.): ICDM 2006. LNCS (LNAI), vol. 4065. Springer, Heidelberg (2006)
3. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)
4. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
5. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern recognition. Int. Journal of Pattern Recognition and Artificial Intelligence 18(3), 265–298 (2004)
6. Cook, D., Holder, L. (eds.): Mining Graph Data. Wiley-Interscience, Hoboken (2007)
7. Gärtner, T.: Kernels for Structured Data. World Scientific, Singapore (2008)
8. Neuhaus, M., Bunke, H.: Bridging the Gap Between Graph Edit Distance and Kernel Machines. World Scientific, Singapore (2007)
9. Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recognition Letters 1, 245–253 (1983)
10. Luo, B., Wilson, R., Hancock, E.: Spectral embedding of graphs. Pattern Recognition 36(10), 2213–2223 (2003)
11. Wilson, R., Hancock, E., Luo, B.: Pattern vectors from algebraic graph theory. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(7), 1112–1124 (2005)
12. Robles-Kelly, A., Hancock, E.: A Riemannian approach to graph embedding. Pattern Recognition 40, 1024–1056 (2007)
13. Pekalska, E., Duin, R.: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific, Singapore (2005)
14. Spillmann, B., Neuhaus, M., Bunke, H., Pekalska, E., Duin, R.: Transforming strings to vector spaces using prototype selection. In: Yeung, D.Y., Kwok, J., Fred, A., Roli, F., de Ridder, D. (eds.) SSPR 2006 and SPR 2006. LNCS, vol. 4109, pp. 287–296. Springer, Heidelberg (2006)
15. Riesen, K., Neuhaus, M., Bunke, H.: Graph embedding in vector spaces by means of prototype selection. In: Escolano, F., Vento, M. (eds.) GbRPR 2007. LNCS, vol. 4538, pp. 383–393. Springer, Heidelberg (2007)
16. Bezdek, J., Kuncheva, L.: Nearest prototype classifier designs: An experimental study. Int. Journal of Intelligent Systems 16(12), 1445–1473 (2001)
17. Kim, S., Oommen, B.: On using prototype reduction schemes to optimize dissimilarity-based classification. Pattern Recognition 40, 2946–2957 (2006)
18. Kim, S., Oommen, B.: A brief taxonomy and ranking of creative prototype reduction schemes. Pattern Analysis and Applications 6, 232–244 (2003)
19. Neuhaus, M., Riesen, K., Bunke, H.: Fast suboptimal algorithms for the computation of graph edit distance. In: Yeung, D.Y., Kwok, J., Fred, A., Roli, F., de Ridder, D. (eds.) SSPR 2006 and SPR 2006. LNCS, vol. 4109, pp. 163–172. Springer, Heidelberg (2006)
20. Riesen, K., Bunke, H.: Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing (2008) (accepted for publication)
21. Vapnik, V.: Statistical Learning Theory. John Wiley, Chichester (1998)
22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
23. Kohavi, R., John, G.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
24. Hart, P.: The condensed nearest neighbor rule. IEEE Trans. on Information Theory 14(3), 515–516 (1968)
25. Susheela Devi, V., Murty, M.: An incremental prototype set building technique. Pattern Recognition 35(2), 505–513 (2002)
26. Devijver, P.A., Kittler, J.: On the edited nearest neighbor rule. In: Proc. 5th Int. Conf. on Pattern Recognition, pp. 72–80 (1980) 27. Gates, G.W.: The reduced nearest neighbor rule. IEEE Transactions on Information Theory 18, 431–433 (1972) 28. Chang, C.L.: Finding prototypes for nearest neighbor classifiers. IEEE Trans. on Computers 23(11), 1179–1184 (1974) 29. Ritter, G., Woodruff, H., Lowry, S., Isenhour, T.: An algorithm for a selective nearest neighbor decision rule. IEEE Trans. on Information Theory 21(6), 665–669 (1975) 30. Riesen, K., Bunke, H.: IAM graph database repository for graph based pattern recognition and machine learning. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 287–297. Springer, Heidelberg (2008) (accepted for publication) 31. DTP, D.T.P.: AIDS antiviral screen (2004), http://dtp.nci.nih.gov/docs/aids/aids_data.html 32. Watson, C., Wilson, C.: NIST Special Database 4, Fingerprint Database. National Institute of Standards and Technology (1992) 33. Schenker, A., Bunke, H., Last, M., Kandel, A.: Graph-Theoretic Techniques for Web Content Mining. World Scientific, Singapore (2005)
Using Graph-Kernels to Represent Semantic Information in Text Classification
Teresa Gonçalves and Paulo Quaresma
Departamento de Informática, Universidade de Évora, 7000-671 Évora, Portugal
{tcg,pq}@di.uevora.pt
Abstract. Most text classification systems use a bag-of-words representation of documents to find the classification target function. Linguistic structures such as morphology, syntax and semantics are completely neglected in the learning process. This paper proposes a new document representation that includes context-independent sentence meaning and can be used by a structured kernel function, namely the direct product kernel. The proposal is evaluated using a dataset of articles from a Portuguese daily newspaper, and classifiers are built using the SVM algorithm. The results show that this structured representation, while describing only part of the document's meaning, has the same discriminative power over the classes as the traditional bag-of-words approach.
1
Introduction
Current Information Technologies and Web-based services need to manage, select and filter increasing amounts of textual information. Text classification allows users, through navigation on class hierarchies, to browse more easily the texts of their interest. This paradigm is very effective both in filtering information and in the development of online end-user services. As the number of documents involved in these applications is large, efficient and automatic approaches to classification are necessary. A Machine Learning approach can be used to automatically build the classifiers. The construction process can be seen as a problem of supervised learning: the algorithm receives a relatively small set of labelled documents and generates the classifier. However, as learning algorithms do not directly interpret digital documents, a compact representation of their content is required. The most common approach, called bag-of-words, uses a statistical representation of the document, counting its words in some way. Language structures (such as syntax and semantics) typical of natural language documents are completely neglected. To assess the value of semantic information in text classification we developed a structured document representation that includes the document's logical form (context-independent sentence meaning) and is capable of being used by kernel methods. The semantic information is represented in a graph structure using Discourse Representation Structures (DRS) from Discourse Representation Theory [13]
(DRT). Learners are built using the Support Vector Machine (SVM) algorithm, since it supports structured representations and is known to produce good results on text classification tasks [12]. This setup is applied to a dataset of articles from a Portuguese newspaper and the results are compared to common document representation techniques that use the bag-of-words approach. This paper is organised as follows: Section 2 presents the concepts related to linguistic information and automatic text classification, Section 3 introduces the novel document representation, Section 4 describes the experiments and displays the results, while Section 5 evaluates them. Conclusions and future work are pointed out, respectively, in Sections 6 and 7.
2
Concepts and Tools
This section introduces the concepts employed in this work: the representation of linguistic information and kernel functions for structured data. It concludes by presenting the software tools used. 2.1
Linguistic Information
The Portuguese language is morphologically rich: while nouns and adjectives have 4 forms (two genders – masculine and feminine – and two numbers – singular and plural), a regular verb has 66 different forms (two numbers, three persons – 1st, 2nd and 3rd – and five modes – indicative, conjunctive, conditional, imperative and infinitive – each with a number of tenses ranging from 1 to 5). Representation. Morphological information includes the word's stem and its morphological features, such as grammatical class and inflection. While some natural language processing tasks use the word's stem, others use its lemma. Most syntactic language representations are based on the context-free grammar (CFG) formalism introduced by [4] and, independently, by [1]: given a sentence, it generates the corresponding syntactic structure. It is usually represented by a tree structure, known as the sentence's parse tree, that contains its constituent structure (such as noun and verb phrases) and the words' grammatical classes. On the other hand, some semantic information can be obtained from context-independent sentence meaning by examining the words' meanings and combining them. This information can be produced directly from the sentence's syntactic structure, and is named the sentence's logical form. Discourse Representation Theory [13] (DRT) is a dynamic semantic theory that uses a language over Discourse Representation Structures (DRS) to represent context-dependent meaning. A simple DRS is a pair of a set of discourse referents U, known as its universe, and a set of conditions C. In its general form, a DRS is represented as a pair ⟨U, C⟩. Intuitively, the universe collects the discourse entities, while the conditions express restrictions over these entities (properties, relations). Figure 1 shows the DRS graphical notation for the sentence “Mother observes her daughter playing with the doll”. There are three referents, x, y and z, and
[x, y, z | mother(x), observe(x, y), daughter(y), play(y, z), doll(z)]
Fig. 1. DRS for the sentence “Mother observes her daughter playing with the doll”
five conditions over them: x refers to the mother, y to the daughter, and z to the doll, while the remaining conditions represent the actions observe (with x as subject and y as object) and play (with y as subject and z as object). 2.2
Kernel Functions
Most approaches to text classification use the basic vector space model (VSM) to represent documents. The simplest measure, which takes into account word frequency in each document, can be naturally reinterpreted as a kernel method [12]. Normalisation and term reduction approaches can also be interpreted as kernel functions [20], and other standard kernels (like the polynomial one) apply non-linear transformations to the usual VSM approach. The convolution kernel [9] is the most well-known kernel for structured objects. A structured object is an object formed by the composition of simpler components; frequently, these components are, recursively, simpler objects of the same type. This is the case for strings, trees and graphs. The convolution kernel definition is based on kernels defined over the structure's components. Tree kernels. For tree-structured objects, the feature space is indexed by subtrees and similarity is based on counting common subtrees. Examples of tree kernels are the subtree kernel [21] and the subset tree kernel [5]. Intuitively, the first counts all common subtrees that extend from a node n down to the leaves, while the second adds to that count the subtrees in which internal nodes may also act as leaves. Using some syntactic structures (and possibly some background knowledge) these kernels have produced good results on parse tree ranking [5], predicate argument classification [3] and question classification [23]. Graph kernels. The application of kernels to graph-structured data was introduced independently by [7] and [14]. Conceptually they are based on measures over graph walks with common labels: the first counts walks with common initial and final labels and the second calculates probabilities of equal label sequences on random walks. Gärtner et al. [8] proved that computing a kernel function able to completely recognise graph structure is NP-hard and introduced a walk-based kernel function that can be computed in polynomial time and includes both previous kernels as special cases. This kernel, known as the product graph kernel, is based on the concept of the direct product graph and counts walks on that graph. Product
graphs are a tool from discrete mathematics [10], and the direct product graph is among the four most important ones. 2.3
Tools
Next we briefly present the linguistic information extraction tools and the learning software used in this study. Information extraction. POLARIS, a lexical database [15], was used to retrieve the lemmas of Portuguese words, and the PALAVRAS parser [2], developed in the context of the VISL project by the Institute of Language and Communication of the University of Southern Denmark, was used to obtain sentence parse trees. The SIN2SEM tool [18] transforms a PALAVRAS parse tree into a DRS that reveals the sentence's logical form. The structure is represented by a two-term Prolog predicate: the referents and a set of conditions over them. Although the representation theory considers a sentence's meaning in the context where it is produced, the SIN2SEM tool builds DRSs considering each sentence independently. Learning software. For the traditional bag-of-words experiments we used WEKA, which implements the sequential minimal optimisation algorithm [17]. WEKA [22] is a software package, developed at New Zealand's Waikato University, implementing a large collection of Machine Learning algorithms. For the semantic information experiments we used SVMlight-TK [16]. This software is an extension of SVMlight [11] that uses convolution kernels over tree structures.
3
Document’s Logical Form Representation
In order to use a kernel method like the SVM algorithm, it is necessary to define a specific kernel or to adapt the problem representation to an existing one. This section describes the adaptation of a DRS to a directed graph, enabling the document's logical form to be used by kernel methods through the direct product kernel. 3.1
DRS as a Directed Graph
A DRS can be transformed into a directed graph by mapping the DRS's basic units (referents and conditions) into the graph's basic ones (vertices and edges) and defining how they interconnect:
– a referent becomes a vertex labelled by the referent's name;
– a condition becomes a vertex labelled by the condition's name and a set of edges that represent the referents' restrictions. Edges are directed from the condition to the restricted referent vertices and are labelled with a number indicating the referent's position in the condition.
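A minimal sketch of this mapping, using networkx and a simple (referents, conditions) encoding of a DRS, is shown below; the data layout is illustrative and does not correspond to the actual SIN2SEM output format.

```python
import networkx as nx

def drs_to_graph(referents, conditions):
    # A DRS is given as (referents, conditions); each condition is (name, [referents]).
    g = nx.DiGraph()
    for r in referents:
        g.add_node(r, label=r)                       # referent vertex
    for idx, (name, args) in enumerate(conditions):
        c = f"{name}#{idx}"                          # condition vertex (unique id, label = name)
        g.add_node(c, label=name)
        for pos, r in enumerate(args, start=1):
            g.add_edge(c, r, label=pos)              # edge labelled with the argument position
    return g

# DRS of Fig. 1: "Mother observes her daughter playing with the doll"
g = drs_to_graph(["x", "y", "z"],
                 [("mother", ["x"]), ("observe", ["x", "y"]),
                  ("daughter", ["y"]), ("play", ["y", "z"]), ("doll", ["z"])])
```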
Fig. 2. Directed graph representation of a DRS
Using this map, the DRS from Figure 1 becomes the directed graph depicted in Figure 2. Nevertheless, this direct mapping does not allow the use of the direct product kernel, since SIN2SEM analyses each sentence individually (generating different referents for each one) and does not perform any posterior unification. From the definition of the direct product graph:
– its vertex set comprises only common vertices; using SIN2SEM, it will only be made up of those vertices that represent common conditions;
– its edge set comprises the common edges between vertices of the direct product graph; since all edges connect condition vertices to referent vertices, and there are no common referents between two different DRSs, there are no edges in the direct product graph.
Consequently, the adjacency matrix of a direct product graph composed only of a set of vertices (with no edges) would always be the zero matrix. 3.2
Referent Substitution
Although the direct mapping does not allow the use of the direct product kernel, this restriction disappears once the appropriate kind of referent unification/substitution is performed. The substitutions are related to conditions associated with proper nouns and properties (atomic conditions). Proper nouns. A proper noun y associated with an entity x is represented by a condition name(x, y). By replacing all instances of a referent constrained by the name y with y itself (and removing the name condition) we obtain a corpus-level referent unification connecting all referents that refer to the same proper noun y. Formally, using the general DRS representation, this replacement is given by ⟨U, C⟩ −→ ⟨U′, C′⟩
where
U′ = U − {x : ∃y name(x, y) ∈ C}
C′ = C − {name(x, y) : name(x, y) ∈ C}
       − {c : c ∈ C ∧ (∃ name(x, y) ∈ C : x ∈ args(c))}
       ∪ {d : (∃ c ∈ C, ∃ name(x, y) ∈ C : x ∈ args(c) ∧ pred(d) = pred(c) ∧ args(d) = subst(args(c), x, y))}
where pred(c) returns the name of condition c, args(c) returns the referent list of c, and subst(l, a, b) is the function that substitutes all occurrences of a in l by b. Figure 3 shows an example with two DRSs corresponding to the sentences “Mother observes her daughter Carlota” and “Daughter Carlota plays with the doll”.
[w, x | mother(w), daughter(x), name(x, Carlota), observe(w, x)]  →  [w | mother(w), daughter(Carlota), observe(w, Carlota)]
[y, z | daughter(y), name(y, Carlota), doll(z), play(y, z)]  →  [z | daughter(Carlota), doll(z), play(Carlota, z)]
Fig. 3. DRS representation before and after proper noun replacement
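The proper-noun replacement can be sketched as follows on the same simple (referents, conditions) encoding used earlier; this is an illustrative approximation of the formal definition above, not the authors' implementation.

```python
def substitute_proper_nouns(referents, conditions):
    # x -> the proper noun naming it, taken from the name(x, y) conditions
    names = {args[0]: args[1] for (pred, args) in conditions if pred == "name"}
    new_refs = [r for r in referents if r not in names]
    new_conds = []
    for pred, args in conditions:
        if pred == "name":
            continue                                   # drop the name condition itself
        new_conds.append((pred, [names.get(a, a) for a in args]))
    return new_refs, new_conds

# Fig. 3, first DRS: "Mother observes her daughter Carlota"
refs, conds = substitute_proper_nouns(
    ["w", "x"],
    [("mother", ["w"]), ("daughter", ["x"]), ("name", ["x", "Carlota"]), ("observe", ["w", "x"])])
# -> (["w"], [("mother", ["w"]), ("daughter", ["Carlota"]), ("observe", ["w", "Carlota"])])
```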
Properties. A property associated with an entity x is represented by an atomic condition prop(x). By replacing all instances of this referent by prop (and removing the prop condition) we obtain a sentence-level unification. Since the same referent can be restricted by more than one property, we build a list of the referent's properties and use it in the replacement. Formally, using the general DRS representation, this replacement is given by ⟨U, C⟩ −→ ⟨U′, C′⟩ where
U′ = U − {x : (∃c ∈ C : args(c) = [x])}
C′ = C − {c : c ∈ C ∧ #args(c) = 1}
       − {d : d ∈ C ∧ (∃x ∈ U, ∃c ∈ C : args(c) = [x] ∧ x ∈ args(d))}
       ∪ {e : (∃d ∈ C, ∃x ∈ U, ∃c ∈ C : args(c) = [x] ∧ x ∈ args(d) ∧ pred(e) = pred(d) ∧ args(e) = subst(args(d), x, y) ∧ pred(c) ∈ y)}
where pred(c) returns the name of condition c, args(c) and #args(c) return, respectively, the referent list of c and its number of referents, y denotes the substitution set of properties restricting x, and subst(l, a, b) is the function that substitutes all occurrences of a in l by b. This replacement can only be made at sentence level. Figure 4 shows it for the sentence “Mother observes her daughter playing with the doll”.
[x, y, z | mother(x), observe(x, y), daughter(y), play(y, z), doll(z)]  →  [ | observe({mother}, {daughter}), play({daughter}, {doll})]
Fig. 4. DRS representation before and after property replacement
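Analogously, a sketch of the property replacement on the same encoding; again, the data layout and the use of sets for the substitution lists are illustrative assumptions.

```python
from collections import defaultdict

def substitute_properties(referents, conditions):
    # Each referent restricted by one-argument conditions is replaced by the set
    # of those property names; the atomic conditions themselves are dropped.
    props = defaultdict(set)
    for pred, args in conditions:
        if len(args) == 1:
            props[args[0]].add(pred)
    new_refs = [r for r in referents if r not in props]
    new_conds = [(pred, [frozenset(props[a]) if a in props else a for a in args])
                 for pred, args in conditions if len(args) > 1]
    return new_refs, new_conds

# Fig. 4: "Mother observes her daughter playing with the doll"
print(substitute_properties(
    ["x", "y", "z"],
    [("mother", ["x"]), ("observe", ["x", "y"]), ("daughter", ["y"]),
     ("play", ["y", "z"]), ("doll", ["z"])]))
# observe gets arguments {mother}, {daughter}; play gets {daughter}, {doll}
```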
Proper noun and Property. Since the same referent x can be restricted by name(x,y) and prop(x) conditions, it is always necessary to replace a referent by a list. Figure 5 displays the DRS replacement when applying proper noun and property substitutions to the sentence “Mother observes her daughter Carlota that plays with the doll”.
[x, y, z | mother(x), observe(x, y), daughter(y), name(y, Carlota), play(y, z), doll(z)]  →  [ | observe({mother}, {daughter, Carlota}), play({daughter, Carlota}, {doll})]
Fig. 5. DRS representation before and after proper noun and property replacement
3.3
DRS Representation
The replacements just described allow the use of the direct product kernel. With them, the DRS basic units (referents and conditions) are transformed into the graph's basic ones (vertices and edges) as follows:
– a condition becomes a vertex labelled by the condition's name and a set of edges. Edges are directed from the condition to its referent vertices and are labelled with a number indicating the referent's position in the condition;
– a referent becomes
  • a vertex labelled by the referent's name and with out-degree equal to zero, or
  • a sub-graph composed of a vertex with a proper label (equal for all sub-graphs that represent a referent) and a set of edges connecting that vertex to each element of the substitution set;
– each element of the substitution set becomes a vertex labelled by the property or proper noun identifier that restricts the referent represented by the substitution set. These vertices have out-degree equal to zero.
Fig. 6. Directed graph representation of a DRS with referent substitution
3.4
Document Representation
Using DRT, document’s logical form is given by the DRSs set that describes it. In this way, document’s representation is given by the graph constituted by its DRSs. Nevertheless, since some substitutions are made at sentence level, it is possible to have some sub-graph replication in different DRSs. For the proper use of the direct product kernel, it is necessary to remove those replications by changing edges that come from different DRSs to the same subgraph (and removing the replications).
4
Experiments
This section introduces the dataset used, describes the experimental setup and presents the results obtained for the traditional bag-of-words and semantic representations. 4.1
Dataset Description
Público is a Portuguese daily newspaper and the Publico corpus contains its 1994 and 1995 news, taken from 9 different sections (used as semantic classes). It totals
101646 documents, where there are 282657 distinct words, and, on average, 512 running words (tokens) and 254 unique words (types) per document. For the semantic information experiments, a subset of this corpus with the October 1995 news was used. Pub9510 has 4290 documents, with 70743 distinct words, and, on average, 215 tokens and 124 types per document. Table 1 shows the semantic classes and the proportion of documents for each dataset.
Table 1. Publico and Pub9510 corpora: classes and proportion of documents
section                                                            Publico doc %   Pub9510 doc %
ciências, tecnologia e educação (science, technology, education)        6.2             6.7
cultura (culture)                                                       15.5            14.5
desporto (sports)                                                        9.9            10.3
diversos (diverse)                                                       8.2             8.1
economia (economy)                                                      13.3            10.5
local (local)                                                           17.2            21.3
mundo (world)                                                            9.4             9.3
nacional (national)                                                      9.2            10.3
sociedade (society)                                                     11.2             9.1
4.2
Experimental Setup
Traditional bag-of-words representations used a linear kernel, while the semantic information ones were run with the subset tree kernel. WEKA was run with default parameters (normalised training data and c=1, the trade-off between training error and margin) and SVMlight-TK was run with L=0.001 (decay factor) and c=10 (trade-off between training error and margin). A train-and-test procedure was applied with 33% of the documents used for testing. The learners' performance was analysed through the precision (π), recall (ρ) and F1 (f1) measures [19] of each category (obtained from the classification contingency table: prediction vs. manual classification). For each one, we calculated the micro- (µ) and macro-averages (M) and performed significance tests at a 95% confidence level. 4.3
Bag-of-Words Representation
First we considered the traditional bag-of-words representation. It is the typical representation used in Information Retrieval techniques and it serves as a baseline experiment. We used the word and its lemma (lm) as indexing terms and considered several filtering and weighting techniques. Word selection was made using three classes of experiments: stopword elimination (st), a filtering function (word frequency – fr and mutual information – mi) and a threshold value (t). To weight the selected terms we used the three usual components: document (term frequency – t), collection (no component – x, and inverse document frequency – f) and normalisation (cosine – c). All these options can be graphically represented in a tridimensional space with normalisation, selection and weighting axes. In turn, the selection and weighting techniques can also be represented in other three-dimensional spaces. Figure 7 illustrates this, with the marked point corresponding to the experiment with stopword elimination, using the lemma as indexing term, mutual information as the filtering function with threshold equal to one and tfidf as the weighting technique.
Fig. 7. Three-dimensional spaces for normalisation, selection and weighting options
Results. We made experiments with all combinations of the options described above and threshold values ranging from 1 to 4000 (this value indicates the smallest frequency above which a term is selected), in a total of 88 different runs. Table 2 shows the minimum, maximum, average and standard deviation values for the micro- and macro-averages of the performance measures.
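For illustration, one such bag-of-words configuration can be sketched with scikit-learn as follows; this is not the WEKA pipeline used in the experiments, and the stopword list, example documents and threshold value are placeholder assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stopword elimination + frequency threshold + tf-idf weighting + cosine normalisation.
vectorizer = TfidfVectorizer(
    stop_words=["a", "o", "de", "e", "que"],   # illustrative Portuguese stopwords
    min_df=2,                                  # frequency threshold (the paper uses values such as 1 or 50)
    norm="l2",                                 # cosine normalisation
)
docs = ["o governo aprovou o plano de saúde", "o governo discutiu o plano"]
X = vectorizer.fit_transform(docs)             # documents as tf-idf weighted vectors
```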
Table 2. Publico min, max, avg and std values for micro- and macro-averages
        min    max    avg    std
πµ     .787   .843   .824   .012
ρµ     .787   .843   .824   .012
f1µ    .787   .843   .824   .012
πM     .779   .842   .821   .013
ρM     .770   .831   .810   .013
f1M    .774   .836   .815   .013
For each performance measure, we searched for the experiments whose values had no significant difference from the maximum. There were 5 experiments for which all 6 performance measures were in that set:
– word lemmatisation (lm), with threshold value 1 (t1);
– stopword elimination (st), term frequency filtering function (fr) and threshold value 50 (t50);
– stopword elimination, mutual information filtering function (mi), tfidf weighting technique (tfc) and threshold value 50 (t50).
Table 3 displays those performance measures. Boldface values have no significant difference between them.
Table 3. Publico performance values with bag-of-words representation

                   πµ     ρµ     f1µ    πM     ρM     f1M
  lm.fr.txc.t1    .840   .840   .840   .839   .828   .833
  lm.fr.tfc.t1    .843   .843   .843   .842   .831   .836
  st.fr.txc.t50   .839   .839   .839   .837   .826   .831
  st.fr.tfc.t50   .840   .840   .840   .838   .828   .832
  st.mi.tfc.t50   .840   .840   .840   .839   .828   .833
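To make the weighting codes in the run names above concrete (e.g. tfc denotes term frequency with inverse document frequency and cosine normalisation), here is a hedged sketch of the t/x/f/c components; the helper name and defaults are illustrative choices, not code from the paper:

```python
import numpy as np

def weight_matrix(tf, doc_component="t", coll_component="f", normalise=True):
    """tf: n_docs x n_terms raw counts.  Codes: document component t = term frequency,
    collection component x = none, f = inverse document frequency, normalisation c = cosine."""
    tf = np.asarray(tf, dtype=float)
    w = tf.copy()
    if coll_component == "f":
        df = (tf > 0).sum(axis=0)                       # document frequency per term
        idf = np.log(tf.shape[0] / np.maximum(df, 1))
        w = w * idf
    if normalise:                                       # cosine normalisation ('c')
        norms = np.linalg.norm(w, axis=1, keepdims=True)
        w = w / np.maximum(norms, 1e-12)
    return w
```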
4.4 Semantic Representation
One property of the graph that represents a sentence's logical form is that it has no cycles, which allows this structure to be transformed into a tree, a less complex structure. The use of tree structures enables the application of tree kernels which, in turn, are more efficient than the direct product kernel. This transformation comprises two steps:
– the creation of a new vertex (with the same label for all documents) that constitutes the tree's root, together with a set of directed edges from it to each vertex that identifies a DRS condition;
– the replication of the sub-graphs of DRS conditions whenever two or more conditions are connected to the same vertex (in-degree greater than one).
The obtained tree has the following characteristics: the root represents the document, nodes at depth one represent DRS conditions and each of their children represents the restricted referents. For referents substituted by proper nouns and/or properties, another level exists to represent the substitution set. Since it is necessary to indicate each referent's position inside the condition, the tree should be ordered. On the other hand, since DRS conditions constitute a set, it is necessary to order them alphabetically; the same kind of ordering is applied to the vertices that constitute a substitution set. This document representation, which incorporates the document's logical form, was named the discourse-structure representation. The different kinds of substitutions proposed in Section 3.2 give rise to distinct discourse-structure representations. Figure 8 illustrates a two-sentence document, the SIN2SEM output and the discourse-structure representation with both substitutions. This discourse-structure representation corresponds to the graph illustrated in Figure 3. A sketch of the graph-to-tree conversion is given below.
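A minimal sketch of this conversion, assuming a simple in-memory representation of the DRS; the Node class and drs_to_tree helper are illustrative names, not part of SIN2SEM:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def drs_to_tree(conditions: List[Tuple[str, List[str]]],
                substitutions: Dict[str, List[str]]) -> Node:
    """conditions: (predicate, [referents]) pairs taken from the DRS.
    substitutions: referent -> proper nouns / properties substituted for it.
    Shared referents are replicated for every condition that uses them, so the
    acyclic DRS graph becomes a tree rooted at an artificial document node."""
    def referent_node(ref: str) -> Node:
        # One extra level for the (alphabetically ordered) substitution set.
        return Node(ref, [Node(s) for s in sorted(substitutions.get(ref, []))])

    def condition_node(pred: str, refs: List[str]) -> Node:
        # Referent children keep their argument positions inside the condition.
        return Node(pred, [referent_node(r) for r in refs])

    # DRS conditions form a set, so they are ordered alphabetically under the root.
    ordered = sorted(conditions, key=lambda c: c[0])
    return Node("DOC", [condition_node(p, rs) for p, rs in ordered])
```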
Fig. 8. Original document, SIN2SEM output and discourse-structure representation

Results. For the discourse-structure representation (dis) we considered two different kinds of referent substitution: proper noun substitution (noun) and proper noun and property substitution (noun+pro). For each one we also tried using only the first n DRSs of each document, with n ∈ {1, 3, 5, 10}. Aiming to assess the influence of structure on the classification process, we also considered a bag-of-words representation with the words extracted from the discourse-structure representation. Table 4 shows the obtained performance measures (presenting in boldface the values with no significant difference when compared with the best value obtained for each measure).

Table 4. Pub9510 performance values with semantic information

                       πµ     ρµ     f1µ    πM     ρM     f1M
  noun       dis      .655   .655   .655   .732   .599   .623
             dis1     .364   .364   .364   .533   .278   .288
             dis3     .484   .484   .484   .604   .418   .451
             dis5     .545   .545   .545   .660   .481   .510
             dis10    .593   .593   .593   .692   .538   .567
             bag      .821   .821   .821   .816   .808   .810
  noun+pro   dis      .833   .833   .833   .831   .817   .820
             dis1     .471   .471   .471   .484   .437   .445
             dis3     .679   .679   .679   .671   .645   .651
             dis5     .740   .740   .740   .735   .710   .717
             dis10    .787   .787   .787   .780   .772   .773
             bag      .814   .814   .814   .822   .788   .788
5 Evaluation
Looking at the traditional bag-of-words approach (Table 3), one can say that it was possible to reduce the number of attributes (t50) without compromising performance. However, these values were achieved only for experiments with the original words and not with lemmatisation. It also seems that the mutual information filtering function should be used together with the tfidf weighting function, while when filtering by term frequency the weighting function seems indifferent. On the other hand, when using the document's semantic information (Table 4), a structured representation with the proper noun and property substitutions (noun+pro) seems to add valuable information when compared to the corresponding bag-of-words representation (it has better macro-recall and macro-f1 values). To compare both linguistic levels, one using morphological and the other using semantic information, we elected a "best" experiment for each level. For the morphological level we chose the bag-of-words representation using the word's lemma and the tfidf weighting measure with cosine normalisation (lm.fr.tfc.t1), while for the semantic one we chose the discourse-structure representation with proper noun and property condition substitutions (dis.noun+pro).
Table 5. Pub9510 performance values for both linguistic information levels

                    πµ     ρµ     f1µ    πM     ρM     f1M
  Morphological    .855   .855   .855   .854   .840   .844
  Semantic         .833   .833   .833   .831   .817   .820
Since the discourse-structure experiments used a subset of the Publico corpus, an SVM was run on this subset (Pub9510) for the chosen bag-of-words representation. Table 5 displays the performance values for each linguistic information level (once again, boldface points to values with no significant differences). Although obtaining the semantic representation is more time consuming than the morphological one, it is possible to say that both representations have the same discriminative power over the classes, since there is no significant difference between the performance values. Moreover, we can also say that the semantic information performs a valid form of attribute selection, since it uses 46186 types while the traditional bag-of-words approach uses 70743.
6 Conclusions
This paper presented a series of experiments aiming at comparing our proposal of incorporating linguistic information using structured representations with the usual methods adopted in text classification problems. The experiments were conducted on a dataset written in Portuguese. Concerning morphological information, the results show that, when properly combined, word normalisation, filtering and weighting functions and threshold values can sharpen performance. Comparing the Publico dataset results with previous work on text classification of Portuguese documents [6], one can conclude that the best combination depends on the dataset (or its domain). The results also show that, when using semantic information, the discourse-structure representation with proper noun and property substitutions presents at least the same discriminative power as the non-structured representation. Moreover, the traditional bag-of-words approach (morphological information) and the proposed use of semantic information show equivalent performance values. This demonstrates that both representations, one based on statistics over words and the other based on the document's meaning, are valid. Considering the number of types used by the morphological and semantic representations, it is possible to say that the document's logical form performs a valid form of attribute selection: a reduction of about 30% was accomplished. Finally, one can conclude that the proposed discourse-structure representation is able to capture the document's logical form and seems promising, since at this time it only describes the document's meaning partially. We believe that by perfecting the document's logical form, the semantic representation's performance will become higher than the morphological one's.
7 Future Work
Regarding future work, we intend to perform further tests on different collections/domains and languages. It will be important to evaluate whether these results are bound to the Portuguese language and/or the kind of dataset domain. We also want to verify the validity of this approach using the graph kernel directly. On the other hand, it is possible to obtain a semantic representation of the document that is closer to its real meaning by eliminating some of the known limitations of the natural language tools used. Although PALAVRAS always generates an output, even in the presence of incorrect or incomplete sentences, in some situations (for example, interrogative sentences) it generates incorrect parse trees. These errors are then propagated to the DRS generation, since the parse trees are the input of SIN2SEM. SIN2SEM can also be refined by removing some of its limitations. For example, it does not implement verb sub-categorisation and is therefore unable to add specific verb properties that would otherwise capture the sentence's meaning more closely. It also does not give the proper output for noun clauses with more than one entity. Finally, the DRSs would translate the true meaning of documents more accurately with the incorporation of other linguistic information such as synonyms, anaphora resolution and the identification of named entities.
References

1. Backus, J.: The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM Conference. In: Proceedings of the International Conference on Information Processing – IFIP Congress, pp. 125–132. UNESCO, Paris (1959)
2. Bick, E.: The Parsing System PALAVRAS – Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press (2000)
3. Bloehdorn, S., Moschitti, A.: Combined syntactic and semantic kernels for text classification. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 307–318. Springer, Heidelberg (2007)
4. Chomsky, N.: Three models for the description of language. IRE Transactions on Information Theory 2(3), 113–124 (1956)
5. Collins, M., Duffy, N.: New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In: ACL 2002, 30th Annual Meeting of the Association for Computational Linguistics, pp. 263–270 (2002)
6. Gonçalves, T., Silva, C., Quaresma, P., Vieira, R.: Analysing part-of-speech for Portuguese text classification. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 551–562. Springer, Heidelberg (2006)
7. Gärtner, T.: Exponential and geometric kernels for graphs. In: NIPS 2002, 16th Neural Information Processing Systems – Workshop on Unreal Data: Principles of Modeling Nonvectorial Data (2002)
8. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: Hardness results and efficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
9. Haussler, D.: Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz (1999)
10. Imrich, W., Klavžar, S.: Product Graphs: Structure and Recognition. John Wiley, Chichester (2000)
11. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge (1999)
12. Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Dordrecht (2002)
13. Kamp, H., Reyle, U.: From Discourse to Logic: An Introduction to Model Theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Studies in Linguistics and Philosophy, vol. 42. Springer, Heidelberg (1993)
14. Kashima, H., Inokuchi, A.: Kernels for graph classification. In: ICDM 2002, IEEE International Conference on Data Mining – Workshop on Active Mining (2002)
15. Lopes, J.G., Marques, N.C., Rocio, V.J.: Polaris: POrtuguese Lexicon Acquisition and Retrieval Interactive System. In: The Practical Applications of Prolog, p. 665. Royal Society of Arts (1994)
16. Moschitti, A.: A study on convolution kernels for shallow semantic parsing. In: ACL 2004, 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, SP, pp. 335–342 (2004)
17. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)
18. Quaresma, P., Quintano, L., Rodrigues, I., Salgueiro, P.: University of Évora in QA@CLEF-2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 534–543. Springer, Heidelberg (2005)
19. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
20. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
21. Vishwanathan, S.V., Smola, A.J.: Fast kernels on strings and trees. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15, pp. 569–576. MIT Press, Cambridge (2003)
22. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
23. Zhang, D., Lee, W.S.: Question classification using support vector machines. In: SIGIR 2003, 26th ACM International Conference on Research and Development in Information Retrieval, pp. 26–32 (2003)
A General Framework of Feature Selection for Text Categorization

Hongfang Jing1,2, Bin Wang1, Yahui Yang3, and Yan Xu4

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
  {jinghongfang,wangbin}@ict.ac.cn
2 Graduate University, Chinese Academy of Sciences, Beijing, 100080, China
3 School of Software & Microelectronics, Peking University, Beijing, 102600, China
  [email protected]
4 Center of Network Information and Education Technology, Beijing Language and Culture University, Beijing, 100083, China
  [email protected]

Abstract. Many feature selection methods have been proposed for text categorization. However, their performances are usually verified by experiments, so the results rely on the corpora used and may not be accurate. This paper proposes a novel feature selection framework called Distribution-Based Feature Selection (DBFS) based on the distribution difference of features. This framework generalizes most of the state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. The performances of many feature selection methods can be estimated by theoretical analysis using components of this framework. Besides, DBFS sheds light on the merits and drawbacks of many existing feature selection methods. In addition, this framework helps to select suitable feature selection methods for specific domains. Moreover, a weighted model based on DBFS is given so that suitable feature selection methods for unbalanced datasets can be derived. The experimental results show that they are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
1 Introduction
Feature Selection (FS) is a technique that selects the most effective features from the original feature space to reduce the feature space dimension. It improves the efficiency of categorization significantly. FS can be classified into filter methods and wrapper methods depending on whether they rely on categorization algorithms. Wrapper methods may be more effective than filter methods, but their computational costs are high. The measuring functions of filter methods are independent of categorization algorithms, so filter methods are often more efficient than wrapper methods [8,10]. We focus on filter methods because of their effectiveness and efficiency. A variety of FS methods have been proposed from different perspectives. Some methods are designed for special categorization algorithms [14,16,17]; also, a series of FS methods are designed based on Genetic Algorithm, Kernel Space,
Rough Set Theory or other techniques [4,13,27]. Besides, a type of feature selection method aiming at selecting category information words has been proposed in recent years [6,12,29]. Category information words are features that are usually present in certain categories but hardly in others. In addition, some research considers the stability or worst-case generalization power of the categorization function [1,25]. Moreover, some feature selection strategies are aware of contextual information or correlation among different features [5,18,19,21]. Furthermore, many other types of FS methods have also been proposed [2,11,15,26]. In this paper, we will not list them all.
Yiming Yang compared the performances of some FS methods and analyzed their characteristics [24]. However, no research has shown how to determine the performance and suitable applications of an FS method. Balanced and unbalanced are two basic characteristics of applications. In balanced datasets, the number of documents in every category is similar, while in unbalanced datasets the number of documents per category differs hugely. A common category contains a large number of documents, while a rare category consists of only a few documents. We summarize some typical FS studies to compare their performance on unbalanced datasets.
Among traditional feature selection approaches, the popular ones are Information Gain (IG), χ²-Test (CHI), Document Frequency (DF), Term Strength (TS), Mutual Information (MI), Odds Ratio (OR), Weight of Evidence for Text (WET) and Expected Cross Entropy (ECE). IG and CHI are regarded as two of the most effective approaches [24]. However, IG does not perform very well on unbalanced datasets [17]. Besides, the OCFS algorithm is more efficient and effective than IG and CHI in handling balanced data [22], but its objective function is not suitable for unbalanced data. Relief-type algorithms are successful attribute estimators; they estimate the quality of features according to how well their values distinguish neighbouring instances [21]. But they are not suitable for unbalanced datasets, because rare categories contain only a few instances, so it is difficult to distinguish their instances from the neighbours belonging to common categories. Zhaohui Zheng et al. demonstrated that explicitly and optimally combining the positive and negative features can improve text categorization performance for unbalanced datasets [28]. However, it is time-consuming to obtain the ratios of positive to negative features for each category.
A basic research question is whether there is a general form of FS methods. By analyzing an FS method using this general form, we can obtain the merits and drawbacks of the FS method and also its suitable applications. Further, optimal or suitable FS methods for specific applications can be derived from it. In this paper, we derive the essence of FS from rough set theory, and then propose a novel FS framework based on the distribution characteristics of features contributive to categorization. It is named Distribution-Based Feature Selection (DBFS). It aims to select features that distribute unevenly among different categories but distribute as evenly as possible within each category. The main merits of DBFS are: 1) it is a universal framework of feature selection which is highly extensible; 2) the merits and drawbacks of FS methods can be
discovered by analyzing their formulae for each component of this framework; 3) it helps to understand which FS method performs better and for what reason; 4) suitable FS methods for specific applications can be obtained from it. To show the effectiveness and efficiency of our DBFS implementations, we conducted experiments on the Chinese Text Categorization Corpus of Fudan University (Fudan Corpus), 20 Newsgroups (20NG) and Reuters-21578. 20NG is a balanced dataset and Reuters-21578 is an unbalanced dataset. They are both English datasets, while Fudan Corpus is a Chinese and unbalanced dataset. The rest of this paper is organized as follows. In Section 2, IG, CHI and OCFS are introduced. Section 3 describes the DBFS framework and its implementations; the relationships between DBFS and OCFS, MI, ECE, IG, CHI and OR are also analyzed. Section 4 introduces the corpora and then presents the experimental results as Micro F1 and Macro F1 curves. The results are discussed in Section 5. Section 6 draws the main conclusion and suggests future work.
2 Related Work
In this section, some feature selection methods are introduced briefly. IG, CHI and OCFS are used as the baselines due to their effectiveness and popularity. First, some symbols used in this paper need to be introduced. Let t denote a feature, t̄ denote the absence of t, d denote a document, C denote the training set, {c1, ..., ci, ..., cm} denote the category set of C, ni denote the number of documents in ci, n denote the total number of documents in C, and x_dt denote the weight of t for d in the text representation model.
2.1 Information Gain (IG)
Information gain measures the information obtained for category prediction when knowing the presence or absence of a feature. The information gain of t is:

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i|t)\log P(c_i|t) + P(\bar{t})\sum_{i=1}^{m} P(c_i|\bar{t})\log P(c_i|\bar{t})
Information gain takes feature absence into account. For unbalanced datasets, IG(t) mainly depends on P(t), and this results in a performance decline.
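A minimal sketch of how IG(t) can be computed from document counts; the helper name and its count-based arguments are illustrative assumptions, not code from the paper:

```python
import math

def information_gain(n_tc, n_c, n_t, n):
    """n_tc[i]: docs of category i containing t, n_c[i]: docs in category i,
    n_t: docs containing t, n: total docs (maximum-likelihood probability estimates)."""
    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    ig = -sum(plogp(nc / n) for nc in n_c)                        # -Σ P(ci) log P(ci)
    if n_t > 0:                                                   # + P(t) Σ P(ci|t) log P(ci|t)
        ig += (n_t / n) * sum(plogp(ntc / n_t) for ntc in n_tc)
    n_nt = n - n_t
    if n_nt > 0:                                                  # + P(t̄) Σ P(ci|t̄) log P(ci|t̄)
        ig += (n_nt / n) * sum(plogp((nc - ntc) / n_nt)
                               for nc, ntc in zip(n_c, n_tc))
    return ig
```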
2.2 Chi-Square (CHI)
The χ²(t, ci) statistic measures the lack of independence between t and ci. When χ²(t, ci) becomes larger, the independence between them decreases and the correlation between them increases, and vice versa. χ²(t, ci) is expressed as:

\chi^2(t, c_i) = \frac{N\,[P(t, c_i)\cdot P(\bar{t}, \bar{c_i}) - P(t, \bar{c_i})\cdot P(\bar{t}, c_i)]^2}{P(t)\cdot P(\bar{t})\cdot P(c_i)\cdot P(\bar{c_i})}    (1)
In the above formula, c̄i denotes the collection of all the categories except ci.
In general, the statistic between t and all categories can be measured by χ²_avg(t) = Σ_{i=1}^{m} P(ci) χ²(t, ci) or χ²_max(t) = max_{i=1}^{m} χ²(t, ci). The CHI statistic is a normalized value which ensures that the scores for the same category are comparable, but it is not reliable for low-frequency features [24].
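For concreteness, a hedged sketch of χ²(t, ci) from 2×2 contingency counts, together with the averaged and maximum variants; names and argument conventions are illustrative:

```python
def chi_square(a, b, c, d):
    """a = docs in ci containing t, b = docs outside ci containing t,
    c = docs in ci without t,  d = docs outside ci without t."""
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return num / den if den else 0.0

def chi_square_scores(n_tc, n_c, n_t, n):
    """chi2_avg and chi2_max for one term; counts as in information_gain above."""
    avg, best = 0.0, 0.0
    for nc, ntc in zip(n_c, n_tc):
        chi = chi_square(ntc, n_t - ntc, nc - ntc, n - n_t - (nc - ntc))
        avg += (nc / n) * chi           # weight P(ci)
        best = max(best, chi)
    return avg, best
```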
2.3 Optimal Orthogonal Centroid Feature Selection (OCFS)
OCFS optimizes the objective function implied by the Orthogonal Centroid algorithm in the discrete solution space H^{d×p}. Intuitively, it maximizes the sum of distances between all the category means [22]. The following symbols are defined for OCFS:
oc_i: the centroid of ci; its value for t, denoted oc_i(t), is computed as oc_i(t) = (1/n_i) Σ_{d∈c_i} x_dt.
oc: the centroid of C; its value for t, denoted oc(t), is computed as oc(t) = (1/n) Σ_{d∈C} x_dt.
The score for t by OCFS is computed as

s(t) = \sum_{i=1}^{m} \frac{n_i}{n}\left(oc_i(t) - oc(t)\right)^2    (2)
In OCFS, common categories have much larger weights than rare categories, so OCFS favors common categories. Hence, OCFS performs well on balanced datasets but may not do well on unbalanced datasets.
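A minimal sketch of the OCFS score in (2) for a weighted term-document matrix, assuming NumPy arrays; names are illustrative:

```python
import numpy as np

def ocfs_scores(X, y):
    """X: n_docs x n_terms weight matrix (e.g. tf-idf), y: category index per document.
    Returns one OCFS score per term: sum_i (n_i / n) * (oc_i - oc)^2."""
    n = X.shape[0]
    oc = X.mean(axis=0)                          # global centroid
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xi = X[y == label]
        oci = Xi.mean(axis=0)                    # centroid of category i
        scores += (Xi.shape[0] / n) * (oci - oc) ** 2
    return scores
```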
3 DBFS Framework
For the sake of discussion, some symbols are defined as follows:
TF_d(t): the number of times that t appears in d.
|d|: the length of d, i.e., the sum of TF_d(t) over all terms.
|V_d|: the size of V_d, where V_d is the vocabulary of d.
DF(t): the document frequency of t, i.e., the number of documents in which t is present; DF(t̄) is the number of documents in which t is absent.
cdf_i(t): the document frequency of t in ci.
CF(t): the number of categories in which t is present.
3.1 Main Idea
The variable precision rough set model allows a certain classification error rate. For non-empty subsets X and Y in a universe U, the classification error rate caused by classifying X to Y is c(X, Y) = 1 − |X ∩ Y|/|X|. Let K = (U, R) denote a knowledge base where R is the set of equivalence relations. For P, Q ⊆ R, Q depends on P in degree k = γ_P(Q) = |pos_P(Q)|/|U| [20].
Let U denote the vector space that consists of the training set, X = {d | TF_d(t) > 0} and Y = {d | d ∈ ci}. Then the error rate caused by classifying d to ci according to the presence of t is c(X, Y) = 1 − cdf_i(t)/DF(t). Similarly, the corresponding figure according to the absence of t is c(X', Y) = 1 − cdf_i(t̄)/DF(t̄), where X' = {d | TF_d(t) = 0}. A small c(X, Y) assures that the presence of t is important for classifying documents to ci, and a small c(X', Y) implies that the absence of t is contributive to classifying documents belonging to ci. So the precision for ci according to t is max(cdf_i(t)/DF(t), cdf_i(t̄)/DF(t̄)).
Assume that n_i = n_j and cdf_i(t̄) = n_i if cdf_i(t) = 0, where i, j = 1, ..., m and i ≠ j; then max(cdf_i(t)/DF(t), cdf_i(t̄)/DF(t̄)) ⇔ max(1/CF(t), 1/CF(t̄)). Therefore, if t is only present in ci, then d belongs to ci when t is present in d. Similarly, if t is only absent in ci, then d is most likely to belong to ci when t is absent in d. Intuitively, features discriminating categories should be present or absent in fewer categories than other features. From the distribution point of view, their distributions among different categories are unbalanced.
However, only few features are present or absent in all the documents of a category. Let T = {t, t̄} and Y = {d | d ∈ ci}. If t is only present or absent in part of the documents in ci, then U/T = {{ci}, {c1, ..., c_{i−1}, c_{i+1}, ..., cm}}. So when t is only present or absent in part of the documents in ci, the dependence degree is

γ_T(Y) = γ_t(Y) = cdf_i(t)/n    or    γ_T(Y) = γ_t̄(Y) = cdf_i(t̄)/n = (n_i − cdf_i(t))/n
In order to categorize documents belonging to ci more precisely according to t or t̄, a large γ_t(Y) or γ_t̄(Y) is needed. This implies that features indicating categories should be present or absent in as many documents within those categories as possible. This characteristic reveals that effective features for categorization distribute more evenly within categories than other features. From the above analysis, we can conclude that feature selection methods should select features that spread unevenly among different categories but spread as evenly as possible within each category.
3.2 Framework
The following functions measure the degree of difference in feature distribution.

Definition 1. Distribution Difference between ci and other Categories (DACi) of t, denoted Dinter(t, ci), measures the degree of distribution difference for t between ci and the other categories.

Definition 2. Distribution Difference among Categories (DAC) of t, denoted Dinter(t), measures the degree of distribution difference for t among different categories.

Large DAC implies that feature distributions vary greatly from category to category, while small DAC means that feature distributions remain comparatively stable among different categories. Features with small DAC have little discriminative information, such as stop words. Thus, feature selection methods should select features with large DAC.

Definition 3. Distribution Difference within Category ci (DICi) of t, denoted Dinner(t, ci), measures the degree of distribution difference for t in all documents belonging to ci. Features with large DICi distribute unevenly in documents belonging to ci.
Definition 4. Distribution Difference within all Categories (DIC) of t, denoted Dinner(t), measures the degree of global distribution difference for t within each category.

Large DIC means that features distribute unevenly within some categories. Hence, features that are indicative of categories should have small DIC. Table 1 shows the characteristics of features that have discriminative information and of noises. Note that a feature having one or more of the characteristics listed in the second row is a noise.

Table 1. The Characteristics of Features

  Feature Type            DACi    DAC     DICi    DIC
  Indicating Categories   Large   Large   Small   Small
  Noises                  Small   Small   Large   Large
From the above analysis, we can conclude that features important for categorization should have large DAC and small DIC. A score is used to measure the importance degree of t, and it can be computed as score(t) = Dinter(t) − Dinner(t) or score(t) = Dinter(t)/Dinner(t), where

Dinter(t) = Σ_{i=1}^{m} w_i Dinter(t, ci),   Dinter(t, ci) = f(F(t, ci), F(t)),
Dinner(t) = Σ_{i=1}^{m} w_i Dinner(t, ci),   Dinner(t, ci) = g(F(t, ci), F(t, d)).

F(t, d) is the distribution function of t in d and F(t, ci) is the distribution function of t in ci. Similarly, F(t) is the distribution function of t in c̄i or in C, depending on the objective of the feature selection method. For feature selection methods measuring the distribution difference of features between each category and the other categories, F(t) should be F(t, c̄i). This kind of method is relatively time-consuming. Taking efficiency into account, F(t, C) is used instead because it does not need to be recalculated for different categories. However, this kind of method is relatively less effective. Distribution difference can be measured by variance, distance or other ways.
3.3 Implementations
This section gives some implementations of DBFS. By maximum likelihood estimation, F(t, d) can be computed as

F(t, d) = \frac{TF_d(t) + 1}{|d| + |V_d|}

But important features do not necessarily have large weights in the text representation model. For instance, features indicating a common category may be present in most of the documents of this category, so their weights in the TFIDF model are not large enough to express their importance. We do not improve the TFIDF model in this paper, so we use F(t, d) = x_dt even though this will reduce performance. F(t, ci) and F(t) are computed by:

F(t, c_i) = \frac{cdf_i(t)}{n_i} \times \frac{1}{n_i}\sum_{d\in c_i} F(t, d), \qquad F(t) = \sum_{i=1}^{m} w_i F(t, c_i)
In the above formulae, F(t, ci) penalizes features appearing in a minority of documents, so it is anti-noise. For example, if there are a few garbage documents produced by duplicating one word many times, F(t, ci) declines by multiplying the factor cdf_i(t)/n_i; then the score for t may decline, and t may avoid being selected. The following are two obvious methods to compute F(t) from F(t, ci):

F(t) = \frac{1}{m}\sum_{i=1}^{m} F(t, c_i)    (3)

F(t) = \sum_{i=1}^{m} \frac{n_i}{n} F(t, c_i)    (4)
Formula (3) treats each category equally; therefore, it is suitable for unbalanced datasets. Formula (4) gives much larger weights to common categories than to rare categories, so it performs well on common categories.

D_{inner}... 
D_{inter}(t) = \sum_{i=1}^{m} w_i\left(F(t, c_i) - F(t)\right)^2    (5)
The weighted model (5) measures DAC. Common categories have a huge influence on Micro F1. In order to achieve satisfactory Micro F1 performance, the weight function w(x) in formula (5) is generally a function of the category sizes; w(n_i) is used as the weight for ci, that is, w_i. This function can be simply expressed as w(x) = x.

D_{inner}(t, c_i) = \frac{1}{n_i}\sum_{d\in c_i}\left(F(t, d) - F(t, c_i)\right)^2    (6)
The above formula can be used to measure Dinner(t, ci), but the computational cost is high when processing a large-scale dataset. In order to compute Dinner(t, ci) efficiently, simplified binary forms of F(t, d) and F(t, ci) are used: if t appears in d, then F(t, d) is equal to 1, otherwise it is equal to 0. In this case, F(t, ci) = cdf_i(t)/n_i and Dinner(t, ci) becomes
D_{inner}(t, c_i) = \frac{1}{n_i}\sum_{d\in c_i}\left(F(t, d) - F(t, c_i)\right)^2 = \frac{1}{4} - \left(\frac{cdf_i(t)}{n_i} - \frac{1}{2}\right)^2    (7)
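For completeness, the closed form in (7) follows from the binary simplification: writing p = cdf_i(t)/n_i, a fraction p of the documents in ci have F(t, d) = 1 and the rest have F(t, d) = 0, so the average squared deviation is the Bernoulli variance:

\frac{1}{n_i}\sum_{d\in c_i}\bigl(F(t,d)-F(t,c_i)\bigr)^2 = p(1-p)^2 + (1-p)(0-p)^2 = p(1-p) = \frac{1}{4} - \Bigl(\frac{cdf_i(t)}{n_i} - \frac{1}{2}\Bigr)^2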
D_{inner}(t) = \frac{1}{m}\sum_{i=1}^{m} D_{inner}(t, c_i) + \alpha    (8)

score(t) = D_{inter}(t) / D_{inner}(t)    (9)
In order to handle the case where Dinner(t) would otherwise be 0 in (9), a very small real number α is added in (8). In our experiments, the value of α is 0.001. The DBFS implementation computing F(t) by (3) is named DBFS-Category (DBFS-Cat) and the one using (4) is named DBFS-Document (DBFS-Doc). The computation method of F(t) implies that DBFS-Cat performs well on each category
especially on rare categories when the selected features are few, while DBFS-Doc performs well on common categories. In formula (5), the larger weights for common categories lead to better performance on common categories. Similarly, better performance on rare categories can be achieved by giving larger weights to rare categories. DBFS-Cat becomes DBFSSqr-Cat and DBFSLog-Cat corresponding to w(x) = √x and w(x) = log(x), respectively. Similarly, this idea also applies to the other implementations of DBFS. DBFS-Cat concerns common categories more than DBFSLog-Cat does, so it performs better on common categories than DBFSLog-Cat. But DBFSLog-Cat concerns rare categories more than DBFS-Cat does, and thus performs better on rare categories than DBFS-Cat. Other implementations have similar behaviors. The time complexity of all the given DBFS implementations is O(Vn), where V is the vocabulary size of the training set and n is the total number of documents in the training set. This is the same as OCFS, CHI and IG. For our DBFS implementations, the computation of Dinter(t) is so similar to OCFS that the running times of the two algorithms are very close. Also, the time complexity of Dinner(t) is O(Vm), where m is the number of categories (m is much smaller than n). Hence, the implementations of DBFS in this paper are as fast as OCFS. OCFS can process a dataset in only about half the time of IG and CHI [22]. Thus our DBFS implementations are faster than CHI and IG.
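The following sketch puts formulas (3)-(9) together under the binary simplification F(t, d) ∈ {0, 1} (the paper's full version uses TF-IDF-based F(t, d) inside Dinter); the function name, argument layout and defaults are our own illustrative choices:

```python
import numpy as np

def dbfs_scores(cdf, n_c, f_t="doc", w=None, alpha=1e-3):
    """cdf: m x V array, cdf[i, t] = document frequency of term t in category i.
    n_c: number of documents per category (length m).
    f_t: 'cat' -> formula (3), uniform average; 'doc' -> formula (4), size-weighted.
    w:   per-category weights w_i in (5); defaults to n_i, i.e. w(x) = x."""
    cdf = np.asarray(cdf, dtype=float)
    n_c = np.asarray(n_c, dtype=float)
    n = n_c.sum()

    p = cdf / n_c[:, None]                 # p[i, t] = cdf_i(t) / n_i
    F_tc = p * p                           # anti-noise factor times the binary within-category mean

    if f_t == "cat":                       # formula (3)
        F_t = F_tc.mean(axis=0)
    else:                                  # formula (4)
        F_t = (n_c[:, None] / n * F_tc).sum(axis=0)

    if w is None:
        w = n_c                            # use np.sqrt(n_c) / np.log(n_c) for the Sqr / Log variants
    w = np.asarray(w, dtype=float)

    d_inter = (w[:, None] * (F_tc - F_t) ** 2).sum(axis=0)      # formula (5)
    d_inner = (0.25 - (p - 0.5) ** 2).mean(axis=0) + alpha      # formulas (7) and (8)
    return d_inter / d_inner                                    # formula (9)
```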
3.4 The Relationships between DBFS and Some FS Methods
DBFS is a general framework of feature selection methods. The computation approaches of each component can be modified to accommodate a variety of feature selection methods. Table 2 presents the formulae of OCFS, MI, ECE, IG, CHI and OR for each component involved in DBFS. Here let MIavg and CHIavg stand for MI and CHI, respectively. From the table, we can conclude that OCFS, MI, ECE, IG, CHI and OR also measure Dinter(t), which is a component of DBFS, so they are special implementations of the DBFS framework. By analyzing the computation approaches of an FS method for each component, its characteristics and suitable applications can be derived.

Table 2. The Relationships between DBFS and Some FS Methods

  Method  F(t,d)  F(t,ci)                    F(t)                        Dinter(t,ci)                                   wi in Dinter(t)       score(t)
  OCFS    x_dt    oc_i(t)                    oc(t)                       (F(t,ci) − F(t))²                              P(ci)                 Dinter(t)
  MI      –       log P(t|ci)                log P(t)                    F(t,ci) − F(t)                                 P(ci)                 Dinter(t)
  ECE     –       log P(t,ci)                log P(t)                    F(t,ci) − F(t)                                 P(t,ci)               Dinter(t)
  IG      –       log P(t,ci)                log P(t)                    F(t,ci) − F(t)                                 P(t,ci)               Dinter(t) + Dinter(t̄)
  CHI     –       P(t|ci)                    P(t|c̄i)                     [n·P(ci)·P(c̄i)/(P(t)·P(t̄))]·(F(t,ci) − F(t))²  P(ci)                 Dinter(t)
  OR      –       log(P(t|ci)/(1−P(t|ci)))   log(P(t|c̄i)/(1−P(t|c̄i)))   F(t,ci) − F(t)                                 1 for ci, 0 for c̄i    Dinter(t)

OCFS assigns much larger weights to common categories
than to rare categories in the computations of F(t) and Dinter(t), so it performs better on common categories than on rare categories. This is the reason why OCFS is not suitable for unbalanced datasets. The formula for Dinter(t, ci) in MI shows that MI selects positive features. Positive features indicate documents belonging to categories. Compared to MI, ECE uses P(t, ci) rather than P(ci) as the weight for ci. In general, low-frequency features have small P(t, ci); thus Expected Cross Entropy avoids selecting low-frequency features to a certain extent. The Dinter(t, ci) in CHI is not reliable for low-frequency features, because its value is much larger for low-frequency features than for high-frequency features in the same category. The original OR actually uses Dinter(t, ci) as the score for t, so it favors selecting features that are indicative of ci (the positive category). Thus, it is suitable for applications which concentrate on a special category. OCFS, MI, ECE, IG, CHI and OR measure DAC in some way. However, all of them except OCFS do not take F(t, d) into account, and the other functions used are not appropriate enough. Our DBFS implementations take F(t, d) into account and adopt more precise functions and proper weights for each category to measure distribution difference. Furthermore, they also consider DIC. So they measure the importance of features more precisely and are more effective. Moreover, weights can be flexibly assigned to categories and F(t) can be computed in different ways according to the characteristics of applications, so proper FS methods can be easily derived from DBFS for different applications. For unbalanced datasets, the weight function should assign large enough weights to rare categories and F(t) should be computed by (3).
4 Experiments
To compare the effectiveness of different FS methods, an SVM classifier is used, because it is the top-performing classifier among KNN, Naive Bayes, SVM, Logistic Regression and Decision Trees. Also, it is widely used in text categorization research [3,7,22,23]. The implementation of SVM is supplied by Weka (http://www.cs.waikato.ac.nz/ml/weka). A linear kernel and the default values for all parameters are used. The pre-processing run on the English corpora includes removal of stop words and stemming. For the Chinese corpus, the pre-processing is segmentation using ICTCLAS, provided by the Institute of Computing Technology, Chinese Academy of Sciences. After this, nouns, verbs and adjectives are selected as features.
4.1 Datasets
Three datasets are used: Fudan Corpus (http://www.nlp.org.cn/categories/default.php?cat_id=16), Reuters-21578 (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html) and 20NG (http://people.csail.mit.edu/jrennie/20Newsgroups) [9].

Fudan Corpus. Fudan Corpus contains 20 categories including 19,637 documents collected by the Center for International Database of Fudan University. After
removing duplications and damaged documents, the resulting dataset consists of 14,377 documents, of which 8,214 belong to the train set and 6,163 to the test set. This dataset is unbalanced. The largest category contains 1,369 training documents, while the smallest category contains only 25 training documents. Furthermore, there are 11 categories in the corpus each of which has fewer than 100 training documents.

Reuters-21578. The ModApte version of Reuters-21578 is used. There are 90 categories in total. Each category contains at least one training document and one test document. 7,769 documents belong to the train set and 3,019 documents belong to the test set. It is unbalanced. The train set of the largest category consists of 2,877 documents, but the train sets of 82% of the categories contain fewer than 100 documents. We use this corpus as a single-label dataset by keeping only the first label if a document is multi-labelled. To avoid wide fluctuations of performance, we select the categories whose train set consists of more than ten documents and whose test set consists of more than 5 documents. Finally, 37 categories are selected. There are 20 categories each of which has fewer than 100 training documents.

20NG. 20NG is a newsgroup collection containing approximately 20000 Usenet articles. Each newsgroup is a category which consists of nearly 1000 documents. Each document belongs to exactly one newsgroup. We use the "bydate" version of 20 Newsgroups, whose train and test sets were split by its provider. This dataset is a balanced corpus.
4.2 Performance Measures
Precision, Recall and F1 for ci, as well as MicroP and MicroR, are computed as follows:

P_i = \frac{a_i}{b_i}, \quad R_i = \frac{a_i}{d_i}, \quad F1_i = \frac{2 P_i R_i}{P_i + R_i}, \quad MicroP = \frac{\sum_{i=1}^{m} a_i}{\sum_{i=1}^{m} b_i}, \quad MicroR = \frac{\sum_{i=1}^{m} a_i}{\sum_{i=1}^{m} d_i}

where a_i is the number of documents correctly classified to ci, b_i is the total number of documents classified to ci and d_i is the number of documents in ci.

MicroF1 = \frac{2\,MicroP \times MicroR}{MicroP + MicroR}, \qquad MacroF1 = \frac{\sum_{i=1}^{m} F1_i}{m}

Micro F1 largely depends on common categories, while Macro F1 is influenced by each category. In order to compare the performances on all categories and on each category, both Micro F1 and Macro F1 are used in this paper.
4.3 Experimental Results
All the experimental results are shown by the Micro F1 and Macro F1 curves in this section.
Fig. 1. Micro F1 and Macro F1 results on Fudan Corpus
Fig. 2. Micro F1 and Macro F1 results of DBFSs on Fudan Corpus
Fudan Corpus. Fig. 1 shows that DBFS-Doc outperforms OCFS, IG and CHI consistently by both Micro F1 and Macro F1. For Micro F1, the improvement is 4%, 9.7% and 5.7% at most compared to OCFS, IG and CHI, respectively. For Macro F1, the corresponding improvements are 12.2%, 49.7% and 41%. The highest Micro F1 and Macro F1 are both obtained by DBFS-Doc. So DBFS-Doc is more effective than OCFS, IG and CHI on this unbalanced dataset. Fig. 2 compares the performance of the different DBFS implementations presented in this paper. DBFS-Cat performs better than DBFS-Doc when the
number of features is less than 1000, but worse than DBFS-Doc when the number of features gets larger. The reason for this is that DBFS-Cat concerns rare categories more than DBFS-Doc does. When only a few features are selected for rare categories, the performance on them is good, so DBFS-Cat achieves better results than DBFS-Doc. However, when the number of features is large, DBFS-Cat may select too many features for rare categories, in which noise is involved, but not enough features for common categories. DBFS-Doc can select enough features for common categories and not too many features for rare categories, so it achieves good performance on each category. The relationship between DBFSLog-Doc and DBFS-Doc is similar to the relationship between DBFS-Cat and DBFS-Doc. Therefore, the weight function can be flexibly designed to obtain the expected performance. For example, good performance on rare categories can be obtained by DBFSLog-Doc. The Macro F1 improvement by DBFSLog-Doc is 26.4%, 67.5% and 58.8% at most compared to OCFS, IG and CHI.

Fig. 3. Micro F1 and Macro F1 results on Reuters-21578

Reuters-21578. This corpus is so unbalanced that common categories have much influence on both Micro F1 and Macro F1. DBFSSqr-Doc is used as the implementation of DBFS. It favors common categories, while rare categories are also taken into account to a certain extent. Fig. 3 illustrates that the highest Micro F1 and Macro F1 are obtained by DBFSSqr-Doc. OCFS and DBFSSqr-Doc achieve their best Micro F1 performance using 500 features, while IG and CHI actually need 3000 and 10000 features to reach their best Micro F1, respectively. OCFS, CHI and DBFSSqr-Doc achieve their best Macro F1 when the number of features is 500, but IG reaches its best Macro F1 using 1000 features. Compared to OCFS, IG and CHI, the Macro F1 improvement by DBFSSqr-Doc is 12.2%, 58.5% and 19.7% at most. Note
that, when the number of features is moderate, DBFSSqr-Doc performs slightly worse than OCFS, because the features it selects are too many for rare categories but may not be enough for common categories. However, DBFSSqr-Doc outperforms the other methods when the number of features is small or large enough, and it achieves the best overall performance. Thus, DBFSSqr-Doc performs best among them. The results also validate that OCFS is more effective than IG and CHI, which is consistent with the conclusion in [22].

20NG. For balanced datasets, the weights for each category are almost equal to each other no matter what kind of weight function is used. 20NG is a balanced dataset, so DBFS-Doc is used because its weight function is simple to compute. The Macro F1 curves are very similar to the Micro F1 curves, so the Macro F1 results are not given here. Besides, some results are so close that it is difficult to distinguish them in a graph; hence they are listed in a table.

Table 3. The Micro F1 results on 20NG

  Micro F1     100    500    1000   5000   10000
  OCFS        .524   .693   .735   .772   .793
  IG          .521   .691   .721   .764   .78
  CHI         .506   .68    .734   .766   .78
  DBFS-Doc    .552   .712   .739   .774   .794
Table 3 illustrates that DBFS-Doc performs better than OCFS, CHI and IG on this balanced dataset. For Micro F1, the improvement figure is 5.4%, 9.1% and 6% at most by DBFS-Doc compared to OCFS, CHI and IG, respectively. The highest Micro F1 by DBFS-Doc is as good as that by OCFS, which is about 0.015 higher than that by CHI and IG. From the above experimental results and analysis, we can conclude that the DBFS implementations proposed in this paper are more effective than OCFS, CHI and IG on both balanced and unbalanced datasets.
5 Discussion
Performance on different types of categories benefits from variations in the implementations of DBFS. Implementations computing F(t) by (3) perform better on rare categories but worse on common categories than those computing it by (4). This is because (3) assigns larger weights to rare categories than (4) does. Similarly, the weights in Dinter(t) have the same effect. So FS methods suitable for unbalanced data problems can be derived from DBFS by assigning large enough weights to rare categories. We discover that OCFS, MI, ECE, IG, CHI and OR fall into the DBFS framework. But these methods only consider the distribution difference of features among different categories. So some features that distribute unevenly among categories but also fluctuate greatly within categories may still be selected by these methods.
DBFS avoids selecting this kind of feature to some extent by also considering the distribution difference of features within categories. Besides, F(t) and the weight function in Dinter(t) can be flexibly designed for different applications. Moreover, feature distributions are also considered within documents. Furthermore, the distribution functions in our DBFS implementations can measure the distribution more precisely. Using a more precise F(t, d) and a more appropriate text representation model instead of the TFIDF model, the performance of our DBFS implementations can be improved further. Thus, our DBFS implementations are more effective than the above-mentioned methods.
6 Conclusion and Future Work
In this paper, we deduce the distribution characteristics of features contributive to text categorization from the rough set theory, and then propose a novel feature selection framework called DBFS based on these characteristics. We discover that OCFS, MI, ECE, IG, CHI and OR can be treated as implementations of DBFS that only consider Distribution Difference among Categories (DAC). In addition, this framework is significant in revealing the merits and drawbacks of existing methods. Moreover, the unified framework also gives insight to why some existing methods perform better than others. Furthermore, we also give a weighted model and demonstrate the influence of the weight functions. The methods derived from the weighted model outperform OCFS, IG and CHI on both balanced and unbalanced datasets, and also on both English and Chinese corpora. Thus, it is expected that suitable feature selection methods can be derived from DBFS for specific applications. One interesting future direction of this research is to design suitable feature selection methods adaptively for a specific domain. Besides, DBFS does not consider the correlation among different features. Taking this into account is also a valuable direction for future improvements.
Acknowledgement. The work is supported by the China National 863 Key Project (2006AA010105), the China National 973 Project (2007CB311103) and the Natural Science Foundation (60873166, 60603094). We would like to thank the anonymous reviewers for their helpful comments on this paper.
References

1. Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., Mahoney, M.W.: Feature Selection Methods for Text Classification. In: ACM SIGKDD, pp. 230–239 (2007)
2. Doumpos, M., Salappa, A.: Feature selection algorithms in classification problems: an experimental evaluation. In: AIKED, pp. 1–6 (2005)
3. Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 1289–1305 (2003)
4. Hong, J., Cho, S.: Efficient huge-scale feature selection with speciated genetic algorithm. Pattern Recognition Letters, 143–150 (2006)
5. Hong, S.J.: Use of Contextual Information for Feature Ranking and Discretization. IEEE Transactions on Knowledge and Data Engineering 9(5), 718–730 (1997)
6. How, B.C., Kulathuramaiyer, N., Kiong, W.T.: Categorical term descriptor: A proposed term weighting scheme for feature selection. In: IEEE/WIC/ACM WI, pp. 313–316 (2005)
7. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
8. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: ICML, pp. 121–129 (1994)
9. Lang, K.: NewsWeeder: Learning to filter netnews. In: ICML, pp. 331–339 (1995)
10. Langley, P.: Selection of relevant features in machine learning. In: AAAI Fall Symposium on Relevance, pp. 140–144 (1994)
11. Legrand, G., Nicoloyannis, N.: Feature Selection Method Using Preferences Aggregation. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS, vol. 3587, pp. 203–217. Springer, Heidelberg (2005)
12. Li, S., Zong, C.: A new approach to feature selection for text categorization. In: IEEE NLP-KE, pp. 626–630 (2005)
13. Li, F., Guan, T., Zhang, X., Zhu, X.: An Aggressive Feature Selection Method based on Rough Set Theory. Innovative Computing, Information and Control, 176–179 (2007)
14. Liu, Y., Zheng, Y.F.: FS_SFS: A novel feature selection method for support vector machines. Pattern Recognition 39, 1333–1345 (2006)
15. Luo, S., Corne, D.: Feature selection strategies for poorly correlated data: correlation coefficient considered harmful. In: AIKED, pp. 226–231 (2008)
16. Mak, M., Kung, S.: Fusion of feature selection methods for pairwise scoring SVM. Neurocomputing 71, 3104–3113 (2008)
17. Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: ICML, pp. 258–267 (1999)
18. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 1226–1238 (2005)
19. Perner, P.: Improving the Accuracy of Decision Tree Induction by Feature Pre-Selection. Applied Artificial Intelligence 15(8), 747–760 (2001)
20. Polkowski, L., Tsumoto, S., Lin, T.Y.: Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems. Springer, Heidelberg (2000)
21. Robnik-Šikonja, M., Kononenko, I.: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning Journal 53, 23–69 (2003)
22. Yan, J., Liu, N., Zhang, B.: OCFS: Optimal orthogonal centroid feature selection for text categorization. In: ACM SIGIR, pp. 122–129 (2005)
23. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: ACM SIGIR, pp. 42–49 (1999)
24. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML, pp. 412–420 (1997)
25. Yu, L., Ding, C., Loscalzo, S.: Stable feature selection via dense feature groups. In: ACM SIGKDD, pp. 803–811 (2008)
26. Zhao, P., Liu, P.: Spectral feature selection for supervised and unsupervised learning. In: ICML, pp. 1151–1157 (2007)
27. Zhao, T., Lu, J., Zhang, Y., Xiao, Q.: Feature Selection Based on Genetic Algorithm for CBIR. In: CISP, pp. 495–499 (2008)
28. Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 80–89 (2004)
29. Zhou, Q., Zhao, M., Hu, M.: Study on feature selection in Chinese text categorization. Journal of Chinese Information Processing 18, 17–23 (2004)
New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps

Walaa K. Gad and Mohamed S. Kamel

Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
{walaakh,mkamel}@pami.uwaterloo.ca
Abstract. Most text clustering techniques are based on the weights of words and/or phrases in the text. Such a representation is often unsatisfactory because it ignores the relationships between terms and considers them as independent features. In this paper, a new semantic similarity based model (SSBM) is proposed. The semantic similarity based model computes semantic similarities by utilizing WordNet as an ontology. The proposed model captures the semantic similarities between documents that contain semantically similar terms that are not necessarily syntactically identical. The semantic similarity based model assigns a new weight to document terms reflecting the semantic relationships between terms that co-occur literally in the document. Our model, in conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm, addresses the ambiguity and synonymy problems that are not detected by traditional term frequency based text mining techniques. The proposed model is evaluated on the Reuters-21578 and the 20Newsgroups text collection datasets. The performance is assessed in terms of the F-measure, Purity and Entropy quality measures. The obtained results show promising performance improvements compared to the traditional term based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.

Keywords: Gloss overlaps measure, semantic similarity measures, WordNet, word sense disambiguation, text clustering.
1 Introduction
Text clustering is an unsupervised learning method which groups text documents into related clusters and discovers hidden knowledge between clusters. Text clustering has been applied to many applications such as indexing, information retrieval, browsing large document collections and mining text data on the Web. Most text clustering methods use the traditional term based vector space model (VSM). The VSM is based on frequency weights to reflect the terms'
importance in documents. As terms might be semantically similar but lexicographically different, clustering based on term frequencies may fail to group related documents. Thus, many researchers have suggested applying semantic relatedness and semantic measures to document clustering. Semantic relatedness refers to the degree to which given terms are semantically related. All semantic relatedness and semantic measures are based on lexical ontologies to extract the relatedness between documents. Semantic similarity is a kind of relatedness [1]. Semantic relatedness covers relationships between concepts (terms) that include similarity as well as lexical relationships such as meronymy and antonymy. Combining ontologies as background knowledge has shown that clustering performance can be improved. For example, WordNet senses have been used to augment document feature vectors with term synonyms and achieve better results than the traditional term based VSM [2,3,4]. A similar technique is adopted by mapping document terms to their senses and constructing document feature vectors to represent the documents. The clustering performance is improved, but statistical analysis showed that this improvement is not significant.
In this paper, different semantic similarity measures are reviewed. A new semantic similarity based model (SSBM) using the extended gloss overlaps measure is proposed for document clustering. The SSBM is proposed to add new semantic weights to document terms. The SSBM updates frequency weights by including the semantic similarity values between terms, and assigns higher semantic weights to terms that are semantically close. The SSBM uses the extended gloss overlaps measure and the adapted Lesk algorithm to obtain the semantic relatedness of term pairs. The SSBM addresses the ambiguity and synonymy problems that lead to erroneous and unnoticed similarities between text documents. Results show that the SSBM yields a significant clustering improvement. The SSBM outperforms the term based vector space model (VSM) as well as other methods that use semantic similarities. We perform the clustering using the bisecting k-means and k-means algorithms and assess the clustering performance in terms of the F-measure, Purity and Entropy performance measures.
The rest of the paper is organized as follows. Section 2 introduces a brief review of relevant semantic similarity measures. The proposed semantic similarity based model is presented in Section 3. Test data, evaluation methods, and results are described in Section 4. Finally, conclusions are discussed in Section 5.
The following are the basic notations used in this paper:
len(ci, cj): the length of the shortest path in WordNet from concept ci to cj.
depth(ci): the depth of the node ci, i.e., the path length from ci to the root.
lso(c1, c2): the lowest super-ordinate, or most specific common subsumer, of c1 and c2.
rel(c1, c2): the semantic relatedness between two concepts c1 and c2.
POS: part of speech (noun, verb, adverb and adjective).
wji: the frequency weight of term i in document j.
w̃ji: the semantic weight of term i in document j.
2 Semantic Similarity Measures
Most semantic similarity measures have been used in conjunction with WordNet. WordNet [6] is an online lexical reference system developed at Princeton University. WordNet is organized into taxonomic hierarchies (ontologies). Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets), and a synset corresponds to one sense, so the same term may belong to several synsets with different meanings. Synsets are related to other synsets higher or lower in the hierarchy by different types of relationships; the most common are the Hyponym/Hypernym (Is-A) and the Meronym/Holonym (Part-Of) relationships. We classify the semantic measures into the following categories and introduce the popular measures for each category. Table 1 shows a general classification of relatedness and similarity measures with the advantages and disadvantages of each measure.
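The WordNet structure described above can also be inspected programmatically. The following short sketch uses the NLTK WordNet interface, which is only an illustrative tool choice and is not part of the original work; the word "bank" is an arbitrary example.

```python
# Illustrative only: browsing WordNet synsets and relations with NLTK.
# Requires: pip install nltk   and   nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# A polysemous term belongs to several synsets, one per sense.
for synset in wn.synsets('bank', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

river_bank = wn.synset('bank.n.01')     # "sloping land ... beside a body of water"
print(river_bank.hypernyms())           # Hypernyms: Is-A parents (more general)
print(river_bank.hyponyms())            # Hyponyms: Is-A children (more specific)
print(river_bank.part_meronyms())       # Meronyms: Part-Of relations
```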
2.1 Path Length Based Measures
These measures are based on viewing the taxonomy as a graph. The relatedness between concepts is determined as a function of the shortest path length between them. Hirst and St-Onge [7] considered many relations in WordNet for all parts of speech. Their measure classifies all WordNet relations as horizontal, upward, or downward. Upward relations connect more specific concepts to more general ones, downward relations join more general concepts to more specific ones, and horizontal relations maintain the same level of specificity. Hirst and St-Onge weighted the relation between concepts as:

rel_{HS}(c_1, c_2) = C - len(c_1, c_2) - k \times turns(c_1, c_2)

where C and k are constants (C=8 and k=2), and turns(c1, c2) is the number of times the path between c1 and c2 changes direction. Wu and Palmer [8] defined the similarity of two concepts by their positions in the taxonomy relative to the position of the most specific concept that subsumes both:

sim_{W\&P}(c_1, c_2) = \frac{2H}{N_1 + N_2 + 2H}

where H is depth(lso(c1, c2)) and N_i is len(c_i, lso(c_1, c_2)). Li et al. [9] combined the shortest path and the depth in the ontology in a nonlinear function:

sim_{Li}(c_1, c_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}
where L stands for the shortest path between the two concepts and α and β are parameters scaling the contribution of the shortest path length and the depth, respectively. Leacock and Chodorow [10] measured similarity based on the lengths of paths between noun concepts. The shortest path is the one which includes the fewest intermediate concepts. This value is scaled by the depth of the hierarchy (the distance of the farthest node from the root node):

sim_{L\&C}(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2 \times \max_{c \in WordNet} depth(c)}
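As a point of reference, NLTK ships implementations of several of these path based measures; the sketch below is illustrative only and its constants and normalizations may differ slightly from the formulas above.

```python
from nltk.corpus import wordnet as wn

c1, c2 = wn.synset('car.n.01'), wn.synset('bicycle.n.01')

print(c1.path_similarity(c2))   # simple shortest-path based score
print(c1.wup_similarity(c2))    # Wu & Palmer, based on depth(lso(c1, c2))
print(c1.lch_similarity(c2))    # Leacock & Chodorow, -log(len / (2 * max depth))
```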
2.2 Information Based Measures
These measures attempt to counter the problems of path length methods by incorporating an additional knowledge source, namely information content derived from a corpus. Resnik [11] assumed that the semantic relatedness of two concepts is proportional to the amount of information they share. The quantity of common information of two concepts is determined by the information content of the lowest concept in the hierarchy that subsumes both. The information content of concept c is given by

IC(c) = -\log p(c), \qquad p(c) = \frac{\sum_{w \in W(c)} count(w)}{N}

where W(c) is the set of words (nouns) in the corpus whose senses are subsumed by concept c, and N is the total number of word (noun) tokens in the corpus that are also present in WordNet. Resnik's measure is then

sim_{Res}(c_1, c_2) = IC(lso(c_1, c_2)) = -\log p(lso(c_1, c_2))

In this measure, neither the information content nor the path length of the concepts themselves is considered. This leads to quite a few concept pairs having the same least common subsumer and therefore being assigned identical similarity values. Jiang and Conrath [12] defined a hybrid approach that combines the path length between concepts with information content. This measure includes the information content of the concepts themselves along with the information content of their lowest common subsumer:

sim_{J\&C}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \times IC(lso(c_1, c_2))

Lin [13] measured the similarity of two concepts by the ratio of the amount of information needed to state the commonality of the two concepts to the amount of information needed to describe them. The commonality of two concepts is captured by the information content of their lowest common subsumer, and the concepts themselves by their own information content:

sim_{Lin}(c_1, c_2) = \frac{2 \times IC(lso(c_1, c_2))}{IC(c_1) + IC(c_2)}
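The information based measures can likewise be tried out with NLTK and pre-computed information content files; again this is only an illustrative sketch, not the setup used in the paper.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus
# (requires nltk.download('wordnet_ic')).
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1, c2 = wn.synset('car.n.01'), wn.synset('bicycle.n.01')
print(c1.res_similarity(c2, brown_ic))   # Resnik: IC of the lowest common subsumer
print(c1.jcn_similarity(c2, brown_ic))   # Jiang & Conrath (NLTK returns 1/distance)
print(c1.lin_similarity(c2, brown_ic))   # Lin: 2*IC(lso) / (IC(c1) + IC(c2))
```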
2.3 Feature Based Measures
These measures assume that each term is described by a set of terms indicating its properties or features [14]. The more characteristics two terms have in common and the fewer non-common characteristics they have, the more similar the terms are. Knappe et al. [15] defined a similarity measure using the generalization and specification information of the two compared concepts:

sim_{Knappe}(c_1, c_2) = p \times \frac{|Ans(c_1) \cap Ans(c_2)|}{|Ans(c_1)|} + (1 - p) \times \frac{|Ans(c_1) \cap Ans(c_2)|}{|Ans(c_2)|}
where Ans(c1) and Ans(c2) correspond to the description sets (the ancestor nodes) of terms c1 and c2, and p defines the relative importance of generalization vs. specification and ranges over [0, 1].

Table 1. Comparison of Relatedness and Similarity Measures

| Type | Measure | Principle | Advantages | Disadvantages |
| Path based | Wu & Palmer | The path length | Simplicity | WordNet nouns only; IS-A relations |
| Path based | Leacock & Chodorow | The shortest path | Simplicity | WordNet nouns only; IS-A relations |
| Path based | Hirst & St-Onge | The short path relative to direction change | Includes all POS; more than IS-A relations; not restricted to nouns | WordNet only |
| Information Content based | Resnik | Shared information | Uses information content based on corpora | Does not use the IC of individual concepts; WordNet only; IS-A relations |
| Information Content based | Jiang & Conrath | Information content & path length | Includes the IC of concepts & their lso | WordNet nouns only; IS-A relations |
| Information Content based | Lin | The commonality & IC | Includes the IC of concepts & their lso | WordNet nouns only; IS-A relations |
| Feature based | Knappe | Features & properties | Targeted toward cross ontology | Does not exploit the structure & IC |

3 Semantic Similarity Based Model (SSBM)

3.1 Term Weighting
The traditional term based vector space model (VSM) represents documents by word and/or phrase frequencies [16,17]. The frequency reflects the importance
of the terms in the document. However, the lack of common terms between documents does not necessarily mean dissimilarity: some terms may not be syntactically identical but are semantically similar. To capture this semantic similarity, the semantic similarity based model (SSBM) adds a new weight to the term frequency to reflect the semantic relationships between terms. The SSBM defines a term as a stemmed non-stop word, using the Porter stemmer [18]. Stemming is only performed for terms that do not appear in WordNet as lexical entries; for the terms that do appear in WordNet, we rely on its morphological capabilities to improve the results. Then, infrequent terms are eliminated by pruning, because rare terms may add noise and do not help in discovering appropriate clusters. The SSBM assigns and adjusts the weight of a term based on its relationships with semantically similar terms that occur together in the document. The new semantic weight of term i1 in document j is defined as:

\tilde{w}_{j i_1} = w_{j i_1} + \sum_{\substack{i_2 = 1 \\ i_2 \neq i_1}}^{m} w_{j i_2} \cdot sim_{AdaptedLesk}(i_1, i_2)
where w_{j i_1} is the frequency weight of term i1 in document j, sim_{AdaptedLesk}(i_1, i_2) is the semantic relatedness between terms i1 and i2 obtained using the extended gloss overlaps measure and the adapted Lesk algorithm, and m is the number of terms in document vector j. This step updates the original term frequency weight and takes into account the semantic relation between each pair of terms. We use the term frequency and inverse document frequency (tf.idf) to compute the frequency weights. The tf.idf of term i in document j is defined by [19]:

tf.idf(j, i) = \log(tf(j, i) + 1) \cdot \log\left(\frac{|D|}{df(i)}\right)
where df(i) is the document frequency of term i, i.e., the number of documents in which term i appears, tf(j, i) is the number of times term i appears in document j, and |D| is the number of documents in the collection. We use tf.idf because it assigns larger weights to terms that appear relatively rarely throughout the corpus but very frequently in individual documents. The tf.idf weighting yields a 14% improvement in recall and precision in comparison to the standard term frequency tf [19]. Steinbach et al. [20] showed that each class is characterized by a set of core words, while the remaining words are overly general terms. The SSBM reduces the effect of the general words that may hurt clustering performance by emphasizing the importance of the core words and assigning higher semantic weights to terms that are semantically related.
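A compact sketch of the weighting step described above is given below; it assumes the tf.idf weights and a symmetric term-term relatedness matrix are already available, so the matrix S is a hypothetical stand-in for the Adapted Lesk relatedness values.

```python
import numpy as np

def semantic_weights(W, S):
    """Apply w~_{j,i1} = w_{j,i1} + sum_{i2 != i1} w_{j,i2} * sim(i1, i2).

    W : (documents x terms) matrix of tf.idf frequency weights
    S : (terms x terms) symmetric matrix of term-term semantic relatedness
    """
    S = S.copy()
    np.fill_diagonal(S, 0.0)   # exclude the i2 == i1 term from the sum
    return W + W @ S

# Toy example: 2 documents, 3 terms; terms 0 and 2 are semantically related (0.8).
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
S = np.array([[1.0, 0.0, 0.8],
              [0.0, 1.0, 0.0],
              [0.8, 0.0, 1.0]])
print(semantic_weights(W, S))
```

Because the relatedness matrix is symmetric, the whole update reduces to a single matrix product, which keeps the re-weighting cheap compared to computing the pairwise relatedness values themselves.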
3.2 The Extended Gloss Overlaps
The gloss overlaps were introduced by Lesk [21] to solve word sense disambiguation problems. The Lesk algorithm disambiguates words in short phrases. The
gloss (dictionary definition) of each sense of a word is compared to the glosses of each word in the phrase. All the words occurring in the sense definition compose the sense bag. The sense whose gloss has the most words in common with the glosses of the neighboring words is chosen as the appropriate sense. The original Lesk algorithm only considers overlaps among the glosses of the word and those that surround it in the given context, and relies on glosses found in traditional dictionaries such as the Oxford Advanced Learner's Dictionary. This is a significant limitation, because dictionary glosses tend to be short and do not provide sufficient vocabulary to make distinctions in relatedness. Banerjee et al. [22] adapted and generalized the original Lesk algorithm. Rather than using a standard dictionary as the source of the glosses, the lexical ontology WordNet is employed. They extend the overlaps in the glosses of neighboring words to include the glosses of words that are related to the words in the text being disambiguated. The extended gloss overlaps measure combines the advantages of gloss overlaps with the structure of a concept hierarchy to create an extended view of relatedness between synsets. It generalizes the Adapted Lesk algorithm and computes the relatedness between concepts by comparing the glosses of synsets that are related to them through the relations of WordNet. The relatedness is based not only on overlaps between the glosses themselves but also on overlaps with the glosses of the hypernym, hyponym, meronym, holonym and troponym synsets of the input synsets, as well as synsets related to the input synsets through the attribute, similar-to, and also-see relations. We use the extended gloss overlaps measure and the adapted Lesk algorithm in conjunction with the SSBM to disambiguate the sense of a polysemous word and obtain the semantic similarities between terms, for the following reasons (a simplified sketch of the gloss-overlap computation follows the list):
– The other similarity measures have significant limitations, as shown in Table 1:
  • Methods that consider the positions of the terms in the hierarchy perform better than pure path length methods.
  • Feature based methods are targeted at comparing terms from different ontologies (WordNet and MeSH). MeSH is an ontology of medical terms developed by the US National Library of Medicine.
– It exploits the highly interconnected relationships that WordNet offers.
– It shows good performance in word sense disambiguation compared to other semantic similarity measures [23].
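The sketch below illustrates the gloss-overlap idea in a strongly simplified form: it merely counts words shared by the extended glosses, whereas the Adapted Lesk measure of Banerjee and Pedersen scores contiguous phrasal overlaps; the particular WordNet relations included are also only an assumption for illustration.

```python
from nltk.corpus import wordnet as wn

def extended_gloss(synset):
    """Collect the words of a synset's gloss and of the glosses of related synsets."""
    related = ([synset] + synset.hypernyms() + synset.hyponyms()
               + synset.part_meronyms() + synset.member_holonyms()
               + synset.also_sees() + synset.similar_tos())
    words = set()
    for s in related:
        words.update(s.definition().lower().split())
    return words

def gloss_overlap_relatedness(s1, s2):
    """Crude relatedness score: number of words shared by the extended glosses."""
    return len(extended_gloss(s1) & extended_gloss(s2))

print(gloss_overlap_relatedness(wn.synset('car.n.01'), wn.synset('engine.n.01')))
```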
3.3 Document Similarity
We adopt the cosine similarity measure to calculate the cosine of the angle between two document vectors d_{j1} and d_{j2}:

cos(d_{j_1}, d_{j_2}) = \frac{d_{j_1} \cdot d_{j_2}}{\|d_{j_1}\| \cdot \|d_{j_2}\|} = \frac{\sum_{i=1}^{m} \tilde{w}_{j_1 i} \cdot \tilde{w}_{j_2 i}}{\sqrt{\sum_{i=1}^{m} \tilde{w}_{j_1 i}^2} \cdot \sqrt{\sum_{i=1}^{m} \tilde{w}_{j_2 i}^2}}
where w̃_{j1 i} represents the semantic weight of term i in document d_{j1} and w̃_{j2 i} the semantic weight of term i in document d_{j2} in our proposed model. The similarity measure above takes values in [0, 1].
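In code, the document similarity reduces to a standard cosine over the semantically re-weighted vectors; the dense toy vectors below are purely illustrative (in practice the document vectors are sparse).

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two semantically weighted document vectors."""
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2 / denom) if denom > 0 else 0.0

d_j1 = np.array([0.8, 0.0, 1.3, 0.4])   # semantic weights of document j1
d_j2 = np.array([0.5, 0.2, 1.1, 0.0])   # semantic weights of document j2
print(cosine_similarity(d_j1, d_j2))    # in [0, 1] for non-negative weights
```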
4 Experimental Analysis

4.1 Datasets
We used the Reuters-21578 and 20-Newsgroups text document collections to assess the clustering quality of the proposed model. Both collections are general domain and compatible with WordNet, which has a general vocabulary and is not tailored to a specific domain. In addition, Reuters-21578 and 20-Newsgroups have been used in comparable studies before and are well investigated datasets. For Reuters-21578, we performed some operations to prepare our base corpus. We selected only the documents that are assigned to a single topic and discarded all documents with an empty document body. This resulted in a base corpus of 12902 documents, with 9603 documents in the training set and 3299 documents in the test set; 8676 documents were ignored. Fig. 1 shows the distribution of classes in the base corpus. The two biggest categories contain about two thirds of all the documents. This unbalanced distribution could lead to biased results due to the contribution of the two main categories. We therefore derived new corpora from the base corpus by restricting the maximum category size to 20, 50 and 100 documents; categories with more than 100 documents were reduced by sampling. We derived these datasets to evaluate the SSBM performance under different configurations of cluster sizes.
Fig. 1. Classes distribution for the base corpus
Table 2. Summary of Reuters-21578 datasets

| Dataset | Minimum #docs | Maximum #docs | Total #docs | k |
| Reut01 | 15 | 20 | 815 | 41 |
| Reut02 | 20 | 50 | 1611 | 38 |
| Reut03 | 15 | 100 | 2619 | 60 |
| Reut04 | 100 | 100 | 1300 | 13 |
Fig. 2. Classes distribution for reut01
Table 2 shows the minimum and maximum number of documents in each category, the total number of documents that the derived datasets contain, and k (the number of classes). Fig. 2 to Fig. 5 show the distributions of classes in the derived datasets. Similar to [4], we extracted four datasets from the 20-Newsgroups text collection. Table 3 lists the derived datasets. The datasets News01 and News03 contain categories with different topics, while News02 and News04 consist of categories with similar topics.
4.2 Evaluation Measures
We evaluated the effectiveness of the SSBM using three clustering quality measures: Fmeasure, Purity and Entropy [20]. Fmeasure combines the Precision and Recall measures. Precision is the percentage of relevant documents retrieved with respect to the number of retrieved documents. Recall is the percentage of relevant documents retrieved with respect to the total number of relevant documents in the dataset.
Fig. 3. Classes distribution for reut02
Fig. 4. Classes distribution for reut03
The precision and recall of a cluster c ∈ C for a given class ℓ ∈ L are given by P(c, ℓ) = |c ∩ ℓ| / |c| and R(c, ℓ) = |c ∩ ℓ| / |ℓ|, respectively, and

Fmeasure(c, \ell) = \frac{2 \, P(c, \ell) \, R(c, \ell)}{P(c, \ell) + R(c, \ell)}

where |c ∩ ℓ| is the number of documents belonging to both cluster c and class ℓ, |c| is the size of cluster c, and |ℓ| is the size of class ℓ. The second measure is the Purity.
Fig. 5. Classes distribution for reut04
The overall value for Purity is computed by taking the weighted average of the maximal precision values:

Purity(C, L) = \sum_{c \in C} \frac{|c|}{|D|} \max_{\ell \in L} P(c, \ell)

The third measure is the Entropy, which measures how homogeneous a cluster is: the higher the homogeneity of a cluster, the lower its Entropy, and vice versa. The entropy of cluster c is

E(c) = -\sum_{\ell \in L} P(c, \ell) \log P(c, \ell)

and the entropy of the whole clustering is the sum of the entropies of the clusters, each weighted by its size:

E(C) = \sum_{c \in C} \frac{|c|}{|D|} E(c)
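A small sketch of how Purity and Entropy can be computed from true class labels and predicted cluster labels is shown below (the Fmeasure follows analogously from the per-cluster precision and recall); it is an illustration of the formulas above, not the authors' evaluation code.

```python
import numpy as np

def purity_and_entropy(labels_true, labels_pred):
    """Compute Purity(C, L) and E(C) from class and cluster label arrays."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(labels_true)
    classes = np.unique(labels_true)
    purity, entropy = 0.0, 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        p = np.array([np.mean(members == l) for l in classes])  # P(c, l)
        purity += len(members) / n * p.max()
        p = p[p > 0]
        entropy += len(members) / n * -(p * np.log(p)).sum()    # weighted E(c)
    return purity, entropy

print(purity_and_entropy([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```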
Table 3. Summary of 20-Newsgroups datasets

| Category | News01 (#docs) | News03 (#docs) |
| comp.graphics | 100 | 120 |
| rec.sport.baseball | 100 | 100 |
| sci.space | 100 | 59 |
| talk.politics.mideast | 100 | 20 |

| Category | News02 (#docs) | News04 (#docs) |
| comp.graphics | 100 | 120 |
| comp.os.ms-windows | 100 | 100 |
| rec.autos | 100 | 59 |
| sci.electronics | 100 | 20 |
4.3 Results and Analysis
We implemented the bisecting kmeans and kmeans techniques to test the effect of the proposed semantic similarity based model on text document clustering. Each evaluation result is an average over 20 runs, to alleviate the effect of random factors; the reported Fmeasure, Purity and Entropy values are the averages over these 20 runs. Our objective is to maximize Fmeasure and Purity and to minimize Entropy. We compared the results of our semantic similarity based model (SSBM) to the term based vector space model (VSM) as a baseline. Both the VSM and the SSBM use the same preprocessing steps: stop word removal, stemming and pruning. The behavior of the SSBM is evaluated on different dataset configurations using the reut01 to reut04 and News01 to News04 datasets. We use bisecting kmeans for reut01 to reut04 and kmeans for News01 to News04 to compare with the methods introduced in [2,4].
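For completeness, a compact sketch of the bisecting kmeans procedure is shown below, built on scikit-learn's KMeans; the authors' implementation details (for example, the criterion for choosing which cluster to split) are not specified in the paper, so splitting the largest cluster is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    """Repeatedly split the largest cluster in two until k clusters are obtained."""
    clusters = [np.arange(X.shape[0])]   # start with one cluster of all documents
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    labels = np.empty(X.shape[0], dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```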
Table 4 shows the results of the proposed semantic similarity based model (SSBM) and the term based vector space model (VSM) in terms of the Fmeasure, Purity and Entropy clustering quality measures for the Reuters-21578 and 20-Newsgroups text documents.

Table 4. Clustering results of the SSBM and the VSM in terms of Fmeasure, Purity, and Entropy performance measures (avg ± stdev)

| Dataset | Fmeasure VSM | Fmeasure SSBM | Purity VSM | Purity SSBM | Entropy VSM | Entropy SSBM |
| Reut01 | 0.58±0.017 | 0.65±0.015 | 0.57±0.016 | 0.66±0.013 | 0.28±0.012 | 0.23±0.008 |
| Reut02 | 0.56±0.017 | 0.61±0.015 | 0.58±0.014 | 0.64±0.012 | 0.33±0.013 | 0.29±0.012 |
| Reut03 | 0.56±0.015 | 0.63±0.012 | 0.57±0.019 | 0.67±0.014 | 0.25±0.014 | 0.20±0.011 |
| Reut04 | 0.66±0.013 | 0.76±0.017 | 0.70±0.013 | 0.81±0.012 | 0.36±0.014 | 0.30±0.013 |
| News01 | 0.82±0.017 | 0.9±0.012 | 0.82±0.016 | 0.91±0.015 | 0.32±0.017 | 0.22±0.016 |
| News02 | 0.5±0.015 | 0.63±0.013 | 0.49±0.014 | 0.64±0.014 | 0.8±0.015 | 0.65±0.014 |
| News03 | 0.61±0.018 | 0.72±0.014 | 0.7±0.016 | 0.79±0.015 | 0.43±0.017 | 0.33±0.014 |
| News04 | 0.46±0.014 | 0.52±0.012 | 0.49±0.017 | 0.54±0.014 | 0.74±0.013 | 0.57±0.012 |
The experimental results show that the proposed semantic similarity based model (SSBM) improves the clustering quality over the traditional term based VSM for all the datasets. The SSBM is more effective than the VSM due to the contribution of non-identical but semantically similar terms, whereas the VSM relies on lexicographic term matching and ignores the relationships between terms. In addition, we compared the SSBM performance to other methods that introduced semantic similarities into text clustering. The methods of [2,4] integrated WordNet into document clustering and enriched the document vectors with term synonyms. We used the same dataset setup for comparison. Reut03 is used to evaluate the performance of bisecting kmeans text clustering based on the background knowledge method [2] in terms of the Purity measure, while News01 to News04 were used to test the performance of the kmeans and FW-kmeans (Feature Weighting kmeans) text document clustering techniques based on the ontology method [4] in terms of the Fmeasure and Entropy measures.
Table 5. Relative improvements of the SSBM in terms of Fmeasure, Purity and Entropy performance measures

| Clustering | Dataset | Fmeasure | Purity | Entropy |
| Bisecting kmeans | Reut01 | 12.07% | 14.98% | 17.85% |
| Bisecting kmeans | Reut02 | 8.93% | 10.34% | 12.12% |
| Bisecting kmeans | Reut03 | 12.5% | 17.54% | 20.64% |
| Bisecting kmeans | Reut04 | 15.15% | 15.71% | 16.67% |
| kmeans | News01 | 9.75% | 10.97% | 31.25% |
| kmeans | News02 | 26% | 30.61% | 18.75% |
| kmeans | News03 | 18.03% | 12.85% | 23.25% |
| kmeans | News04 | 13.04% | 10.20% | 22.97% |

Table 6. Relative improvements of the background knowledge and ontology similarity methods [2,4]

| Dataset | Quality measure | Background (Bisecting kmeans) | Ontology (kmeans) | Ontology (FW-kmeans) |
| Reut03 | Purity | 8.4% | - | - |
| News01 | Fmeasure | - | 4.80% | 4.38% |
| News01 | Entropy | - | 5.71% | 16.02% |
| News02 | Fmeasure | - | 7.35% | 6.18% |
| News02 | Entropy | - | 9.89% | 17.24% |
| News03 | Fmeasure | - | 0.88% | 0.69% |
| News03 | Entropy | - | 4.10% | 4.75% |
| News04 | Fmeasure | - | 4.91% | 4.61% |
| News04 | Entropy | - | 13.12% | 18.37% |
Table 5 and Table 6 show the relative improvements of our proposed model (SSBM) and of the other similarity methods. The experimental results show that the SSBM yields a significant improvement compared to the VSM and the other similarity methods. The percentage of improvement ranges from 10% to 26% in Fmeasure and from 10% to 31% in Purity, with a drop of 19% to 31% in Entropy (lower is better for Entropy). The reason behind this improvement is that the model is less sensitive to noise when it calculates the similarities between document vectors, because of the new semantic weights that are assigned to document terms. These semantic weights are based on the document context using the gloss overlaps measure, and higher weights are assigned to terms that are semantically related. The SSBM disambiguates the senses of a polysemous word by choosing the correct sense of a
word based on its context, and captures the semantic importance of the terms that are related. In contrast, the other similarity methods [2,4] augment document vectors with term synonyms taken from WordNet. WordNet provides up to five senses for a term as synonyms, which results in extra overlaps between documents due to incorrect synonyms being added to the document vectors.
5 Conclusion
In this paper, we propose the semantic similarity based model (SSBM) to improve text clustering quality. The model incorporates a lexical ontology into the document representation to extract the meaning and semantics of documents. The SSBM groups documents based on text meaning, using the extended gloss overlaps measure and the adapted Lesk algorithm, rather than on term frequencies alone. The SSBM assigns new weights to terms to reflect the semantic similarities between co-occurring terms in the documents. The model addresses the ambiguity and synonymy problems and reveals hidden similarities between documents thanks to the contribution of semantically similar terms as well as insensitivity to noisy terms. We evaluated the proposed SSBM using the Reuters-21578 and 20-Newsgroups datasets in terms of the Fmeasure, Purity and Entropy performance measures with the kmeans and bisecting kmeans clustering techniques. The evaluation demonstrates very promising performance improvements compared to the traditional term based vector space model and other methods that include semantics in text document clustering.
References

1. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics 32, 13–47 (2006)
2. Hotho, A., Staab, S., Stumme, G.: WordNet Improves Text Document Clustering. In: SIGIR 2003 Semantic Web Workshop, pp. 541–544 (2003)
3. Sedding, J., Kazakov, D.: WordNet-based Text Document Clustering. In: COLING 2004 3rd Workshop on Robust Methods in Analysis of Natural Language Data, pp. 104–113 (2004)
4. Jing, L., Zhou, L., Ng, M., Huang, Z.: Ontology-based Distance Measure for Text Clustering. In: SIAM SDM Workshop on Text Mining (2003)
5. Wang, Y., Hodges, J.: Document Clustering with Semantic Analysis. In: The 39th Annual Hawaii International Conference on System Sciences, HICSS 2006, vol. 3, p. 54c (2006)
6. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
7. Hirst, G., St-Onge, D.: Lexical Chains as Representation of Context for the Detection and Correction of Malapropisms. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database and Some of its Applications, pp. 305–332. MIT Press, Cambridge (1997)
8. Wu, Z., Palmer, M.: Verb Semantics and Lexical Selection. In: The 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133–138 (1994)
9. Li, Y., Bandar, Z.A., McLean, D.: An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Transactions on Knowledge and Data Engineering 15, 871–882 (2003)
10. Leacock, C., Chodorow, M.: Combining Local Context and WordNet Similarity for Word Sense Identification. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press, Cambridge (1998)
11. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In: The 14th International Joint Conference on Artificial Intelligence, pp. 448–453 (1995)
12. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: International Conference on Research in Computational Linguistics, pp. 19–33 (1997)
13. Lin, D.: An Information-theoretic Definition of Similarity. In: The 15th International Conference on Machine Learning, pp. 296–304 (1998)
14. Tversky, A.: Features of Similarity. Psychological Review 84, 327–352 (1977)
15. Knappe, R., Bulskov, H., Andreasen, T.: Perspectives on Ontology-based Querying. International Journal of Intelligent Systems 22, 739–761 (2007)
16. Shehata, S., Karray, F., Kamel, M.: A Concept-Based Model for Enhancing Text Categorization. In: The 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 629–637 (2007)
17. Hammouda, K., Kamel, M.: Efficient Phrase-based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16, 1279–1296 (2004)
18. Porter, M.: An Algorithm for Suffix Stripping. Program 14, 130–137 (1980)
19. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
20. Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. In: KDD Workshop on Text Mining (2000)
21. Lesk, M.: Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In: The ACM SIGDOC Conference, pp. 24–26 (1986)
22. Banerjee, S., Pedersen, T.: Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In: Computational Linguistics and Intelligent Text Processing (2002)
23. Patwardhan, S., Banerjee, S., Pedersen, T.: Using Measures of Semantic Relatedness for Word Sense Disambiguation. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 241–257. Springer, Heidelberg (2003)
Learning Betting Tips from Users’ Bet Selections

Erik Štrumbelj, Marko Robnik-Šikonja, and Igor Kononenko

University of Ljubljana, Faculty of Computer and Information Science
Tržaška 25, 1000 Ljubljana, Slovenia
{erik.strumbelj,marko.robnik,igor.kononenko}@fri.uni-lj.si
Abstract. In this paper we address the problem of using bet selections of a large number of mostly non-expert users to improve sports betting tips. A similarity based approach is used to describe individual users’ strategies and we propose two different scoring functions to evaluate them. The information contained in users’ bet selections improves on using only bookmaker odds. Even when only bookmaker odds are used, the approach gives results comparable to those of a regression-based forecasting model. Keywords: Machine learning, data mining, nearest neighbors, forecasting, sports betting.
1 Introduction
Since the introduction of sports, people have been compelled to bet on the outcome of sports events. Sports betting has shown unprecedented growth in the 19th and 20th centuries and is now a multi-billion dollar global industry. The world wide web and the introduction of online bookmakers have additionally contributed to the accessibility of sports betting and the availability of betting-related information. The growth of the betting industry has also resulted in a growing number of sports tipsters. Tipsters provide users with information about likely winners, either for free or for a fee. Their tips are based on insider information, an aggregation of publicly accessible information, or a combination of both. Tipsters range from knowledgeable friends and newspaper tipsters to online tipsters or tipster web-sites. Some tipster sites even allow their users to place bets as they would with a bookmaker, but virtual or play-money is used and efficient betting is stimulated by periodically rewarding the most successful users. These tipster-sites then try to use the information provided by their users to produce betting tips, usually forwarding the bet selections of their most successful users (for example, the users who achieve the highest return on their bets). In most cases, online tipsters provide tips on a daily basis and there is a required number of tips per day. In general, users with a high degree of sports-related knowledge or insider information are more likely to use that knowledge for actual betting; therefore, the quality level of tipster-site users is questionable.
We are interested in answering the following question: can useful information be extracted from tipster-site users’ bet selections and how? In this paper we offer some answers. As far as we know, no published related work deals directly with tipster sites, with combining bet selections from multiple users, or with combining selections across several different sports. Closely related work deals with analyzing the forecasting abilities of various sources and with incorporating bookmaker odds and other related data into sports forecasting. Andersson et al. analyzed the forecasting abilities of lay and expert users on the 2002 and 2006 Soccer World Cup competitions [1,2]. The results of both studies suggest that neither lay nor expert users perform significantly better than rules based on publicly available information, such as team world rankings. Boulier and Stekler [3] analyzed several seasons of NFL (National Football League) games. They found that the betting market (i.e., bookmaker odds) is the best forecaster, followed by probit regression based on publicly available team rankings. Tipsters’ forecasts are found to perform worse than naive forecasts. Forrest and Simmons [8] analyzed 1694 soccer games and found that newspaper tipsters fail to fully utilize publicly available information. In their study of the 2005 Wimbledon men’s tennis tournament, Scheibehenne and Bröder [10] showed that name recognition by laypeople performs at least as well as official rankings and experts when predicting the winner of a match; both experts and laypeople are outperformed by bookmaker odds. Song et al. [11] compared the performance of a large number of statistical models and expert tipsters on the 2000 and 2001 seasons of the NFL league. Their results indicate that neither models nor experts are significantly better predictors, and both are outperformed by betting odds. Therefore, related work does not offer any strong evidence that tipsters or naive users can outperform other sources of sports betting information. In fact, bookmaker odds are consistently reported to be significantly better than tipster-based forecasts. For this reason we base our approach both on bookmaker odds and on the “advice” of individual users. For each user that selected a particular bet, we select a neighborhood of similar past bets selected by that user. Such a neighborhood provides an insight into the user’s betting pattern, and with it we evaluate the quality of the user relative to the current bet. Only the bet selections of the best performing users are taken into account. We show that using the information contained in users’ bet selections improves the performance of our model. The model’s performance is compared with the performance of several other models, including a regression-based model, which is one of the most common methods used for predicting the outcome of sports events. As a minor contribution, we also investigate the quality of tipster-site users and the forecasting power of consensus among tipsters. The remainder of this paper is organized as follows. The next section formally introduces our approach. In Section 3 we describe a real-world data set and empirically compare several different approaches. In Section 4 we conclude the paper and offer suggestions for further work.
2 A Nearest Neighbors Based Approach

2.1 Notation
Each bet, b ∈ B, is based on an event in sports. We assume that each such event has two possible outcomes (does or does not occur), which we describe with r : B → {0, 1}. Let p : B → (0, 1) and q : B → (0, 1) describe the objective probability of the event’s occurrence and the probability implied by the offered odds, respectively. A sports tipster-site can be described with a set of users, U, and a selection function, s : U × B → {0, 1}, which indicates whether a user has selected a bet or not. At any given time, we can only learn from events which have already occurred, so a time component is required. We define t : B → ℕ so that the outcomes of the bets {b | b ∈ B ∧ t(b) < t(bi)} are already known at the time of our decision-making regarding bet bi ∈ B.
2.2 Example
As an illustrative example, the following soccer match and the corresponding odds (in decimal form) could be offered for selection:

| Match | home | draw | away |
| FC Barcelona - Real Madrid | 1.62 | 3.95 | 4.70 |
Three different bets are available for selection: b1 - home team (FC Barcelona) wins, b2 - draw, and b3 - away team (Real Madrid) wins. The offered odds indicate the payout (per unit invested). For example, if we bet 100 units on FC Barcelona and win, we get paid 162 units. The probabilities implied by these odds are q(b1) = 1/1.62 = 0.62, q(b2) = 1/3.95 = 0.25, and q(b3) = 1/4.70 = 0.21. Notice that these three events are collectively exhaustive and mutually exclusive, yet the probabilities implied by the bookmaker odds sum up to more than one (1.07). This added probability is known as the bookmaker take. With it the bookmaker ensures a profit and reduces the possibility of a profitable betting strategy. For our example, the expected return (per unit bet) and the actual return for betting on a draw are (p(b2)/q(b2) − 1) and (r(b2)/q(b2) − 1), respectively. The expected return is positive iff p(b) > q(b). Bets which satisfy this condition are often referred to by bettors as value bets. The objective of sports tipping is to provide the end user with value bets. This, and the fact that there is a required number of tips, is what differentiates generating tips from producing a profitable betting strategy. Betting strategies can also involve varying the bet size; they are more complex and more long-term, which makes them more difficult to relate to the end user.
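The arithmetic of this example can be checked with a few lines of code; the objective probability used below is a hypothetical value chosen only to illustrate the notion of a value bet.

```python
# Worked example for the FC Barcelona - Real Madrid odds above.
odds = {'home': 1.62, 'draw': 3.95, 'away': 4.70}

implied = {outcome: 1.0 / o for outcome, o in odds.items()}
print(implied)                 # q(b1), q(b2), q(b3): roughly 0.62, 0.25, 0.21
print(sum(implied.values()))   # sums to more than 1: the excess is the bookmaker take

p_draw = 0.30                  # hypothetical objective probability of a draw
print(p_draw / implied['draw'] - 1.0)   # expected return; positive iff p(b) > q(b)
```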
2.3 Definition of the Approach
At the time of betting both the objective probability and the actual outcome are unknown and can only be approximated. At that time, we have the odds information, we know which users selected the bet, and we have the selection history
for all the users (and of course the outcome history as well, which enables us to calculate the users’ past performance). We base our approach on the following assumption: the past performance of a user on similar bets is a good estimator of the value of the bet which the user has currently selected. Related work presents strong evidence that bookmaker odds are the best forecaster, so odds are used as a foundation for estimating the value of a bet. To achieve this, we extend the set of users with a “dummy” user, U = U ∪ {u0}, which always selects a bet, so that s(u0, b) = 1 for each b ∈ B. This also enables us to evaluate, in a uniform way, bets which were not selected by any user and bets which were selected by one or more poorly performing users but may still have value. To facilitate the use of neighborhoods, we define the following similarity measure:

d(b_i, b_j) = h(b_i, b_j) \cdot |q(b_i) - q(b_j)| \cdot e^{\frac{|t(b_i) - t(b_j)|}{\beta}}   (1)
The measure (1) is based on the difference in probabilities, multiplied by the function h(b_i, b_j), which is problem-dependent and describes the distance between bets in the sports hierarchy. When no sports-type information is available, or when bets are always from the same sports-type, the function is assigned a constant value. The rightmost factor increases exponentially with the time difference, so that older bets are farther away. Before we define the scoring functions with which we evaluate a neighborhood, we define the subset of K nearest neighbors. Let B(u, b) = {b_k | b_k ∈ B ∧ s(u, b_k) = 1 ∧ t(b_k) < t(b)}, and let B(u, b)_K be the set of the first K elements of B(u, b) when bets are sorted in ascending order of distance (1) to bet b. Once a neighborhood is established, we evaluate it with one of the following two scoring functions.
The first scoring function used to evaluate a bet (relative to the user that chose the bet) is based on a straightforward approximation of the bet’s value by the mean value of the bets from the neighborhood. However, to avoid overestimating risky strategies, we introduce a reliability component. The probability that the mean value is greater than 1 (i.e., that the bet has value), P(p(b)/q(b) > 1), is approximated by bootstrapping [6] the sample mean value on the set B(u, b)_K (re-sampling is repeated 500 times):

\sigma_R(u, b) = \begin{cases} P\left(\frac{p(b)}{q(b)} > 1\right) & \text{if } (s(u, b) = 1) \wedge (|B(u, b)| \geq K) \\ -\infty & \text{otherwise} \end{cases}   (2)
The second score is based on the evaluation of the quality of users’ probabilistic assessments (i.e., the user’s probabilistic forecasting ability). We use the Brier score [4], which is a loss function often used for the evaluation of probabilistic forecasts. A bet’s score equals the ratio between the forecasting quality of the bookmaker odds and the user’s forecasting quality. Therefore, the better
the forecasting quality of the user is, relative to the forecasting quality of the bookmaker odds, the higher the score:

\sigma_B(u, b) = \begin{cases} \frac{1}{K} \sum_{b_k \in B(u, b)_K} \frac{(q(b_k) - r(b_k))^2}{(1 - r(b_k))^2} & \text{if } (s(u, b) = 1) \wedge (|B(u, b)| \geq K) \\ -\infty & \text{otherwise} \end{cases}   (3)

Scores σ_R and σ_B cannot be used together, which effectively gives two different approaches. Regardless of which scoring function is used, we define the final score of a bet as the maximum score across all users:

score(b) = \max(\sigma(u_0, b), \sigma(u_1, b), \ldots, \sigma(u_{|U|}, b))   (4)
Note that when a user does not select a bet, we do not know if the user opted not to select the bet or did not even consider or observe it. Therefore, when evaluating a betting opportunity, we can only use the information provided by the users that chose the bet. The main motivation for using (4) is that it ensures that bets selected by a high scoring user receive a high score, regardless of how many poor quality users have also selected it. The only limitation of (4) is that it does not explicitly take into account the consensus between several high quality users (for example, two bets, one chosen by a single high quality user and the other by the same user and several additional users of the same quality, will receive the same score, although one might argue that the second betting opportunity is more reliable). This could possibly be solved by aggregating the scores of a smaller number of highest scoring users.
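A simplified sketch of the σ_R score for one (user, bet) pair is given below; past bets are assumed to be plain records with fields q (implied probability), r (outcome) and t (day), and the hierarchy weight h of Eq. (5) is collapsed to a constant for brevity. The final score of a bet would then be the maximum of σ_R over all users (including the dummy user u0) that selected it, as in Eq. (4).

```python
import math
import random

def distance(b_i, b_j, beta=365.0):
    """Similarity measure of Eq. (1); the hierarchy weight is simplified to 1."""
    h = 1.0  # a flat stand-in for h(b_i, b_j) from Eq. (5)
    return h * abs(b_i['q'] - b_j['q']) * math.exp(abs(b_i['t'] - b_j['t']) / beta)

def sigma_R(user_history, bet, K=100, n_boot=500):
    """Bootstrap estimate of P(mean value of the K nearest past bets > 1), Eq. (2)."""
    past = [b for b in user_history if b['t'] < bet['t']]
    if len(past) < K:
        return float('-inf')
    neighbours = sorted(past, key=lambda b: distance(b, bet))[:K]
    values = [b['r'] / b['q'] for b in neighbours]   # realized value of each bet
    hits = 0
    for _ in range(n_boot):
        sample = random.choices(values, k=K)         # resample with replacement
        hits += sum(sample) / K > 1.0
    return hits / n_boot
```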
3 Application on a Real-World Data Set

3.1 Data
An undisclosed tipster web site has provided a data set which contains all the bet selections its users made in the period between January 1st 2008 and August 31st 2008. During these 244 days, 52315 betting possibilities from 8 different types of sports (each with several different competitions) were offered and 250 users took part in the competition. Bets were offered in the morning and had to be selected before noon. All bets resolve during the day they are offered for selection. The offered odds were taken from various bookmakers; more specific information about the exact source of individual odds is not known. In total, 34202 selections were made and 16158 distinct betting possibilities were selected by at least one user. Therefore, 36157 betting possibilities were not selected by any user. Figure 1 summarizes the quality of the users involved in the competition. Users who selected fewer than 10 bets during the 244 days are not shown in the figure (there were 67 such users). Only two users generate a significantly positive return at the 90% significance level (circle symbol).
Fig. 1. Number of selected bets and mean return across all selected bets for users who selected at least 10 bets during all 244 days. Circles represent users with a significantly positive return and triangles represent users who perform significantly better than selecting at random.
A further 50 users generate a return significantly better than the bookmaker take (triangle symbol) and are therefore significantly better than picking bets at random. The remaining users are not significantly better than random betting. Therefore, most of the users are non-expert users. As we can already see from Figure 1, some users took a more active part than others and selected a larger number of bets. Additionally, Figure 2 shows that the distribution of bet selections among users resembles a power-law distribution. Let B_data = {b1, b2, ..., b52315} be the set of all distinct bets and U_data = {u1, u2, ..., u250} the set of users. Let t_data(b_i) equal the day on which bet b_i was offered for selection. The selection function s_data is defined according to the users’ selections. The tipster site had a requirement of exactly 10 tips per day and we use the same number of required tips. The effect of changing this requirement is briefly addressed at the end of this section.
3.2 Description of the Used Models
The most naive approach is to assume that all betting opportunities are equally good and therefore to bet on every bet (ALL). The second approach is based on the assumption that betting opportunities which were selected by at least one user are better than those which were not, so we bet only on the former (SEL).
Fig. 2. Users, sorted according to the number of bets they selected during the 244 days of the competition
These two approaches will be used as baselines for comparison. The remaining approaches are restricted by the requirement of exactly 10 tips per day. Therefore, for each of these approaches the bets are sorted according to some scoring function and the 10 best (highest scoring) bets are selected; ties are resolved by choosing at random. The first approach scores the bets according to the number of users that selected the bet (CON). The predictive power of consensus was already mentioned by Forrest and Simmons [8], who found that a consensus among tipsters outperformed individual tipsters. The second approach is based on the probit regression estimate of the objective probability of the i-th bet, p(b_i) (PROBIT). Bets are scored according to their estimated values, score(b_i) = p(b_i) / q_data(b_i). Probit (or logit) regression models are among the most common and successful methods used for predicting the outcome of sports events [3,5,7,9]. Note that only odds and sports/competition-type data are used for the probit estimation. We also use three variations of the nearest neighbors approach described in Section 2. For all three variations, the parameters were set by a sports betting expert. The number of nearest neighbors, K, was set to 100, β (see Eq. (1)) to 365, and the hierarchy weight was defined as follows:

h(b_i, b_j) = \begin{cases} 1.00 & \text{if } b_i \text{ and } b_j \text{ are from the same sport and competition} \\ 1.05 & \text{if } b_i \text{ and } b_j \text{ are from the same sport but not the same competition} \\ 1.10 & \text{otherwise} \end{cases}   (5)
For the purposes of this paper and its real-world data set, we can use a simple 2-level hierarchy, because only sports and competition type data are available for each event. With more detailed event data, a taxonomy could be developed which would also include geographical data and individual teams. Such a taxonomy could be integrated into the learning process, similar to recent attempts at incorporating taxonomies into recommender systems [12]. The first two nearest neighbors approaches, kNNR and kNNB, use the σR and σB scoring functions, respectively. The final variation, kNN-, is a simplified variation of the kNNR model; the only difference is that the set of users is reduced to U = {u0}. Therefore, it uses only the information used by PROBIT and can be used to evaluate and compare the performance of our model when no user selection information is available. Note that PROBIT, kNNB, kNNR, and kNN- use incremental learning: the models are retrained after every day on all the data available at that time.
3.3 Results
The mean returns across all 244 days are shown in Table 1. Note that betting the same amount on every bet effectively makes the mean return equivalent to the mean value minus 1. As expected, selecting every available bet (ALL) is the worst approach and results in a return equal to the bookmaker take. Selecting only bets which were selected by at least one user (SEL) gives a significantly better result, so users possess a certain degree of betting knowledge. These two approaches serve as baselines for comparison, and all other approaches are significantly better. The result of the consensus-based model (CON) implies that consensus among users is indeed an indicator of the value of bets. The two nearest neighbor based approaches which incorporate user-related information (kNNB and kNNR) give better results than probit regression. Furthermore, kNN- produces results which are comparable with those of probit regression. This suggests that using both odds and user selections is an improvement over using bookmaker odds only. The confidence intervals were obtained by treating each day as a sample and by bootstrapping [6]. Note that this takes into account the variance of event outcomes, but does not take into account the possible variance in the incremental learning procedure; a more thorough test will be used when a larger data set is available. The last four approaches include incremental learning and their performance should improve over time. Figure 3 shows how the returns change over the course of the last 144 days. Over 170 days are required for the performance of our approaches to reach a point after which there is no more substantial improvement. Note that kNNR and kNNB give significantly better tips than probit regression when the first 100 days are excluded (the corresponding p-values are 0.002 and 0.043). However, across all 244 days the difference is not as significant (the corresponding p-values are 0.197 and 0.105 for kNNR and kNNB, respectively). These results suggest that probit regression is more suitable than our approach when only a small learning set is available.
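The day-level bootstrap behind these confidence intervals can be sketched as follows; the percentile method and the synthetic daily returns are assumptions for illustration and may differ from the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(daily_returns, level=0.90, n_boot=10_000):
    """Percentile bootstrap interval for the mean of per-day returns."""
    daily_returns = np.asarray(daily_returns)
    means = np.array([
        rng.choice(daily_returns, size=len(daily_returns), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return daily_returns.mean(), (lo, hi)

daily_returns = rng.normal(-0.04, 0.15, size=244)   # hypothetical 244 daily returns
print(bootstrap_ci(daily_returns))
```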
Table 1. Mean returns across 244 days and 90% confidence intervals

| Model | mean return | 90% conf. int. |
| ALL | -0.159 | (-0.166, -0.153) |
| SEL | -0.098 | (-0.109, -0.083) |
| CON | -0.077 | (-0.099, -0.053) |
| kNNB | -0.033 | (-0.056, -0.013) |
| kNNR | -0.038 | (-0.055, -0.019) |
| kNN- | -0.058 | (-0.081, -0.035) |
| PROBIT | -0.052 | (-0.078, -0.033) |
Fig. 3. A window of 100 days was used to show how the mean return of models changes as time progresses. Both nearest neighbor based models take more than 170 days to reach a plateau after which there is no more significant improvement.
Finally, changing the required number of tips per day may have a significant impact on the model’s performance. Figure 4 shows how the mean return changes with an increasing number of required tips per day. When only a small number of tips is required (from 1 to 4), PROBIT outperforms kNNB. The correlation coefficients between the mean return and the required number of tips per day are −0.931 and −0.904 for PROBIT and kNNB, respectively.
Fig. 4. The mean return decreases with an increasing number of required tips per day
The strong negative correlation confirms that both models’ scores impose an order which correlates with the value of bets.
4 Conclusion
We have shown that our similarity based approach is suitable for selecting sports betting tips. When only odds and sports/competition type information are used, the performance is comparable with that of the probit regression based approach. When additional information provided by users’ bet selections is also used, our approach performs better than probit regression. Similarly to related work, we found that consensus amongst tipsters is relevant for predicting the outcome of sports events, but bookmaker odds are a better source of outcome-related information than tipsters. The next step in our ongoing research is to improve the model by including an automatic and incremental tuning of parameters. The implementation of two advanced aspects of bet selections has been delegated to future work: variable bet size instead of binary bet selections and the dependence between bets on different outcomes of the same match/game. As a part of future work it would be interesting to research if odds and the information from a tipster site can be combined into a profitable betting strategy. If and when more data becomes available, it would also be interesting to analyze the relation between the number of active users a tipster site has and the number and/or quality of tips the site can provide.
References

1. Andersson, P., Edman, J., Ekman, M.: Predicting the World Cup 2002 in soccer: Performance and confidence of experts and non-experts. International Journal of Forecasting 21(3), 565–576 (2005)
2. Andersson, P., Memmert, D., Popowicz, E.: Forecasting outcomes of the World Cup 2006 in football: Performance and confidence of bettors and laypeople. Psychology of Sport and Exercise 10(1), 116–123 (2009)
3. Boulier, B.L., Stekler, H.O.: Predicting the outcomes of National Football League games. International Journal of Forecasting 19(2), 257–270 (2003)
4. Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly Weather Review 75, 1–3 (1950)
5. Dixon, M.J., Pope, P.F.: The value of statistical forecasts in the UK association football betting market. International Journal of Forecasting 20, 697–711 (2004)
6. Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1(1), 54–75 (1986)
7. Forrest, D., Goddard, J., Simmons, R.: Odds-setters as forecasters: The case of English football. International Journal of Forecasting 21(3), 551–564 (2005)
8. Forrest, D., Simmons, R.: Forecasting sport: the behaviour and performance of football tipsters. International Journal of Forecasting 16, 317–331 (2000)
9. Goddard, J., Asimakopoulos, I.: Forecasting football results and the efficiency of fixed-odds betting. Journal of Forecasting 23, 51–66 (2004)
10. Scheibehenne, B., Bröder, A.: Predicting Wimbledon 2005 tennis results by mere player name recognition. International Journal of Forecasting 23(3), 415–426 (2007)
11. Song, C., Boulier, B.L., Stekler, H.O.: The comparative accuracy of judgmental and model forecasts of American football games. International Journal of Forecasting 23(3), 405–413 (2007)
12. Ziegler, C., Lausen, G., Konstan, J.A.: On exploiting classification taxonomies in recommender systems. AI Communications 21(2-3), 97–125 (2008)
An Approach to Web-Scale Named-Entity Disambiguation

Luís Sarmento¹, Alexander Kehlenbeck², Eugénio Oliveira¹, and Lyle Ungar³

¹ Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
[email protected], [email protected]
² Google Inc., New York, NY, USA
[email protected]
³ University of Pennsylvania - CS, 504 Levine, 200 S. 33rd St, Philadelphia, PA, USA
[email protected]
Abstract. We present a multi-pass clustering approach to large scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data-sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
1 Introduction

Realistic named-entity disambiguation (NED) of Web data involves several challenges that have not yet been considered simultaneously. First, when moving NED to the web we need to deal with high levels of ambiguity. Since there are so many documents on the Web, the same name will often refer to hundreds of different entities. This makes the problem much harder compared with NED approaches for small collections, where one needs to disambiguate only among a few possibilities. Second, distributions of mentions on the web are highly skewed. For each ambiguous name, there are usually one or two dominant entities to which the vast majority of mentions refer, even when many entities share the same name. For example, most mentions of the name “Paris” found on the web refer to the capital of France (and a smaller number to Paris Hilton), while there are dozens of well-known entities with that name (see the Wikipedia disambiguation page for “Paris”: http://en.wikipedia.org/wiki/Paris_(disambiguation)). Table 1 shows hit counts for several queries sent to Google containing the word “Paris” and additional (potentially) disambiguating keywords. These values are merely indicative of the orders
of magnitude at stake, since hit counts are known to change significantly over time. The real challenge is to be able to disambiguate between mentions of the less frequently mentioned entities, for which there is proportionally much less information and more noise. Third, most solutions to NED presented so far involve processing relatively small data-sets. Realistic NED involves processing web-scale collections (terabyte size), requiring computationally efficient ways of representing and processing data and, sometimes, involving practical decisions that might negatively affect the final results for some cases.

Table 1. Number of Google hits obtained for several entities named “Paris”

| query | hit count (x10^6) | % |
| paris | 583 | 100 |
| paris france | 457 | 78.4 |
| paris hilton | 58.2 | 9.99 |
| paris greek troy | 4.130 | 0.71 |
| paris mo | 1.430 | 0.25 |
| paris tx | 0.995 | 0.17 |
| paris sempron | 0.299 | 0.04 |
There are also other fundamental questions that have not yet been investigated. Many of the solutions to NED involve data-driven techniques, such as clustering. Such techniques usually benefit from processing larger amounts of data. Therefore, one would expect to obtain better NED results as the size of the collection to be disambiguated increases. However, as the size of the collection to be disambiguated becomes larger, the variety of different entities and contexts that have to be dealt with also increases. As the contexts in which mentions occur become more diverse, data-driven approaches potentially become harder. The exact balance between these two effects has yet to be quantified. In this paper we present a clustering-based approach to disambiguating entities on the Web. The algorithm we propose is capable of dealing with an arbitrarily high number of entity types, is scalable to the number of mentions on the web, and can be distributed over a cluster of machines to process large web collections. For evaluating the results of the disambiguation procedure, we developed a gold standard based on entity information extracted from Wikipedia. We experimented with disambiguating increasingly large samples of the web to test how well the algorithm scales and whether or not more data leads to better results. The results suggest that as the size of the collection increases, more complex cases of ambiguity emerge, making the definition of the NED task itself less clear. This seems to be an intrinsic characteristic of highly heterogeneous document collections, and suggests the existence of fundamental upper limits on the performance of clustering-based approaches to NED based only on name co-occurrence information.
2 Related Work

There are currently two main lines of research on NED: (i) clustering approaches based on information extracted from the documents (e.g., [1,2,3,4]) and (ii) approaches that use
external knowledge sources containing information about entities (e.g., Wikipedia) to perform disambiguation (e.g., [5,6,7]). Mann and Yarowsky [3] present a disambiguation procedure for person names based on a multi-pass clustering procedure. First, mentions are compared using an all-against-all strategy, in order to obtain very “pure”, yet small, seed clusters, which should represent the main entities. Then, the remaining mentions are assigned to these seed clusters using a nearest-neighbor policy. In a third step, clustering proceeds until no more clustering is possible. The authors experimented with several different features to describe mentions. Best results were obtained using biographic features in combination with other statistically obtained features. Another clustering-based disambiguation method is presented in [1]. Mentions are described by a vector composed of tf-idf weighted terms extracted using a 55-word window. The authors compare two methods based on variations of streaming clustering (which are computationally efficient but order dependent and sensitive to outliers) and one agglomerative clustering method (which involves all-against-all comparisons). Results showed that the agglomerative clustering method leads to better precision and recall figures and higher stability to changes in parameters (similarity threshold and data partitioning). Two other methods to disambiguate personal names, based on clustering and graph partitioning, are presented and compared in [2]. Again, information about name co-occurrence is used to disambiguate person entities. The authors conclude that name co-occurrence information provides an advantage over other features for achieving disambiguation. However, this method considers only situations where only one name at a time is ambiguous. The approach presented in [8] is more sophisticated because it assumes that co-occurring names are themselves ambiguous. Thus, an iterative clustering approach is proposed that aims at collectively resolving ambiguity. In [5], a set of disambiguation vectors is built using information extracted from Wikipedia. Assuming each entity has its own Wikipedia page, a vector description of the entity is built using words found inside a 55-word window around mentions of the name in the corresponding page. Wikipedia categories are also added to the vectors, using a pre-computed word-to-category index, thus exploiting strong correlations between words in text (sparse features) and categories (e.g., “concert” is more strongly correlated with the category “Musicians” than with “Professional Wrestlers”). Disambiguation is performed by comparing the vector of the mention to be disambiguated with the set of disambiguation vectors. In [6] Wikipedia is also used to build vector representations of entities. However, the method does not rely on direct comparison between vector representations of entities and the vector of each individual mention; instead, it tries to maximize the agreement between the disambiguation hypotheses of all mentions in a document. In [7] the authors attempt large-scale taxonomy-based disambiguation / resolution over a collection of 264 million documents (although the number of mentions to disambiguate was limited to 550 million). The method involved comparing the 10-word window context around a mention with “typical” contexts that had been previously collected and manually associated with the 24 reference nodes, i.e., the largest nodes of the taxonomy (e.g., city, profession, country).
Disambiguation (or resolution) is achieved by finding whether the taxonomy node that includes the ambiguous name belongs to the subtree of the reference node
whose typical context has the highest similarity with the context of the mention (based on the cosine metric and tf-idf feature weighting). The authors report an accuracy of 82%.
3 A Clustering Approach to NED

In this work we focus on the disambiguation problem, that is, the problem of determining whether occurrences of the same name in different documents refer to the same entity or to different ones that share the same lexical representation (following standard practice [9], we assume that a name inside a document can only refer to one entity). For example, the name “Amsterdam” can be used to refer to many different geographic locations, to a novel, to several songs, to a ship, to a pop music band, and to many other entities2. We do not address the related problem of conflating mentions that use different names to refer to the same entity (e.g., “George W. Bush”, “George Bush”, “Mr. Bush”, “president Bush”, “the President”, “Dubya”). A solution to the name conflation problem can be built on top of the solution provided for the name ambiguity problem (for an interesting approach to large-scale name conflation see [4]). NED can be formulated as a clustering task. Let mij represent a mention, i.e., the occurrence of name ni in document dj, and let Mall = {m11, m21, ..., mik} be the set of all mentions found in a given document collection C = {d1, d2, ..., dk}. Disambiguation can be achieved by clustering together all mentions in Mall that refer to the same entity ej. The goal is to partition Mall into several disjoint clusters of mentions, M1, M2, M3, ..., Mn, so that each of them contains mentions that refer to one and only one entity ej. Also, all mentions of a given entity ej should end up in a single cluster.

3.1 Feature Vector Generation

We start by assuming that a mention of a given name can be disambiguated using information about the names with which it co-occurs within the same document. For example, mentions of “Amsterdam” that refer to the capital of the Netherlands will probably co-occur with mentions of “Netherlands”, “Utrecht” or “Rijksmuseum”, while those mentions of Amsterdam that refer to the novel will probably co-occur with “Ian McEwan” or “Amazon”. Under this assumption, describing mentions using the set of co-occurring names as features ({“Netherlands”, “Utrecht”, “Rijksmuseum”, ...} vs. {“Ian McEwan”, “Amazon”, ...}) should lead to clusters that group mentions that refer unambiguously to one specific entity (the capital of the Netherlands vs. the novel). Let N(dk) be the set of names found in document dk. The mention of name nj in document dk, mjk, will be described by a feature vector of name-value tuples (ni, vi):

mjk = [(n1, v1), (n2, v2), (n3, v3), ..., (ni, vi)]    (1)
with ni ∈ N(dk) \ nj, and vi being a value obtained through a generic feature weighting function (for example TF-IDF or Mutual Information). The input for our clustering procedure is an annotated collection of documents, Cannot. Therefore, it requires names to be previously identified in each document, although type classification is not needed.
2 See http://en.wikipedia.org/wiki/Amsterdam_(disambiguation)
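As an illustration of this representation, the following minimal sketch (our own assumption, not the authors' code) builds one TF-IDF weighted feature vector per mention from an already annotated collection:

from collections import Counter
import math

def build_mention_vectors(annotated_docs):
    """annotated_docs: dict doc_id -> list of names recognized in that document.
    Returns a dict (name, doc_id) -> {co-occurring name: tf-idf weight}."""
    # document frequency of each name, used for the IDF term
    df = Counter()
    for names in annotated_docs.values():
        for n in set(names):
            df[n] += 1
    n_docs = len(annotated_docs)

    vectors = {}
    for doc_id, names in annotated_docs.items():
        tf = Counter(names)
        for name in tf:
            # features are the *other* names co-occurring in the same document
            vec = {other: count * math.log(n_docs / df[other])
                   for other, count in tf.items() if other != name}
            vectors[(name, doc_id)] = vec
    return vectors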
3.2 Clustering Procedure Overview

The procedure we propose for performing NED over a collection of annotated documents Cannot starts by extracting all names from each document dk to generate mention feature vectors mjk (a mention is the occurrence of a name in a document). Feature vectors are then grouped by name, so as to have a set of mention feature vectors per name: M(nj) = {mj1, mj2, ..., mjx}. Vectors inside each set M(nj) are then compared according to a given comparison strategy and similarity metric sim(mnj, mnk) (e.g., cosine or Jaccard distance). Finally, a clustering algorithm is applied to each M(nj), using the information about vector similarity computed in the previous step. The algorithm itself is generic in the sense that it does not establish any specific strategy for comparing feature vectors prior to clustering, nor a specific choice for the clustering technique. At this point, we assume only that an efficient algorithm exists for performing vector comparison and clustering. For example, Min-Hash techniques [10] provide an efficient way of computing an approximation to the nearest-neighbor problem, which can be used for computing distances between vectors. Clustering by Committee [11] and variations of streaming clustering techniques [12] might be options for the clustering stage. In any case, one important advantage of this algorithm is that it provides a natural way of distributing the computational load. Since feature vectors are grouped by name, all information that is required to resolve ambiguity for each name is aggregated and can be processed separately: both the calculation of vector similarities and the clustering process can be distributed over a cluster of machines, on a per-name basis, thus helping scalability.
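The per-name organization of the procedure can be outlined as follows (a simplified, single-machine sketch under our own assumptions; the generic cluster_name function stands for whatever comparison strategy and clustering technique is plugged in):

from collections import defaultdict

def disambiguate(mention_vectors, cluster_name):
    """mention_vectors: dict (name, doc_id) -> feature vector.
    cluster_name: callable that clusters the list of (doc_id, vector) mentions of one name.
    Returns name -> list of clusters. Each name is processed independently,
    which is what allows the per-name distribution over machines."""
    by_name = defaultdict(list)
    for (name, doc_id), vec in mention_vectors.items():
        by_name[name].append((doc_id, vec))
    return {name: cluster_name(mentions) for name, mentions in by_name.items()}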
4 Vector Comparison and Clustering

As explained, the size of the clustering problem at stake - millions of names and thousands of millions of mentions - requires distributed algorithms that can be deployed on large computer clusters. Right from the beginning our method was designed to be run on a Map-Reduce [13] platform, a data-intensive supercomputing paradigm that simplifies the distribution of data (hundreds of gigabytes) and tasks over thousands of computer nodes (typically commodity computers). Map-Reduce provides a generic framework for scaling algorithms to very large data sets, but in order to choose an appropriate clustering method for NED, some specific characteristics of the dataset and of the problem should be taken into account. First, the mention distribution is highly skewed, and is dominated by the one or two most popular entities. Thus, the clustering algorithm chosen should be able to handle unbalanced data distributions and still produce correct clusters both for dominant and non-dominant entities. Second, the number of entities onto which the set of mentions M(nj) should be mapped, and thus the final number of clusters, is not known in advance. Therefore, the stopping criterion for the clustering procedure should not depend on a predefined number of desired final clusters, which is difficult to estimate. Instead, it should depend on parameters related to the input data and cluster properties. We propose using a graph-based clustering approach. For each name nj, we start by computing pairwise distances between feature vectors to build the link graph G(nj). Two mentions are linked in the graph if their similarity is higher than a given threshold
smin. Then, we find the connected components of the link graph G(nj); the retrieved connected components represent the clusters we seek. The only parameter of this approach is smin; there is no need to set a target number of clusters to be produced. So far we have not found an automatic method for estimating the smin parameter; the values used in our experiments range from 0.2 to 0.4. When building the link graph G(nj) for each name, one only needs to perform enough comparisons between mentions to build a graph that is sufficiently connected to allow retrieving the correct components. The fact that the distribution of mentions among the entities is highly skewed turns out to be advantageous for building the link graph G(nj). If we pick mentions randomly from the set M(nj), for any of the mentions belonging to the (one or two) dominant entities it should be possible to quickly find another one with above-threshold similarity (because there are so many of them). Thus, for mentions of the dominant entities, we can obtain a significant decrease in the number of comparisons while almost surely keeping enough connectivity to retrieve the connected components. We showed elsewhere [14] that if each mention is compared to other mentions only until kpos above-threshold similar mentions are found, it is possible to build a sufficiently connected link graph in O(|M(nj)| · C · kpos), with C being the number of true clusters (i.e., different entities for the name nj) in M(nj). Since the number of entities for each name is expected to be orders of magnitude smaller than the number of its mentions, this approach leads to significant savings in computational work as compared to an all-against-all comparison strategy (i.e., O(|M(nj)|^2)).

4.1 Additional Scalability Issues

There are still some important scalability problems that we need to solve. First, there are so many mentions on the web for the most frequent names that the corresponding feature vectors cannot simultaneously fit into the RAM of a single machine to perform comparisons between them. For illustration purposes, we present in Table 2 the number of documents (hence mentions, under our definition) found by Google for a few very frequent, and ambiguous, names (we use the number of possible entities found in the corresponding Wikipedia disambiguation page for each name as a rough indicator of its ambiguity).

Table 2. An illustration of the number of Google hits found on the web for some frequent names (hits may change), and the corresponding number of entities found in Wikipedia

name        # Wiki entities   Google hits (x10^6)
Paris       90                583
Amsterdam   35                185
Jaguar      34                73.4
Pluto       25                13.8

Second, even if they did fit simultaneously in RAM, processing these very frequent names would require much more time than processing less frequent names (which may have only a few hundred mentions), leading to extremely long tails in the overall processing time. Therefore, we need to break the set of mentions for each
name into smaller partitions, each with at most nmax mentions, so that they can be distributed more evenly across machines. However, by splitting the data into multiple partitions and placing them on different machines, we lose the ability to compare all mentions, which would be required to find appropriate (i.e., complete) clusters. In fact, for each (frequent) name we are breaking the corresponding clustering problem into several independent clustering problems. Many of these partitions will produce clusters that correspond to the same entity, and so they need to be merged afterwards. Since after the first clustering pass we should have far fewer clusters than mentions, re-clustering these clusters is certainly a more tractable problem. Clusters can be described by the feature vectors generated from the aggregation of the feature vectors of the mentions they contain (e.g., their centroid). Comparisons can then be made using any vector distance metric over such vector descriptions, also on a per-name basis. After the first stage of clustering, the size of the resulting clusters should also follow a highly skewed distribution. There will be several larger clusters corresponding to the few dominant entities, and many smaller clusters corresponding both to non-dominant entities and to (small fragments of) dominant entities. Taking into account this typical distribution (which we systematically found in our experiments), we developed a dedicated re-clustering procedure to merge results from partitions. This procedure is applied independently for each name, and thus it can be trivially run in parallel. For each name, we group all clusters obtained in each partition and divide them into two groups: Big Clusters, Cbig, and Small Clusters, Csmall. Cbig is composed of the 10% biggest clusters produced in the first clustering pass, while all others are included in Csmall. We then use the following re-clustering strategy:
1. Pre-assign Small Clusters to Big Clusters. Start by trying to assign each small cluster to one big cluster. This assignment is made using a nearest-neighbor strategy (with a minimum similarity threshold), and thus tends not to make many incorrect assignments, while greatly reducing the total number of clusters. Cluster descriptions are updated accordingly.
2. Merge Small Clusters. Try to merge all the unassigned small clusters with each other. The main goal here is to make sure that some of the less represented entities grow into medium-size clusters, so they get enough “critical mass” to be kept, even if we simply filter out the smaller clusters. Cluster descriptions are updated accordingly.
3. Merge Big and Medium Clusters. Try to re-cluster the medium and big clusters based on only a few top features. The intuition is that big clusters can usually be “described” by a small number of features (e.g., their top 3), which will be highly discriminative for the entity at stake. We thus achieve cluster consolidation, while reducing the risk of performing incorrect merge operations due to noisy features.
4. Repeat 2 and 3 to reduce fragmentation.
Note that Big Clusters and Small Clusters are never compared simultaneously (i.e., all-against-all), which avoids the problems that might come from comparing elements with significant size differences.
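As a concrete illustration of the first-pass clustering described in this section (before the re-clustering stage), the sketch below links mentions whose similarity exceeds smin, stops comparing a mention once kpos above-threshold neighbours have been found for it, and returns the connected components as clusters. It is a minimal, single-partition sketch under our own assumptions, not the authors' Map-Reduce implementation; the parameter values are illustrative.

from collections import defaultdict

def cluster_mentions(vectors, sim, s_min=0.3, k_pos=5):
    """First-pass clustering of the mentions of one name.
    vectors: list of mention feature vectors; sim: similarity function."""
    n = len(vectors)
    parent = list(range(n))                 # union-find over mention indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    links = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if links[i] >= k_pos:
                break                       # enough connectivity for mention i
            if sim(vectors[i], vectors[j]) > s_min:
                union(i, j)
                links[i] += 1
                links[j] += 1

    components = defaultdict(list)          # connected components = clusters
    for i in range(n):
        components[find(i)].append(i)
    return list(components.values())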
5 Evaluation Framework

Evaluating the results of clustering algorithms is difficult. When gold standard clusters are available, one can evaluate clusters by comparing clustering results with the existing standard. Several metrics have been proposed for measuring how “close” test clusters are to reference (gold standard) clusters. Simpler metrics are based on frequency counts regarding how individual items [15] or pairs of items [16,17] are distributed among test clusters and gold standard clusters. These measures, however, are sensitive to the number of items being evaluated, so we opted for two information-theoretic metrics, which depend solely on the item distributions. Given two sets of clusters, the test clusters T, with |T| clusters, and the gold clusters G, with |G| clusters, we wish to evaluate how well the clusters in T, t1, t2, ..., t|T|, represent the clusters in G, g1, g2, ..., g|G|. We first obtain the intersection matrix I with |T| lines and |G| columns. Element ixy of I indicates the number of items in common between test cluster tx and gold cluster gy. Ideally, all the elements in a given test cluster tx should belong to only one of the gold clusters. Such a cluster tx is considered “pure” if it contains only mentions of a unique entity as defined by the gold standard. If, on the other hand, elements from tx are found to belong to several gold clusters, then the clustering algorithm was unable to correctly delimit the entity, and disambiguation was not totally achieved. To quantify how the elements in test cluster tx are distributed over the gold standard, we use the entropy of the distribution of the elements in tx over all the clusters gy. High-quality clusters should be very pure and thus have very low entropy values. Let It(x) be the total number of elements of cluster tx that were found in gold clusters. Then:

e_t(t_x) = \sum_{y=0}^{|G|} -\frac{i_{xy}}{I_t(x)} \cdot \ln\left(\frac{i_{xy}}{I_t(x)}\right)    (2)
Therefore, for all test clusters obtained for name nj we can compute Et(nj) as the weighted average of the entropy values e_t(t_x) obtained for each test cluster tx:

E_t(n_j) = \frac{\sum_{x=0}^{|T(n_j)|} |t_x| \cdot e_t(t_x)}{\sum_{x=0}^{|T(n_j)|} |t_x|}    (3)
with |tx| being the number of mentions in cluster tx, including those not found in gold clusters, and |T(nj)| the number of test clusters obtained for name nj. We are also interested in measuring how elements from clusters in the gold standard are spread throughout the test clusters we produced. Again, we would like to have all elements of a gold standard cluster in the smallest possible number of test clusters, ideally only one. Then, for each gold cluster gy we can also use an entropy eg(gy) to measure how the elements of gold standard cluster gy are spread over the clusters we are testing. eg(gy) can be computed by a formula similar to that of Equation 2, substituting references to test clusters with references to gold clusters, and vice-versa. Similarly, a global performance figure, Eg(nj), can be obtained by performing a weighted average over eg(gy) for all gold clusters (similar to Equation 3).
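These purity measures can be computed directly from the intersection matrix; the sketch below is our own minimal rendition of Equations 2 and 3, with cluster contents represented as Python sets of mention identifiers:

import math

def cluster_entropy(row, total_found):
    """Entropy of one test cluster's distribution over the gold clusters (Eq. 2).
    row: intersection counts i_xy for a fixed x; total_found: I_t(x)."""
    e = 0.0
    for i_xy in row:
        if i_xy > 0:
            p = i_xy / total_found
            e -= p * math.log(p)
    return e

def weighted_entropy(test_clusters, gold_clusters):
    """E_t for one name (Eq. 3): size-weighted average of per-cluster entropies."""
    num = den = 0.0
    for t in test_clusters:
        row = [len(t & g) for g in gold_clusters]
        found = sum(row)
        e = cluster_entropy(row, found) if found else 0.0
        num += len(t) * e
        den += len(t)
    return num / den if den else 0.0

Swapping the roles of the test and gold clusters in weighted_entropy gives the analogous E_g figure.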
Finally, we need to evaluate recall, i.e., the proportion of elements in the gold clusters that are in fact found in any test cluster. If Ig(y) is the total number of elements in cluster gy that were found in test clusters, we may define the mention recall metric for gold cluster gy as:

r_m(g_y) = \frac{I_g(y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}    (4)
An overall recall figure for this name, Rm(nj), can be obtained by taking a weighted average of r_m(g_y) over all gold clusters:

R_m(n_j) = \frac{\sum_{y=0}^{|G(n_j)|} |g_y| \cdot r_m(g_y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}    (5)
Similarly, we can compute Re(nj), which measures how many of the entities included in the gold standard clusters for nj are found in the corresponding test clusters. This figure is important because the mention distribution among entities is expected to be very unbalanced. The previous figures are calculated for each name nj ∈ N. For assessing the global performance of the clustering-based NED procedure for all names in N, we need to combine the performances obtained for the individual names. To do so, we use the arithmetic average of the previous metrics over all names: Et, Eg, Rm and Re.

5.1 Preparing the Gold Standard

We used the English version of Wikipedia to develop a gold standard for evaluating NED (although the procedure can be replicated for other languages). We assume that each article in Wikipedia can be related to one unambiguous entity / concept. Let Wseed(nj) be the set of Wikipedia articles found for name nj (nj can usually be easily identified by the article title). If the number of articles for nj is greater than one, then nj is known to be ambiguous, and each possible entity is unambiguously related to one of the articles. The set Wseed(nj) can be used as a seed for obtaining more documents that unambiguously refer to entities mentioned using name nj. For each page in Wseed(nj), which refers to an unambiguous entity ek, we find all its immediate neighbors in the web link graph, both inside and outside Wikipedia. These linked pages will probably have mentions of the name nj, which can be assumed to refer to the same entity ek described by the Wikipedia article to which they are linked. The output of the expansion procedure is a set of gold clusters for each name nj. These gold clusters are sets of pages that mention name nj and that can be uniquely assigned to one Wikipedia article (which stands for a specific entity). A problem arises when such pages are linked to more than one Wikipedia article that describes entities mentioned by the same name, i.e., to more than one article from the same seed set Wseed(nj). In those cases, we cannot automatically decide which entity is in fact being mentioned, and thus all occurrences of the corresponding name in that document have to be considered ambiguous. Thus, those documents are excluded from the gold clusters for the name at stake (nj). Using such
expansion and filtering procedures, we obtained a gold standard with around 9.3 million mentions for about 52,000 ambiguous names. In Table 3 we present the distribution of the gold names in four classes based on the entropy of the corresponding gold clusters. Low entropy values correspond to names where there is clearly one dominant entity to which the vast majority of the mentions belong, while high entropy values are related to names for which the mention distribution among entities is less skewed.

Table 3. Internal entropy of the names in the gold standard

Entropy      # names   % names
0 to 0.1     768       1.5
0.1 to 0.5   7610      14.5
0.5 to 1     29304     56.0
1 or more    14657     28.0
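The expansion-and-filtering step just described can be outlined as follows; this is a hypothetical sketch of ours, in which the inputs (one seed article per entity and a precomputed neighbourhood map over the web link graph) are assumptions rather than the authors' actual data structures:

def build_gold_clusters(seed_articles, neighbours):
    """seed_articles: list of Wikipedia article ids for one ambiguous name,
    one per entity. neighbours: article id -> set of pages linked to/from it.
    Pages linked to more than one seed article are ambiguous and are dropped."""
    page_to_entities = {}
    for entity, article in enumerate(seed_articles):
        for page in neighbours.get(article, set()):
            page_to_entities.setdefault(page, set()).add(entity)

    gold = {entity: set() for entity in range(len(seed_articles))}
    for page, entities in page_to_entities.items():
        if len(entities) == 1:          # unambiguously tied to one entity: keep
            gold[entities.pop()].add(page)
    return gold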
6 Experimental Setup

In order to investigate how scalable our algorithm is and whether or not NED performance improves as the amount of data to be disambiguated grows, we experimented with clustering different portions of a large web collection of over a billion documents (in English). The web collection had been previously analyzed by a wide-scope named-entity recognition system [18], so we were able to use the name annotations in each document to produce feature vectors for the clustering procedure. We first took a 1% sample of the complete web collection (randomly choosing 1% of the documents) and performed the complete NED procedure several times while slowly increasing the value of the smin parameter, i.e., the minimum similarity for two mention vectors to be considered linked in the link graph. This allowed us to obtain several reference points for the values of Et, Eg, Rm and Re for a 1% sample. We then proceeded by performing NED over samples of different sizes - 0.5%, 2% and 5% - so that we could compare the results with the ones previously obtained for 1%. To allow a fair comparison, we matched the results obtained for the 0.5%, 2% and 5% samples with those obtained for one of the 1% samples with the closest value of Et, i.e., similar “purity” values. Results were evaluated against the gold standard (see Section 5.1). All code was implemented in the Map-Reduce [13] paradigm and experiments were run in parallel over 2048 machines. Because of limited RAM and load-balancing issues, names were divided into partitions of maximum size 3000. For very frequent names this may lead to considerable fragmentation, because there can be hundreds of thousands of mentions for such names. Each mention vector was limited to having, at most, 5000 features (i.e., corresponding to co-occurring names in the same document). We use the Jaccard metric to compare vectors (after first filtering out less significant features based on minimum tf-idf and frequency values). At the end of the first stage of clustering, all clusters with fewer than 5 elements are filtered out to reduce the total number of clusters to be processed in the second stage. This can have an obvious impact on final recall values if there are too many such small clusters at the end of the first stage.
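For reference, the Jaccard comparison used here can be sketched as follows (our own minimal version, which ignores the feature weights and operates on the sets of co-occurring names):

def jaccard(vec_a, vec_b):
    """Jaccard similarity between two mention feature vectors, treated as
    sets of co-occurring names (tf-idf weights are ignored in this sketch)."""
    a, b = set(vec_a), set(vec_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

A function of this kind can serve as the sim argument of the first-pass clustering sketch shown in Section 4.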
7 Results and Analysis

Table 4 contains the values of Et, Eg, Rm and Re for the 0.5%, the 2% and the 5% samples, and the corresponding values for the 1% samples with the closest Et obtained. It also presents the value of smin with which each result was obtained, and the clustering ratio parameter, Crat, which gives the relation between the number of clusters obtained (after filtering) and the number of clusters in the gold standard.

Table 4. Performance metrics for three different comparison scenarios

%@smin           Et      Eg      Rm (%)   Re (%)   Crat
[email protected]  0.0003  0.0056  0.024    1.16     1.23
[email protected]  0.0001  0.0085  0.055    1.74     1.82
[email protected]  0.0042  0.0226  0.135    3.70     2.06
[email protected]  0.0042  0.0312  0.294    5.43     3.27
[email protected]  0.0103  0.0212  0.186    5.00     2.18
[email protected]  0.0140  0.0797  0.912    12.4     6.91
One first observation is that to keep the values of Et comparable, the smin parameter of the larger sample has to be higher than that of the smaller sample. This was expected, because as the number of mentions to be disambiguated increases, the corresponding vector space tends to become more dense. Thus, in order to avoid noisy clusters we need to increase smin to make sure that only mention vectors that are really close in the vector space actually become linked in the link graph, and thus generate pure clusters. Increasing smin should, however, lead to higher fragmentation and to producing many small clusters. The Crat parameter increases both when the size of the sample increases and when smin increases for the same sample size (the 1% sample), which confirms that fragmentation does in fact increase. Recall values, Rm and Re, seem very low. However, one has to take into account that the number of gold standard documents in the sample is proportional to the sample size. Thus, for the 1% sample, recall values cannot be higher than 1% (if sampling is unbiased, as we expect it to be). We are primarily interested in observing the relative changes of recall with sample size. For that, we computed the ratios between the recall figures (Rm and Re) obtained for the larger and the smaller samples that are being compared in each pair of rows. Table 5 shows the values of these two parameters, rm+/- and re+/-, for the three comparison situations.

Table 5. Ratio between recall values Rm and Re of larger and smaller samples

% vs %           rm+/-   re+/-
0.5% vs. 1.0%    2.28    1.5
1.0% vs. 2.0%    2.17    1.48
1.0% vs. 5.0%    4.9     2.48

For the 0.5% vs 1% and the 1% vs 2% scenarios, we can see that even with better (i.e., lower) values for Et, the mention recall Rm increased faster than the data size; in both cases the recall ratio rm+/- is higher than the data increase ratio
(twice as many documents). For the 1% vs 5% scenario, the 5-fold increase in the number of documents did not lead to a 5-fold increase in Rm, although it almost did. However, if we look at the re+/- ratio for the entity recall, we see that it is not increasing as fast as the data size, meaning that we are losing entities (found in the gold standard) as we process more data. The combination of these two factors indicates that for the entities being kept we are able to cluster more and more mentions, but we are losing all the mentions of some more obscure entities. Additionally, the recall ratios systematically decrease as we increase the size of the data sets to be disambiguated. We believe that there are two main reasons for this. The first reason is a consequence of the compromises we had to make in our algorithm to allow it to process web-scale collections. As we increase the size of the sample, and thus the number of mentions to be disambiguated, the number of partitions made for each name also increases (each partition has 3,000 mentions). The overall clustering problem is thus divided into a large number of smaller independent clustering problems whose solutions should ideally be merged in the re-clustering stage. However, for less frequent entities, the partitioning procedure will disperse the mentions over too many partitions, which, in combination with high values of smin, will lead to the generation of more, but much smaller, clusters. Chances are that most of these clusters end up being filtered out after the first stage of clustering and do not even get the chance of being merged in the second clustering stage. Since our gold standard contains some quite exotic entities mentioned in Wikipedia that are probably under-represented in the web collection, the corresponding clusters will be relatively small and will eventually be completely filtered out. This progressively affects Re, and also Rm, as the sample gets larger, offsetting the possible positive effects that would result from having more data and a denser vector space. These positive effects were only visible when partitioning was not too problematic (i.e., for the 0.5%, 1.0% and 2.0% samples). The second reason has to do with a more fundamental issue for NED, and it only became obvious after manually inspecting the results for very frequent names, such as “Amsterdam”. As we increased the size of the data to be disambiguated, and smin accordingly, we noticed that the results for such names were composed of many clusters concerning the several possible entities, as expected, but for the dominant entities at stake (for example Amsterdam, the Dutch capital) there was a surprisingly high number of medium and large clusters. These clusters should have been merged together into a single very large cluster, since they all rather obviously (based on inspection of their features) seemed to refer to the same (dominant) entity. However, each of these clusters appeared to contain mentions that referred to specific scopes in which the entity occurs, or to different facets that the entity can assume. For example, some clusters referred to “Amsterdam” as a world capital, for which the typical features of the clusters (co-occurring names) were other large cities of the world, such as “Paris”, “New York” or “London”, while other clusters referred to “Amsterdam” as a city in the Netherlands, and had as typical features names of cities in the Netherlands.
In other cases, the clusters produced had features that apparently were not related to the entity, but that were in fact associated with specific contexts of the entity at stake. For example, since there are many scientific publishers based in Amsterdam, we found relatively large clusters whose typical features are names of publishers
(such as “Elsevier” or “Elsevier Science”), and other names related to scientific conferences and societies. There are many other similar examples, where the clusters refer to distinct possible facets of the entities, such as different geographic scopes or different times in history (“Paris” nowadays vs. during the French Revolution). Interestingly, most clusters corresponding to different and highly specialized facets of a dominant entity contained many more mentions than the “main” clusters of non-dominant entities (e.g., “Amsterdam” the novel, or “Paris” of Troy from Greek mythology). From a clustering point of view, the different, yet consistent, name co-occurrence patterns mean that dominant entities are seen as distinct “sub-entities”, leading to smaller clusters in both clustering stages. The resulting fragmentation effect only becomes obvious when one tries to disambiguate very large and heterogeneous data sets such as the web: as the size of the corpus increases, more facets of the same entity tend to emerge and make this fragmentation effect more visible. The key point is that, even if we had enough RAM and CPU resources to avoid the partitioning of mentions, fragmentation for these dominant entities would probably still occur. The problem arises from the features used to describe each mention, i.e., the set of co-occurring names, which does not carry sufficient information for merging the existing facets. Conceptually, this situation is close to the homonymy vs. polysemy problem [19], which is often encountered in word-sense disambiguation tasks. While homonyms have no related senses (“river bank” vs. “bank account”), polysemous words do share some relation (“the Stoic school” vs. “the school room”). In our case, different entities with the same name (“Amsterdam” the city vs. “Amsterdam” the novel) should be seen as homonymy, while the multiple “facets” found for the same entity can be seen as the multiple “senses” of a polysemous name (“Amsterdam” a world capital vs. “Amsterdam” a city in the Netherlands). Recently, some Named-Entity Recognition (NER) evaluation programs, such as ACE [20] and HAREM [21], have recognized the existence of inherently ambiguous situations, especially those that exhibit a more or less systematic pattern. For example, ACE introduced the notion of geo-political entities for entities such as countries, which contain a population, a government, a physical location, and a political existence, and which can thus be mentioned through several different facets. However, the large number of possible facets that we observed in our experiments, some quite specialized (e.g., “Amsterdam” as an important city in the field of scientific publishing), does not allow a simple and systematic identification of all relevant cases. Ideally we would want to merge all facets belonging to the same entity but still keep information about the distinct facets (whose meaning might be understandable at a later stage). What our results show is that name co-occurrence information is not sufficient for merging facets and that more specialized information is required. For instance, e-mail addresses or biographic features might help merge different facets of person entities, just as geographic information (geo-codes) might help in the case of locations. More generally, web link information might provide good clues for merging facets of arbitrary types of entities.
Mentions of the same name in highly connected parts of the web graph indicate that we are probably dealing with the same entity, even if the corresponding mentions have been placed in different clusters. All this additional information might be used in a third clustering stage to merge all possible facets (i.e., clusters) of the same entity.
8 Conclusion and Future Work

We have presented a wide-scope NED algorithm that is scalable and explicitly handles the power-law distribution of entities on the web, allowing us to cluster a billion mentions. We also presented a novel evaluation strategy that uses information extracted from Wikipedia to automatically generate a gold standard. Our experiments do not provide a complete solution to web-scale NED. Instead, they raise several fundamental questions (both theoretical and practical) that have so far been neglected by most approaches to NED. We showed that NED on the web involves dealing not only with obvious scaling issues, but also with less obvious and more fundamental problems related to the intrinsic variety of web data. As the data volume grows, new facets of entities become apparent, making NED a more complex and less clearly defined task. We showed that name co-occurrence information is not sufficient for merging distinct facets of the same entity. Future work will include investigating potential features, such as document links, e-mail addresses, and geo-codes, that can serve to merge different facets of entities.
Acknowledgements

This work was developed while Luís Sarmento was an engineering intern and Lyle Ungar was a visiting researcher at the Google offices in NYC. The authors would like to thank the Google team for all the help and support. Special thanks to Nemanja Petrovic for his work in developing the gold standard set and to Casey Whitelaw for his help in providing NER-annotated data.
References
1. Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: HLT-NAACL, pp. 9–16 (2004)
2. Malin, B.: Unsupervised name disambiguation via social network similarity. In: Workshop on Link Analysis, Counterterrorism, and Security in conjunction with the SIAM International Conference on Data Mining, pp. 93–102 (2005)
3. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Morristown, NJ, USA, pp. 33–40. Association for Computational Linguistics (2003)
4. Yates, A., Etzioni, O.: Unsupervised resolution of objects and relations on the web. In: Proceedings of NAACL HLT, Rochester, NY, April 2007, pp. 121–130 (2007)
5. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), pp. 9–16 (2006)
6. Cucerzan, S.: Large scale named entity disambiguation based on Wikipedia data. In: The EMNLP-CoNLL Joint Conference, June 2007, pp. 708–716 (2007)
7. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J.A., Zien, J.Y.: SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. In: WWW 2003: Proceedings of the 12th International Conference on World Wide Web, pp. 178–186. ACM, New York (2003)
8. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data 1(1), 5 (2007)
9. Gale, W.A., Church, K.W., Yarowsky, D.: One sense per discourse. In: HLT 1991: Proceedings of the Workshop on Speech and Natural Language, Morristown, NJ, USA, pp. 233–237. Association for Computational Linguistics (1992)
10. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proc. of 30th STOC, pp. 604–613 (1998)
11. Pantel, P., Lin, D.: Document clustering with committees. In: SIGIR 2002: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 199–206. ACM Press, New York (2002)
12. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O'Callaghan, L.: Clustering data streams: theory and practice. IEEE Transactions on Knowledge and Data Engineering 15(3), 515–528 (2003)
13. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI 2004: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, pp. 137–150 (2004)
14. Sarmento, L., Kehlenbeck, A., Oliveira, E., Ungar, L.: Efficient clustering of web-derived data sets. In: Perner, P. (ed.) MLDM 2009. LNCS (LNAI), vol. 5632, pp. 398–412. Springer, Heidelberg (2009)
15. Zhao, Y., Karypis, G.: Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota, Minneapolis (2001)
16. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Information Systems 17, 107–145 (2001)
17. Meilă, M.: Comparing clusterings - an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)
18. Whitelaw, C., Kehlenbeck, A., Petrovic, N., Ungar, L.: Web-scale named entity recognition. In: ACM 17th Conference on Information and Knowledge Management: CIKM 2008. ACM Press, New York (2008)
19. Krovetz, R.: Homonymy and polysemy in information retrieval. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997), pp. 72–79 (1997)
20. Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., Weischedel, R.: The Automatic Content Extraction (ACE) program - tasks, data, and evaluation. In: Proceedings of LREC 2004, pp. 837–840 (2004)
21. Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: an advanced NER evaluation contest for Portuguese. In: Calzolari, N., Choukri, K., Gangemi, A., Maegaard, B., Mariani, J., Odjik, J., Tapias, D. (eds.) Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, May 22-28, pp. 1986–1991. ELRA (2006)
A General Learning Method for Automatic Title Extraction from HTML Pages

Sahar Changuel, Nicolas Labroche, and Bernadette Bouchon-Meunier

Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6
104, Avenue du Président Kennedy, 75016, Paris, France
{Sahar.Changuel, Nicolas.Labroche, Bernadette.Bouchon-Meunier}@lip6.fr
Abstract. This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can induce better performance.
1 Introduction
With the rapid increase of information spreading on the Web, locating the relevant resources is becoming more and more difficult. One approach to make the Web more understandable to machines is the Semantic Web1, where resources are enriched with descriptive information called metadata. Metadata is commonly known as a kind of structured data about data that can describe the content, semantics and services of data [1], playing a central role in supporting resource description and discovery. Basic metadata about a document are: its title, its author, its publisher, its date of publication, its keywords and its description [2]. Although manual annotations are considered the main source of information for the Semantic Web, the majority of existing HTML pages are still poorly equipped with any kind of metadata. Hence automatic metadata extraction is an attractive alternative for building the Semantic Web.
1 http://www.w3.org/2001/sw/
The three main existing methods to generate metadata automatically are [3]:
- Deriving metadata: creating metadata based on system properties.
- Harvesting metadata: gathering existing metadata, e.g., META tags found in the header source code of an HTML resource.
- Extracting metadata: pulling metadata from the resource content; metadata extraction occurs when an algorithm automatically learns metadata from a resource's content. Automatic extraction may employ sophisticated indexing and classification algorithms to improve the metadata quality.
In this paper we focus on title extraction from HTML documents as part of a more global application of automatic metadata extraction from learning resources. Title is a field that we can find in most metadata standard schemas: in Dublin Core2, MARC3, MODS4, LOM-fr5, EAD6, etc. It is important to have the correct value of the title since it provides information on what a document is about, giving scalability and usability to the resource. We propose an automatic title extraction method based on supervised machine learning techniques such as Decision Trees and Random Forests. A well-known drawback of supervised machine learning methods is the manual annotation of the input data set. In this paper, we reuse the knowledge embedded in the header of the HTML source code in order to obtain labeled training data for title extraction with limited human effort. HTML pages can have their titles in text or in image format. In Figure 1, the title of page 'a)' is contained in an image, whereas the title of page 'b)' is in text format.
Fig. 1. Example of HTML page titles

2 Dublin Core Metadata Initiative, http://dublincore.org/
3 Machine Readable Cataloging, http://www.loc.gov/standards/
4 Metadata Object Description Schema, http://www.loc.gov/standards/mods/mods-schemas.html
5 Lom-fr, http://www.lom-fr.fr/
6 Encoded Archival Description, http://www.loc.gov/ead/
To extract text format titles, style information such as font size, position, and font weight is used as features. We also propose a new method to extract image format titles, which is a more challenging task since less information can be extracted from images; the 'alt' attribute is used to get the text content of each image in the labeling phase. The rest of the paper is structured as follows. In Section 2, we introduce related works on automatic metadata extraction. In Section 3, we explain the corpus creation and the feature extraction for the two methods of title extraction (text and image titles). Section 4 describes the classification algorithms we used and presents the obtained results. We make concluding remarks and highlight future research directions in Section 5.
2 Related Work
Several methods have been used for automatic metadata extraction; most of them follow one of two approaches: harvesting or machine learning. To extract metadata, and especially the title, from HTML documents, systems generally harvest information from the header part of the HTML source code; these include the Klarity and DC.dot [4] applications, which generate metadata automatically from the author-supplied Meta tags in HTML documents. We also cite the MetaCombine Project [5], which uses an intuitive approach to generate some basic metadata tags: it checks for a <title> tag in the page and, if present and non-empty, stores the title in the metadata; otherwise it considers the first 64 characters of the plain text as the title. A current limitation of the harvesting method is that the metadata elements are not always populated by resource creators or other persons. In our experiments, we found that 26.36% out of 2367 HTML pages contain correct values of the Meta <title> fields and that 51.42% of them have their titles in the first text line. In addition to harvesting existing information, machine learning technologies have been used for automatic metadata extraction. The authors in [6] proposed a method to conduct metadata extraction from the header part of scientific research papers. They formalized the problem as one of classification and employed Support Vector Machines as a classifier, using mainly linguistic features in the model. Fewer researchers have been interested in extracting the title separately as a metadata field. The paper in [7] describes a content-based and domain-independent method using Support Vector Machines for title extraction from Chinese scientific papers, using bigram-specific and sentence-specific features, whereas in [8] the authors proposed a machine learning approach to title extraction considering Microsoft Office documents as a case study and using formatting information such as font size as features in their models. In [9] the authors proposed a machine learning method for title extraction from HTML pages based on format information. Our method is different in that we use a domain-independent corpus composed of HTML pages labeled automatically without human effort. Moreover, to construct our feature vectors, we used
fewer attributes by selecting, experimentally, the most informative ones, and we applied Decision Tree and Random Forest algorithms to these features. Furthermore, we have proposed a new method for image format title extraction. Experimental results indicate that both methods can generate good results (in terms of F-measure) and perform well when applied to different data sets, which indicates that our method is domain independent. We also show that combining both methods can improve the results and yield higher title extraction performance.
3 Pre-processing
Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process [10]. To avoid human annotation, we reuse the title tag embedded in the header of the HTML source code in order to obtain labeled training data with limited manual effort. In this section we explain our corpus creation methodology.
3.1 Text Format Title Extraction
3.1.1 Corpus Creation
To create our corpus, the Meta <title> tag (M tag) is used from the HTML source code. Because this tag is generally absent or badly annotated, we need access to a large collection of HTML pages, which will be filtered depending on the correctness of the title labeling. Since the Web is one of the most important knowledge repositories, we decided to use it to construct our data set. Pages are harvested from the Web as follows:
- A list of queries is submitted to a search engine using the Argos library7.
- The system returns a list of web page URLs.
- The pages having correct M tag values are selected.
• Queries Submission
Within the context of our global application of automatic metadata extraction from learning resources, we are especially interested in extracting information from the education domain; thus, the words chosen in the queries belong to the education lexicon. Ex:
- english+courses+student
- history+geography+courses
- chemistry+courses
Queries are also constructed from French words so as to build a mixed corpus and to have a language-independent methodology. Ex:
7 https://argos.dev.java.net/
- anglais+exercices+licence
- chimie+cours
- physique+exercices+licence
The choice of the educational domain is in no way a restriction on our extraction approach; our method is conceived to be domain independent. Each query is submitted to the Web search engine and we get back a list of URLs as a result.
• HTML page parsing
For each result page, the corresponding M tag value is extracted. In order to analyze an HTML page for content extraction, it is first passed through an open source HTML syntax checker, Jtidy8, which corrects the markup, transforming the ill-formed HTML document into a well-formed XML one. The resulting document is then parsed using the Cobra toolkit9, which creates its Document Object Model Tree10 (DOM tree) representation. We consider the <body> HTML element as the root node of the tree. Our content extractor navigates the DOM tree and gets the text content from the leaf nodes. A set of filters is used to ignore some tags or some specific attributes within tags, such as links, scripts, drop-down menus, and many other elements of the page. Eliminating these elements avoids reading useless nodes. For each page, we consider only the first 20 text nodes of the DOM tree, assuming that the title is in the top part of the page. Each text is then compared to the M tag value using a similarity measure.
• Similarity measure
To compare a text "txt" with the M-tag value (m-txt) of a page, we calculate the proportion of words from txt that are in m-txt. We suppose that txt can be a title if this proportion is greater than a threshold that has been experimentally set to 0.7. More formally, let T be the set of words in txt and M the set of words in m-txt. Txt is considered similar to m-txt if the following equation is verified:

\frac{\sum_{t \in T} s(t, M)}{|T|} > 0.7    (1)

where s(t, M) = 1 if ∃ m ∈ M such that t = m, and 0 otherwise. Unlike [9], our method does not use the edit distance as a measure of similarity since, in addition to the title of the page, m-txt can contain other words describing the page, which would penalize the score of the real title if it exists. For an HTML page, the first text verifying the previous condition is considered the title of the page; hence, its URL and the corresponding M-tag value are saved in our corpus.
8 http://jtidy.sourceforge.net/
9 http://lobobrowser.org/cobra.jsp
10 http://www.w3.org/
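Equation (1) amounts to a simple word-overlap test; the following sketch (our own rendition, with the tokenization details as assumptions) shows how a candidate text can be checked against the M-tag value:

def title_similarity(txt, m_txt):
    """Proportion of words of txt that also appear in the M-tag text m_txt (Eq. 1)."""
    t_words = txt.lower().split()
    m_words = set(m_txt.lower().split())
    if not t_words:
        return 0.0
    return sum(1 for t in t_words if t in m_words) / len(t_words)

def is_candidate_title(txt, m_txt, threshold=0.7):
    return title_similarity(txt, m_txt) > threshold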
Fig. 2. Textual title corpus creation phases
This treatment is repeated for each page returned by each query, to finally obtain 624 correctly annotated pages out of 2367. This corpus is called the net-txt corpus. The different steps of the corpus creation are summarized in Figure 2. Another collection of documents is also used to construct a test data set in order to evaluate the classifier. The Eureka11 collection is used for this purpose. It is an online educational portal offering resources with well-annotated metadata in XML format respecting the LOM (Learning Object Metadata)12 schema. The URLs and the titles are extracted from these XML documents using XPath queries, and the labeling process then follows the schema of Figure 2. 424 correctly annotated pages are obtained out of 1633; we call this corpus the Eureka-txt corpus.
3.1.2 Features Extraction
In metadata extraction systems, the title is generally extracted from the following sections of the HTML document:
- The M-tag value.
- The first line of the document.
- H1 and H2 tags.
- H1 tags.
As mentioned previously, few pages contain correct M-tag values; in fact, only 26.36% of the pages resulting from the different queries contain the correct values, so considering only the M-tag information for title extraction is not reliable. The other methods were tested on our corpus and the results are summarized in Table 1. The columns list the title extraction method and the percentage of titles extracted with each method on the Net-txt corpus, on the Eureka-txt corpus, and on both corpora.
11 http://eureka.ntic.org/
12 http://www.lom-fr.fr/
710
S. Changuel, N. Labroche, and B. Bouchon-Meunier
The table shows that none of the methods listed is effective for extracting the titles from HTML pages; other properties should be considered.

Table 1. Performances of different methods for title extraction

Method            Net-txt   Eureka-txt   Both corpora
First line title  44.36%    58.48%       51.42%
H1-H2 tags        49.28%    35.95%       42.61%
H1 tags           36.64%    30.87%       33.75%
Our method is based on text style properties to construct the features. The Cobra java toolkit13 is used to extract this information; the toolkit is a CSS-aware HTML DOM parser and allows us to get the style of each text, i.e., the size, the alignment, the colour, etc. 36 features are constructed based on the following information:
- Current node information, such as letter capitalization, font size, font weight (bold, strong, emphasized), colour, alignment, H1 tag, etc.
- Format change with respect to the previous and the following text: family font change, size change, alignment change, ...
- The proportion of the format compared to the rest of the page: size, family font, colour, etc.
For each text node a feature vector is constructed. The last element of this vector is the class, i.e., whether the text is the title of the document or not, based on its similarity with the annotated title. More than one text can have a similarity measure greater than 0.7 in the same document; in that case, the text with the biggest font size is considered the title of the page. In order to avoid a significant gap between the number of instances assigned to the class “title” and those assigned to the class “non title”, only the first 20 instances are stored for each page. 8408 examples are obtained from the net-txt corpus (we call them the net-txt attributes) and 4858 examples are acquired from the Eureka corpus (called the Eureka-txt attributes).
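A feature vector of this kind could look like the following sketch; the feature names and the node / page-statistics dictionaries are hypothetical stand-ins of ours, not the 36 attributes actually used:

def style_features(node, page_stats):
    """Build a (partial) style feature vector for one text node."""
    size = node["font_size"]
    return {
        "font_size": size,
        "is_bold": int(node.get("bold", False)),
        "is_h1": int(node.get("tag") == "h1"),
        "all_caps": int(node["text"].isupper()),
        # format change with respect to the previous text node
        "size_change_prev": size - node.get("prev_font_size", size),
        # proportion of the page's text sharing this font size
        "size_proportion": page_stats["font_size_freq"].get(size, 0.0),
    }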
3.2 Image Format Title Extraction
As mentioned before, an HTML page title can be either text or an image. Extracting image format titles requires acquiring their text content. A well-known method for translating printed text into machine-editable text is Optical Character Recognition (OCR), but this is beyond the scope of our work. The alternative solution proposed in this paper focuses on images with a filled ‘alt’ attribute.
http://lobobrowser.org/cobra.jsp
The alt attribute is an essential part of Web accessibility, providing a text equivalent for every non-text element. For an image containing text in some specific font style, that text is used as the value of the alt attribute, as in the example given in the HTML 2.0 specification. In this paper, the alt attribute is used to extract image titles from HTML pages. The issue is that this attribute is rarely filled, which makes the task a challenging problem.

3.2.1 Corpus Creation
We need a set of HTML pages having image titles with filled alt attributes. To create our corpus, the previous method based on querying the Web is adopted, but instead of getting text nodes from each page, we get the alt attribute values from image nodes. Each one is compared with the M-tag value using the similarity measure defined in equation 1. Few pages satisfying the requirement are obtained (148 out of the 7471 pages resulting from the queries); the reason is that people seldom specify the alt attribute when adding images to HTML pages. We call this corpus the net-img corpus. The same method is adopted to get 51 pages from the Eureka corpus, which we call the Eureka-img corpus.

3.2.2 Features Extraction
For image format titles, the style properties of the text contained in images cannot be obtained, so other information must be used for feature construction. In our method, we extract spatial features based on the following properties:
- Image alignment.
- Image height and width.
- Image surface.
- Image position in the page.
- H1 tag information.
13 features representing each image are constructed. The last element is the class of the feature vector: whether the image is the title or not; the alt text is compared with the page title using equation 1. Feature extraction is done on both corpora, net-img and Eureka-img, yielding 743 instances from the first and 214 from the second.
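As a rough illustration of collecting image nodes with filled alt attributes, the following sketch uses the Python standard library rather than the authors' toolkit; the attribute names are illustrative and only approximate the 13 features described above.

from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collects <img> nodes that have a filled alt attribute, together with
    simple geometric attributes that can feed spatial features."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        if a.get("alt"):                        # only images with a filled alt are usable
            width = int(a.get("width") or 0)
            height = int(a.get("height") or 0)
            self.images.append({"alt": a["alt"], "width": width,
                                "height": height, "surface": width * height})

collector = ImgCollector()
collector.feed('<img src="banner.png" alt="Course title" width="400" height="80">')
print(collector.images)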
4 Algorithms and Evaluations

4.1 Algorithms
A supervised learning technique is used to extract titles from HTML pages.
http://www.w3.org/TR/WAIWEBCONTENT/ http://www.w3.org/MarkUp/htmlspec/htmlspec toc.html
Let {(x1, y1), ..., (xn, yn)} be a two-class training data set, where xi is a training feature vector and yi its label (1 for the class ‘title’ and -1 for the class ‘non title’). Experiments are conducted using two classifiers, Decision Tree and Random Forest, and the results are compared for each corpus.

• Decision Tree algorithm
We use the C4.5 algorithm implemented in Weka, since in addition to nominal attributes it can handle numeric attributes and noisy data. The Decision Tree algorithm works top down, seeking at each stage the attribute to split on that best separates the classes, and then recursively processing the subproblems that result from the split. The information measure used to evaluate candidate splits is the entropy, which characterizes the (im)purity of an arbitrary collection of instances. A set of rules is generated, one rule for each leaf: the antecedent of the rule includes a condition for every node on the path from the root to that leaf, and the consequent of the rule is the class assigned by the leaf. Rules derived from trees are pruned to remove redundant tests; C4.5 adopts a post-pruning strategy, building the complete tree and pruning it afterwards [10]. In our experiments, the confidence factor used for pruning is 25%.

• Random Forest algorithm
The Random Forest algorithm was developed by Leo Breiman [11] to operate quickly over large datasets. More importantly, it achieves diversity by using random samples to build each tree in the forest and combining the classifiers’ predictions by voting. In the Random Tree method, a tree is constructed as follows [12]:
1. Instances are randomly sampled, with replacement, from the original data set to create a new one of the same size to be used for tree construction (the in-bag data).
2. Choose a random number of attributes k from the in-bag data and select the one with the most information gain to form each node.
3. Continue to work down the tree until no more nodes can be created due to information loss.
4. Combine the trees by having them vote on each test instance. If one class receives more votes than any other, it is taken as the correct one.
We denote by N the total number of attributes. If not specified, k is equal to the first integer less than log2(N) + 1. We use the Random Forest algorithm implemented in Weka, which is based on the C4.5 Decision Tree algorithm; 10 is chosen as the number of Decision Tree classifiers. Using the Random Forest algorithm should make the outcome of classification more stable by taking advantage of the complementarity of the different trees.
http://www.cs.waikato.ac.nz/ml/weka/
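As a hedged illustration of the set-up just described, the following Python/scikit-learn sketch builds a forest of 10 trees in which each split considers k = int(log2(N)) + 1 randomly chosen attributes. This is not the authors' implementation, which uses Weka's C4.5-based trees.

import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def build_classifiers(n_attributes, n_trees=10):
    # Default number of candidate attributes per split, as in the text above.
    k = int(math.log2(n_attributes)) + 1
    forest = RandomForestClassifier(n_estimators=n_trees, max_features=k)
    tree = DecisionTreeClassifier()          # CART here, C4.5 in Weka
    return tree, forest

# Example: 36 text features as in the net-txt attributes -> k = 6.
tree, forest = build_classifiers(36)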
4.2 Evaluations
This section provides empirical results to evaluate the two title extraction methods described previously. Performance is evaluated by precision and recall, defined as follows:
- A: the number of items correctly labeled as belonging to the class ‘title’.
- B: the number of items incorrectly labeled as belonging to the class ‘title’.
- C: the number of items not labeled as belonging to the class ‘title’ but which should have been.
Precision measures the number of correctly identified items as a percentage of the number of items identified: Precision = A / (A + B). Recall measures the number of correctly identified items as a percentage of the total number of correct items: Recall = A / (A + C). Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness. Both are combined into a single measure, the F1-measure, which is the weighted harmonic mean of precision and recall:

F1-measure = (2 * Precision * Recall) / (Precision + Recall)
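A trivial Python sketch of the three measures, with A, B and C as defined above:

def precision_recall_f1(A, B, C):
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1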
The F-measure exhibits the desirable property of being highest when both recall and precision are high.

4.2.1 The Text Title Extraction Method
For text title extraction, both Decision Tree and Random Forest classifiers are used in our experiments. Initially, a 10-fold cross validation is conducted on the net-txt attributes and on the Eureka-txt attributes separately. Let k be the number of attributes used in the random selection of the Random Forest classifier; by default k = int(log2(36)) + 1 = 6. The classifier is first tested with the default value of k, then tests are made with different values of k, and we report the best results obtained. Table 2 summarizes the results. Table 2 shows that the performance of title extraction on the Eureka-txt corpus is better than on the net-txt corpus in terms of precision, recall and F-measure. This can be explained by the fact that pages of the Eureka-txt corpus belong to the same collection and share common patterns, whereas the net-txt corpus contains domain-independent pages belonging to different collections and sharing fewer stylistic patterns. The results of the Random Forest models are better than those of the Decision Tree model. With the Random Forest algorithm the combination of trees provides higher performance, and predictions become more reliable as more votes are taken into account. The default number of features does not give the best results when using the Random Forest classifier; increasing k improves the performance of the model even if it leads to a higher correlation between the trees.
Table 2. Cross validation results of the text title extraction method

Corpus      Method               Precision  Recall  F1-Measure
Net-txt     Decision tree        0.808      0.619   0.701
Net-txt     Random Forest k=6    0.828      0.72    0.770
Net-txt     Random Forest k=13   0.847      0.729   0.784
Eureka-txt  Decision tree        0.890      0.840   0.864
Eureka-txt  Random Forest k=6    0.896      0.873   0.884
Eureka-txt  Random Forest k=13   0.903      0.882   0.893
Table 3. Domain adaptation of the text title extraction method

Corpus (train/test)  Method               Precision  Recall  F1-Measure
Net-txt/Eureka-txt   Decision Tree        0.834      0.748   0.789
Net-txt/Eureka-txt   Random Forest k=6    0.830      0.748   0.787
Net-txt/Eureka-txt   Random Forest k=13   0.810      0.785   0.798
Eureka-txt/Net-txt   Decision Tree        0.722      0.554   0.627
Eureka-txt/Net-txt   Random Forest k=6    0.661      0.575   0.615
Eureka-txt/Net-txt   Random Forest k=13   0.696      0.601   0.645
We remark that our method is language independent, since the net-txt corpus contains both French and English pages. To test the domain adaptation of our method, the model is first trained on the net-txt corpus data and tested on the Eureka-txt data set; then the training and test data sets are inverted. Results are summarized in table 3. The results of the Net-txt/Eureka-txt experiment show that our extraction model is domain independent: the net-txt corpus contains heterogeneous pages from different domains, making the model sufficiently general. When the two data sets are swapped, the results decrease in terms of precision, recall and F-measure, because pages from the Eureka-txt corpus share some common patterns.

4.2.2 The Image Title Extraction Method
10-fold cross validations are conducted using the Decision Tree and Random Forest classifiers. With the latter, the default number of attributes is used in the random selection, i.e., k = int(log2(13)) + 1 = 4. Table 4 shows the results obtained when using the 743 attributes generated from the net-img corpus. Both classifiers give good results and make effective use of the image information for title extraction. The Random Forest classifier is also evaluated using the 214 attributes from the Eureka pages as a test set and the net-img corpus attributes for training. Results are summarized in table 5. The results obtained with the Random Forest classifier are slightly better than those obtained with the Decision Tree algorithm when using different data sets for training and for testing.
Table 4. Cross validation results of the image title extraction method

Method         Precision  Recall  F-Measure
Decision Tree  0.865      0.779   0.820
Random Forest  0.870      0.820   0.844

Table 5. Domain adaptation of the image title extraction method

Method         Precision  Recall  F-Measure
Decision Tree  0.705      0.705   0.705
Random Forest  0.833      0.795   0.814
This confirms that our method is domain independent. The results above show that our method can still achieve relatively high performance when titles are in image format. The technique employed in this paper, though simple, is quite effective.

4.2.3 Text versus Image Format Title Extraction
Further investigation is done to analyze the relation between the text title extraction method and the image title extraction method; our aim is to test to what extent each method improves the other. To test both methods on the same corpus, we need a corpus containing pages with both formats of titles, image and text. This corpus is obtained by filtering the net-img and the Eureka-img corpora. Our aim is to apply the text title extraction method to this corpus on the one hand, and the image title extraction method on the other hand. Hence, image attributes as well as text attributes are extracted from these pages; we call the text attributes txt-img attributes. Experiments are conducted on these features using the Decision Tree classifier. We did not use Random Forest in this experiment since our data set is small. First, the model trained with the net-txt attributes is applied to the txt-img attributes; we call this experiment Txt-Exp. Second, since our data set is not large, a 10-fold cross validation is conducted with the image attributes; this experiment is called Img-Exp. The results obtained are summarized in table 6. Txt-Exp gives weaker results than Img-Exp; this is due to the fact that the pages of the data set have essentially image titles and the annotated text titles are sometimes ambiguous and can be confused with the rest of the text. When further analyzing the results, we notice that the text method can sometimes extract the title of a page where the image method could not extract the image title, and vice versa. This leads us to suppose that combining both methods can significantly improve the results. Figures 3 and 4 show this more clearly.
http://www.clg-armand.acaixmarseille.fr/spip/ http://www.polarfle.com/
Table 6. Performance of text vs. image title extraction methods

Experiment  Precision  Recall  F-Measure
Txt-Exp     0.742      0.529   0.617
Img-Exp     0.833      0.68    0.749
Fig. 3. Text versus image format title extraction
Fig. 4. Text versus image format title extraction
In figure 3, the image title was not found by the image classifier, whereas the text title was extracted by the text classifier. In figure 4, we notice the opposite: the image title was found and the text title was not. Combining both methods could lead to higher performance; to test this, we would need a larger corpus with many HTML pages whose images have filled alt attributes. However, as mentioned before, this attribute is seldom used
by HTML page creators, and there is still a lack of awareness of its importance, especially for accessibility purposes. An alternative solution would be to use an optical character recognition method.
5 Conclusions
In the context of automatic metadata extraction, we note that the title is a field which is present in all metadata schemas and widely used in search engine applications. This paper describes two techniques for extracting titles from HTML pages based on machine learning methods. The first approach extracts text format titles based on style properties, and the second extracts image format titles using the alt attributes of <img> tags. A method for corpus creation was proposed. It is based on extracting pages by querying the Web, which guarantees that the obtained pages have diverse patterns, different languages, and deal with distinct subjects. This paper shows that, on two data sets, our methods perform well in most cases. It also shows that combining text and image extraction methods can lead to better results, and we suggested using OCR techniques to expand the data set of image-titled pages. Future directions include discovering other metadata fields from HTML pages so as to enrich resources and make them more accessible.
References
1. Liu, L., He, G., Shi, X., Song, H.: Metadata extraction based on mutual information in digital libraries. In: First IEEE International Symposium on Information Technologies and Applications in Education, ISITAE 2007 (2007)
2. Noufal, P.P.: Metadata: Automatic generation and extraction. In: 7th Manlibnet Annual National Convention on Digital Libraries in Knowledge Management: Opportunities for Management Libraries, Indian Institute of Management Kozhikode (2005)
3. Greenberg, J., Spurgin, K., Crystal, A.: Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions. Int. J. Metadata Semant. Ontologies 1, 3–20 (2006)
4. Greenberg, J.: Metadata extraction and harvesting: A comparison of two automatic metadata generation applications. Journal of Internet Cataloging 6, 59–82 (2004)
5. Krowne, A., Skinner, K., Halbert, M., Ingram, S., Gadi, U., Pathak, S.: Metacombine project interim report. Technical report, Emory University (2006)
6. Han, H., Giles, C.L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.A.: Automatic document metadata extraction using support vector machines. In: Joint Conference on Digital Libraries 2003, Proceedings, pp. 37–48 (2003)
7. Zhang, Z., Sun, M., Liu, S.: Automatic content-based title extraction for Chinese documents using support vector machine. In: Proceedings of the 2005 IEEE International Conference (2005)
8. Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D., Zheng, Q.: Automatic extraction of titles from general documents using machine learning. Inf. Process. Manage. 42, 1276–1293 (2006)
9. Hu, Y., Xin, G., Song, R., Hu, G., Shi, S., Cao, Y., Li, H.: Title extraction from bodies of HTML documents and its application to web page retrieval. In: SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 250–257. ACM, New York (2005)
10. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)
11. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
12. Pater, N.: Enhancing random forest implementation in Weka. In: Learning Conference Paper for ECE591Q (2005)
Regional Pattern Discovery in Geo-referenced Datasets Using PCA

Oner Ulvi Celepcikay, Christoph F. Eick, and Carlos Ordonez

University of Houston, Department of Computer Science, Houston, TX 77204-3010
{onerulvi, ceick, Ordonez}@cs.uh.edu
Abstract. Existing data mining techniques mostly focus on finding global patterns and lack the ability to systematically discover regional patterns. Most relationships in spatial datasets are regional; therefore there is a great need to extract regional knowledge from spatial datasets. This paper proposes a novel framework to discover interesting regions characterized by “strong regional correlation relationships” between attributes, and methods to analyze differences and similarities between regions. The framework employs a two-phase approach: it first discovers regions by employing clustering algorithms that maximize a PCA-based fitness function and then applies post processing techniques to explain underlying regional structures and correlation patterns. Additionally, a new similarity measure that assesses the structural similarity of regions based on correlation sets is introduced. We evaluate our framework in a case study which centers on finding correlations between arsenic pollution and other factors in water wells and demonstrate that our framework effectively identifies regional correlation patterns. Keywords: Spatial Data Mining, Correlation Patterns, Regional Knowledge Discovery, Clustering, PCA.
1 Introduction

Advances in database and data acquisition technologies have resulted in an immense amount of geo-referenced data, much of which cannot be adequately explored using current methodologies. The goal of spatial data mining is to automate the extraction of interesting and useful patterns that are not explicitly represented in geo-referenced datasets. Of particular interest to scientists are techniques capable of finding scientifically meaningful regions and representing their associated patterns in spatial datasets, as such techniques have many immediate applications in medicine, geosciences, and environmental sciences, such as the association of particular cancers with environmental pollution of sub-regions, the detection of crime zones with unusual activities, and the identification of earthquake hotspots. Since most relationships in spatial datasets are geographically regional [15], there is a great need to discover regional knowledge in spatial datasets. Existing spatial data mining techniques mostly focus on finding global patterns and lack the ability to systematically discover
regional patterns. For example, a strong correlation between a fatal disease and a set of chemical concentrations in water wells might not be detectable throughout Texas, but such a correlation pattern might exist regionally, which is a reflection of Simpson's paradox [16]. This type of regional knowledge is crucial for domain experts who seek to understand the causes of such diseases and predict future cases. Another issue is that regional patterns have a scope that—because they are not global—is a subspace of the spatial space. This fact complicates their discovery because both subspaces and patterns have to be searched. Work by Celik et al. [4] assumes the presence of an a priori given regional structure (e.g., a grid) and then searches for regional patterns. One unique characteristic of the framework presented in this paper is that it searches for interesting subspaces by maximizing a plug-in reward-based interestingness function and then extracts regional knowledge from the obtained subspaces. This paper focuses on discovering regional correlation patterns that are associated with contiguous areas in the spatial subspaces, which we call regions. Interesting regions are identified by running a clustering algorithm that maximizes a PCA-based fitness function. PCA is used to guide the search for regions with strong structural relationships. Figure 1 shows an example of discovered regions along with their highest correlated attribute sets (HCAS). For example, in Region 1 a positive correlation between Boron (B), Fluoride (F), and Chloride (Cl), and between Arsenic (As), Vanadium (V), and Silica (SiO2), as well as a negative correlation between Silica (SiO2) and Molybdenum (Mo), can be observed. As can be seen in Figure 1, some of these sets differ quite significantly between regions, emphasizing the need for regional knowledge discovery. We also introduce a new similarity measure to estimate the structural similarity between regions based on the correlation sets associated with particular regions. This measure is generic and can be used in other contexts when two sets of principal components have to be compared. The main contributions of the paper are:
1. A framework to discover interesting regions and their regional correlation patterns.
Fig. 1. An Example of Regional Correlation Patterns for Chemical Concentrations in Texas
2. A PCA-based fitness function to guide the search for regions with well-defined PCs.
3. A generic similarity measure to assess the similarity between regions quantitatively.
4. An experimental evaluation of the framework in a case study that centers on identifying causes of arsenic contamination in Texas water wells.
The remainder of the paper is organized as follows. In section 2, we discuss related work. In section 3, we provide a detailed discussion of our region discovery framework, the PCA-based fitness function, and the HCAS similarity measure. Section 4 presents the experimental evaluation and section 5 concludes the paper.
2 Related Work

Principal Component Analysis (PCA). PCA is a multivariate statistical analysis method that is very commonly used to discover highly correlated attributes and to reduce dimensionality. The idea is to identify k principal components for a d-dimensional dataset (k < d).
3 Methodology

We now present the methods our regional pattern discovery framework utilizes during the region discovery and post-processing phases, as illustrated in Figure 2.
Fig. 2. Regional Pattern Discovery Framework
3.1 Region Discovery Framework

We employ the region discovery framework that was proposed in [10, 11]. The objective of region discovery is to find interesting places in spatial datasets—regions occupying contiguous areas in the spatial subspace. In this work, we extend this framework to find regional correlation patterns. The framework employs a reward-based evaluation scheme to evaluate the quality of the discovered regions. Given a set of regions R = {r1, ..., rk} with respect to a spatial dataset O = {o1, ..., on}, the fitness of R is defined as the sum of the rewards obtained from each region rj (j = 1, ..., k):

q(R) = Σ_{j=1..k} i(r_j) · size(r_j)^β .    (1)
where i(rj) is the interestingness of the region rj—a quantity based on domain interest that reflects the degree to which the region is “newsworthy”. The framework seeks a set of regions R such that the sum of rewards over all of its constituent regions is maximized. In general, the parameter β controls how much premium is put on region size. The size(rj)^β component in q(R) (β ≥ 1) increases the value of the fitness non-linearly with respect to the number of objects in the region rj. A region's reward is proportional to its interestingness, but given two regions with the same interestingness, the larger region receives a higher reward, reflecting a preference for larger regions. Rewarding region size non-linearly encourages the merging of neighboring regions whose PCs are structurally similar.

The CLEVER Algorithm. We employ the CLEVER [10] clustering algorithm to find interesting regions in the experimental evaluation. CLEVER is a representative-based clustering algorithm that forms clusters by assigning objects to the closest cluster representative. The algorithm starts with a randomly created set of representatives and employs randomized hill climbing, sampling s neighbors of the current clustering solution as long as new clustering solutions improve the fitness value. To battle premature convergence, the algorithm employs re-sampling: if none of the s neighbors improves the fitness value, then t more solutions are sampled before the algorithm terminates. In short, CLEVER searches for the optimal set of regions, maximizing a given plug-in fitness function q(R), which in our case is the PCA-based fitness function.
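A minimal sketch (Python, illustrative only, not the authors' implementation) of the reward-based fitness of equation (1); `interestingness` is any plug-in measure i(r):

def region_fitness(regions, interestingness, beta=1.7):
    """regions: list of regions, each a collection of objects;
    interestingness: plug-in measure i(r); beta: premium on region size."""
    return sum(interestingness(r) * len(r) ** beta for r in regions)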
3.2 PCA-Based Fitness Function for Region Discovery

The directions identified by PCA are the eigenvectors of the correlation matrix. Each eigenvector has an associated eigenvalue that measures the corresponding variance, and the PCs are ordered by the variance associated with each component in descending order. Ideally, it is desirable to have high eigenvalues for the first k PCs, since this means that a smaller number of PCs is adequate to account for the threshold variance, which suggests that a strong correlation among variables exists [14]. Our work employs the interestingness measure in Definition 1 to assess the strength of relationships between attributes in a region r:

Definition 1. (PCA-based Interestingness – iPCA(r)) Let λ1, λ2, ..., λk be the eigenvalues of the first k PCs, with k being a parameter:

i_PCA(r) = (λ_1^2 + ... + λ_k^2) / k .    (2)

The PCA-based fitness function then becomes:

q_PCA(R) = Σ_{j=1..k} i_PCA(r_j) · size(r_j)^β .    (3)
The fitness function rewards high eigenvalues for the first k PCs. By taking the square of each eigenvalue, we ensure that regions with a higher spread in their eigenvalues obtain higher rewards—reflecting the higher importance assigned in PCA to higher-ranked principal components. For example, a region with eigenvalues {6, 2, 1, ...} will get a higher reward than a region with eigenvalues {4, 3, 2, ...}, even though the total variance captured in both cases is about the same. We developed a generic pre-processing technique to select the best k value for the PCA-based fitness function for a given dataset, based on a variance threshold: the smallest k is chosen so that the variance captured in the first k principal components is greater than this threshold. First, the algorithm applies PCA to the global data and determines the global k value (kg) for a given variance threshold, which serves as an upper bound for k. Then, it splits the spatial data into grids (random square regions), applies PCA to each grid, and determines the k value for each region based on the variance threshold, obtaining {kr1, ..., krs}. The algorithm next selects the most frequent kr value in the set of regional k-values as the final result, to be used in the fitness function. For datasets with strong regional patterns, the chosen k is expected to be lower than kg: fewer PCs capture the same variance in the regional data, because regional correlation is stronger than global correlation. Our fitness function repeatedly applies PCA during the search for the optimal set of regions, maximizing the eigenvalues of the first k PCs in each region. Having an externally plugged-in PCA-based fitness function enables the clustering algorithm to probe for the optimal partitioning and encourages the merging of two regions that exhibit structural similarities. This approach is also more advantageous than applying PCA once or a few times on the data, since the PCA-based fitness function is applied repeatedly to candidate regions to explore each possible region combination.
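A sketch in Python/NumPy (again illustrative rather than the authors' code) of the PCA-based interestingness of equation (2), the fitness of equation (3), and the variance-threshold rule used to pick k:

import numpy as np

def i_pca(region_data, k):
    """region_data: (n_objects, d) array of a region's non-spatial attributes."""
    corr = np.corrcoef(region_data, rowvar=False)      # d x d correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, descending
    return float(np.sum(eigvals[:k] ** 2) / k)         # equation (2)

def q_pca(regions, k, beta=1.7):
    """PCA-based fitness of equation (3) over a set of regions."""
    return sum(i_pca(r, k) * len(r) ** beta for r in regions)

def k_for_threshold(eigvals, threshold=0.7):
    """Smallest k whose first k PCs capture at least `threshold` of the variance."""
    frac = np.cumsum(np.sort(eigvals)[::-1]) / np.sum(eigvals)
    return int(np.searchsorted(frac, threshold) + 1)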
3.3 Correlation Sets, HCAS and Region Similarity

Highest correlated attribute sets (HCAS) are sets of correlation sets (CSs), which are signed sets of attributes that are highly correlated. CSs are constructed from the eigenvectors of principal components (PCs). An attribute is added to the correlation set of a PC, together with the sign of its coefficient, if the absolute value of the PC coefficient of that attribute is above a threshold α. The threshold α is selected based on input from domain experts. For example, assuming α = 0.33 for PC1 in Table 1, the correlation set {Mo-, Cl+, SiO4+} is constructed, since only the absolute values of these attributes' coefficients are above α (depicted in bold in the table). In this set, Mo is negatively correlated with both Cl and SiO4, whereas Cl and SiO4 are positively correlated.

Table 1. Eigenvectors of the first k PCs (k=3)

Variable   PC1      PC2      PC3
As         -0.323   -0.452   -0.34
Mo         -0.346    0.062    0.46
V          -0.301   -0.463   -0.243
B           0.138   -0.247   -0.177
F           0.325    0.177    0.356
SiO2       -0.165   -0.389   -0.299
Cl          0.394   -0.295    0.148
SiO4        0.363   -0.277    0.194
TDS        -0.243   -0.323    0.173
WD          0.097   -0.233   -0.226
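As an illustration of this construction rule (a Python sketch; the attribute names and coefficients are taken from PC1 of Table 1):

def correlation_set(attribute_names, pc_coefficients, alpha=0.33):
    """Signed attributes whose absolute PC coefficient exceeds alpha."""
    return {name + ("+" if c > 0 else "-")
            for name, c in zip(attribute_names, pc_coefficients)
            if abs(c) > alpha}

names = ["As", "Mo", "V", "B", "F", "SiO2", "Cl", "SiO4", "TDS", "WD"]
pc1 = [-0.323, -0.346, -0.301, 0.138, 0.325, -0.165, 0.394, 0.363, -0.243, 0.097]
print(correlation_set(names, pc1))   # {'Mo-', 'Cl+', 'SiO4+'}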
Next, CS and HCAS are defined formally, and similarity measures for correlation sets and regions are introduced.

Definition 2. (Correlation Set – CS) A CS is a set of signed attributes that captures correlation patterns.

Definition 3. (Highest Correlated Attribute Sets – HCAS) HCAS are sets of correlation sets and are used to summarize the correlation relationships of regions. Each region has a HCAS of cardinality k, since our framework retains only k PCs. Each CS is associated with a single PC (principal component).

HCAS are constructed for each region to summarize their regional correlation patterns and are used to describe and compare the correlation patterns of regions.

Example: For the PCA result in Table 1, the following HCAS is generated: {{Mo-, Cl+, SiO4+}, {As-, V-, SiO2-}, {As-, Mo+, F+}}

Next, we define operations to manipulate CSs.
Definition 4. (Operations on CSs) Two operations are defined on CSs:
1. csign (“complement sign”) changes the signs of a CS, e.g., csign({A+, B-, C-}) = {A-, B+, C+}.
2. uns (“unsign”) removes the signs of the attributes in a CS, e.g., uns({A+, B-, C-}) = {A, B, C}.

Definition 5. (Correlation Set Similarity – simCS) The similarity between two correlation sets, CSi and CSj, is estimated using the following equation:

simCS(CS_i, CS_j) = max{ |CS_i ∩ CS_j| , |CS_i ∩ csign(CS_j)| } / | uns(CS_i) ∪ uns(CS_j) | .    (4)
simCS(CSi, CSj) is assessed by comparing CSi with CSj and CSi with csign(CSj), taking the larger of the two intersection sizes, and dividing it by the number of attributes in the union of the unsigned CSi and CSj. Basically, simCS(CSi, CSj) takes two factors into consideration when comparing two CSs:
1. Agreement with respect to the attributes that contribute to variance.
2. Agreement in correlation with respect to the common attributes.
Examples:
a. simCS({A-, B-}, {A+, B+}) = 1
b. simCS({A+, B+}, {A+, B-}) = 0.5
c. simCS({A+, B+, C+, D+, E+}, {A+, B+, C-, D-, E-}) = 0.6
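A Python sketch of the two operations and of equation (4), using signed attribute strings such as "A+" and "B-"; it reproduces the three examples above:

def csign(cs):
    """Complement the signs of a correlation set."""
    return {a[:-1] + ("-" if a.endswith("+") else "+") for a in cs}

def uns(cs):
    """Remove the signs of a correlation set."""
    return {a[:-1] for a in cs}

def sim_cs(cs_i, cs_j):
    agree = len(cs_i & cs_j)          # same attributes with the same signs
    anti = len(cs_i & csign(cs_j))    # same attributes with all signs flipped
    return max(agree, anti) / len(uns(cs_i) | uns(cs_j))

print(sim_cs({"A-", "B-"}, {"A+", "B+"}))                                      # 1.0
print(sim_cs({"A+", "B+"}, {"A+", "B-"}))                                      # 0.5
print(sim_cs({"A+", "B+", "C+", "D+", "E+"}, {"A+", "B+", "C-", "D-", "E-"}))  # 0.6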
PCS(i,j)=simCS(CSi, CS’j)*δI,j.
where
δi, j =
(5)
λ i + λ' j k
k
∑λ + ∑λ i =1
i
j =1
' j
We use δij to weigh in the contribution of a correlation set of a PC to the overall similarity based on its eigenvalue. If the eigenvalue is high, then its contribution when assessing similarity between regions is higher compared to other sets with lower eigenvalues. Next, using PCS, we introduce the regional similarity measure.
726
O.U. Celepcikay, C.F. Eick, and C. Ordonez
Definition 7. (Regional Similarity – SimR) Let perm(k) be the set of all permutations of numbers 1,..k, the similarity between two regions R1 and R2 is defined as follows: k
simR (R1 ,R 2 )= max y=(y1 ,....yk )∈perm(k) ∑ PCS (i, yi )
(6)
i =1
This similarity function computes an injective mapping from k principal components of region R1 to k principal components of region R2 which maximizes correlation set similarity weighted by the eigenvalues of the associated principal components. After the best injective mapping ψ has been determined, similarity is computed by adding the similarities of principal component i of region R1 with ψ(i) in region R2 for i=1,…,k. Basically, simR( ) finds the best one-to-one mapping that aligns the principal components of the two regions to provide the best match with respect to the similarity. It should be noted that k is usually very small; typically 2-6, rarely larger than 10; therefore, maximizing similarity over all permutations usually can be done quickly. For larger k values, some greedy, approximate versions of the similarity function can be developed. Example: Let us assume that the PCS of two regions (R1 and R2) is as follows: (k=3 and S1, S2, S3 belongs to region R1 and T1, T2, T3 belongs to region R2): PCS S1 S2 S3
T1 0.4 0.0 0.1
T2 0.1 0.1 0.2
T3 0.1 0.3 0.0
Since k is 3, there will be 6 one-to-one mappings between the principal components of the two regions. The similarity calculations that will be conducted to determine the similarity of R1 and R2 are shown below: R1
Mappings
R2
Calculations
In this case, the 2nd mapping {S1ÆT1, S2ÆT3, S3ÆT2} maximizes the sum of similarities. So, we obtain SimR(R1, R2) = 0.9
Regional Pattern Discovery in Geo-referenced Datasets Using PCA
727
3.4 Post Processing via Regression Analysis We additionally employ regression analysis models in the post processing phase to analyze regional dissimilarities. We use the OLS (Ordinary Least Squares) regression to investigate the impact of our independent variables on the dependent variable (e.g. arsenic concentration in arsenic experiments). OLS was chosen because it minimizes the mean squared error; thus, it is the best liner efficient estimator [19]. Our framework first applies regression analysis on global data (global regression); then, after it discovers regions, it retrieves the top k regions ranked by their interestingness and applies regression analysis on those regions (regional regression). The results of regional regression are compared with the results of global regression to reveal regional differences
4 Experimental Evaluation 4.1 A Real World Case Study: Texas Water Wells Arsenic Project Arsenic is a deadly poison and even long-term exposure to very low arsenic concentration can cause cancer [17]. So it is extremely crucial to understand the factors that cause high arsenic concentrations to occur. In particular, we are interested in identifying other attributes that contribute significantly to the variance of arsenic concentration. Datasets used in the experiments were created using the Texas Water Department Ground Water Database [17] that samples Texas water wells regularly. The datasets were generated by cleaning out duplicate, missing and inconsistent variables and aggregating the arsenic amount when multiple samples exist. Our dataset has 3 spatial and 10 non-spatial attributes. Longitude, Latitude and Aqufier ID are the spatial attributes and Arsenic(As), Molybdenum(M), Vanadium(V), Boron(B), Fluoride(F), Silica(SiO2), Chloride(Cl), Sulfate(SiO4) are 8 of the non-spatial attributes which are chemical concentrations. The other 2 non-spatial attributes are Total Dissolved Solids (TDS) and Well Depth (WD). The dataset has 1,653 objects. 4.2 Experimental Parameters Table 2 summarizes the common parameters used in all experiments and the ones specific to the individual experiments. These parameter values were chosen after many initial experiments as the parameter settings that provide the most interesting results. β is a parameter of the region discovery framework which controls the size of the regions to be discovered. s and t are the parameters of CLEVER algorithm. min_regon_size is a controlling parameter to battle the tendency towards having very small size regions with maximal variance. Regions with size below this parameter receive a reward of zero.
728
O.U. Celepcikay, C.F. Eick, and C. Ordonez Table 2. The parameters used in the experiments
Common parameters Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5
s=50, t=50, α=0.33 min_region_size = 8, β= 1.7 min_region_size = 9, β= 1.6 min_region_size = 20, β= 1.7 min_region_size = 16, β= 1.6 min_region_size = 16, β= 1.01
In our pre-processing phase to select the best k value for the experiment, we use 70% variance as the threshold, a percentage based on the comments from domain experts who maintain that this is a good threshold for detecting correlations among chemical concentrations. Other feedback from domain experts for the Water Pollution Experiment suggested that the arsenic dataset is not globally highly correlated; hence, setting the variance threshold to 70% is a good fit. The preprocessing phase, using 70% as the threshold, indicated that 3 or 4 are the best values for k. We report the results for k=3 in this section. The threshold used in constructing correlation sets of HCAS was chosen in accordance with domain experts’ feedback as 0.33. 4.3. HCAS and Similarity Results HCASs for the experiment 1 are shown in Table 3 which lists the top 5 regions ranked by their interestingness values. These sets suggest that there are regional patterns involving highly correlated attributes, whereas globally (Texas-wide) almost all attributes are members of HCAS and are equally correlated; a situation which fails to reveal strong structural relationships. Analyzing the correlation sets and region similarity helps us to identify regions that display variations over space. For example, with respect to the second principal component of region 15 we observe a positive correlation between Molybdenum and Vanadium and a negative correlation between Molybdenum and Fluoride neither of which exists globally, Moreover, the negative correlation between Molybdenum and Fluoride only exist in region 15 and is not observed in the other four regions. In general, such observations are highly valuable to domain experts, because they identify interesting hypotheses and places for further investigation. Table 4 shows the regional similarity matrix and Table 5 depicts the similarity between the 5 regions and the global data (Texas). Table 3. HCAS sets for the Top Ranked Regions
Region ID Texas Region 0 Region 1 Region 13 Region 21 Region 15
HCAS Sets for the first 3 PCs {As-,Mo-,B-,Cl-,SO4-} {As+,V+,Fl+,SiO2+} {As-, Mo-,SiO2+} {Cl-,SO4-} {As-,Mo-,V-} {Fl-,SiO2-} {B+,FL+, Cl+,SO4+} {As-, V-,SiO2-} {Mo+,SiO2-} {B+, Cl+,SO4+} {As+, Mo-,SiO2+ } {As-,Mo-,V-} {Mo+, B+, SiO2- } {As-, V-, Cl+} {As+,Fl+,Cl+,SO4+} {B-,Cl-,SO4-} {Mo-,V-,Fl+} {As-,V-,SiO2-}
Regional Pattern Discovery in Geo-referenced Datasets Using PCA
729
Table 4. Similarity Matrix of Regions for Experiment 1
Region 0 Region 1 Region 13 Region 21 Region 15
Region 0 0.00 0.61 0.88 0.62 0.65
Region 1 0.61 0.00 0.90 0.64 0.59
Region 13 0.88 0.90 0.00 0.40 0.92
Region 21 0.62 0.64 0.40 0.00 0.40
Region 15 0.65 0.59 0.92 0.40 0.00
Table 5. Similarity Vector of Regions to Global Data (Texas) for Experiment 1
Exp#1
Region 0
Region 1
Global
0.59
0.70
Region 13 0.84
Region 21
Region 15
0.59
0.66
Discussion. The HCAS similarity matrix and similarity vector in Tables 4 and 5 reveal that the HCAS similarity measure capture the true similarity between correlation patterns of regions. For example, we observe that region 13 and region 15 are the most similar regions. This is the result of following mappings: {B+, Cl+, SO4+} Æ {B-, Cl-, SO4- } ( PC1 of Region 13 Æ PC1 of Region 15 mapping ) {As+, Mo-, SiO2+} Æ {As-, V-, SiO2-} ( PC2 of Region 13 Æ PC3 of Region 15 mapping ) {As-, Mo-, V- } Æ {Mo-, V-, Fl+ } ( PC3 of Region 13 Æ PC2 of Region 15 mapping )
The discovered regions also maximize the cumulative variance captured through the first k principal components. The variance values for the top 5-ranked regions in experiment 2 are given in Table 6. The global data has a 57% cumulative variance which indicates that the attributes in the global data are not very highly correlated. But the regions discovered by our approach capture a much higher variance which is an indication that our framework successfully discovers regions with highly correlated attributes. Table 6. Cumulative Variance Captured by the first k PCs in Experiment 2
Region Texas Region 48 Region 28 Region 17 Region 22 Region 32
Variance Captured 57.10% 84.81% 77.19% 73.15% 72.30% 70.39%
Size 1655 30 19 39 16 44
One could argue that any method that divides data into sub-regions increases the variance captured since lower numbers of objects are involved. This is true to some extend but additional experiments that we conducted suggest that the regions discovered by the region discovery framework are significantly better than randomly selected regions. In particular, we ran experiments where we created regions at
730
O.U. Celepcikay, C.F. Eick, and C. Ordonez
random and computed the variance captured for those regions. Due to space limitations, we only provide a brief summary of the results here. The highest variance captured using random regions is 72% with 16 objects whereas in our approach it is 84% with 30 objects. In general, the regional variance captured using our framework was at an average 9.2% higher than the variance captured by random regions. 4.4 Post Processing via Regression Analysis Results The post processing phase first applies regression analysis to the global data by selecting Arsenic as the dependent variable and the other 7 chemical variables as the independent variables. The OLS regression result shows that Molybdenum, Vanadium, Boron, and Silica increase the arsenic concentration, but Sulfate and Fluoride decrease it Texas-wide. Next, it retrieves the list of the top-ranked regions and applies the regression analysis to regions. The result of the global regression and one example of regional regression analysis are shown in Tables 7 and 8, respectively. Table 7. Regression Result for Global Data
As Mo V B
Coef. 0.101 0.211 0.0027
Std.Er 0.0204 0.0048 0.0003
t 4.95 43.55 9.49
Fl
-0.6693
0.159
-4.34
SiO2 0.0726 0.0115 6.3 Cl 0.0008 0.0008 0.97 SiO4 -0.001 0.0007 -1.87 const -1.696 0.4902 -3.46 R-squared –Value: Adjusted R-squared Value
P>|t| 0 0 0 0 0 0.331 0.062 0.001 68% 68%
Table 8. Regression Result for Region 10
As Mo V B
Coef. 0.7297 0.234 -0.007
Std.Er 0.2731 0.031 0.004
t 2.67 7.52 -1.74
P>|t| 0.013 0 0.094
Fl
-4.996
3.4254
-1.46
0.156
-0.8 2.91 -3.34 1.84
0.428 0.066 0.192 0.076 95.03% 93.73%
SiO2 -0.071 0.0886 Cl 0.0138 0.0071 SiO4 -0.019 0.0142 const 7.3982 4.0134 R-squared –Value Adjusted R-squared Value
Regional Pattern Discovery in Geo-referenced Datasets Using PCA
731
Discussion. R-Squared value is equal to 68.3% for the state of Texas, which means 68.3% of the arsenic variance can be explained by other 7 chemical variables for Texas-wide data. R-squared value increased from 68% to 93.73% in Region 10, which indicates that in this region there exist stronger correlations between arsenic and the other variables. Also globally, Chloride (Cl) and Sulfate (SiO4) are not significant as predictors for arsenic concentration; but in this region, they are significant. Conversely, Boron, Fluoride, and Silica are globally significant and highly correlated with arsenic, but this is not the case in Region 10. This information is very crucial to domain experts who seek to determine the controlling factors for arsenic pollution, as it can help to reveal hidden regional patterns and special characteristics for this region. For example, in this region, high arsenic level is highly correlated to high Sulfate and Chloride levels, which is an indication of external factors that play a role in this region, such as a nearby chemical plant or toxic waste. Our framework is able to successfully detect such hidden regional correlations. Our approach can be viewed as using different regression function for different regions which shows similarity to the approach used in Geographically Weighted Regression (GWR) [12]. In GWR, a weight function which is a function of spatial location is used to differentiate regression functions for different locations, whereas in our work we first discover highly correlated regions maximizing a PCA-based based fitness function and then create regional regression functions for each region. The global and regional regression results show that the relationship of the arsenic concentration with other chemical concentrations spatially varies and is not constant over space which proves the need for regional knowledge discovery. In other words, there are significant differences in arsenic concentrations in water wells across regions in Texas. Some of these differences are found to be due to the varying impact of the independent variables on the arsenic concentration. In addition, there are unexplained differences that are not accounted for by our independent variables, which might be due to external factors, such as toxic waste or the proximity of a chemical plant. 4.5 Implementation Platform and Efficiency The components of the framework described in this paper were developed using an open-source, Java-based data mining and machine learning framework called Cougar^2[6], which has been developed by our research group. All experiments were performed on a machine with 1.79 GHz of processor speed and 2GB of memory. The parameter β is the most important factor with respect to run time. The run times of the experiments with respect to the β values used are shown in Figure 3. For example for β=1.01, it takes about 30 minutes to run the experiment, whereas it takes about 2 hours to run for β =1.6. We observed that more than 70% of the computational resources are allocated for determining regional fitness values when discovering regions. Even though our framework repeatedly applies PCA to each explored region combination until no further improvement is made, it is still efficient compared to approaches in which PCA is applied that many times using other statistical tools.
732
O.U. Celepcikay, C.F. Eick, and C. Ordonez
Fig. 3. Run Times vs. β Values
5 Conclusion This paper proposes a novel framework to discover regions and automatically extract their corresponding regional correlation patterns that are globally hidden. Unlike other research in data mining that uses PCA, our approach centers on discovering regional patterns and provides a comprehensive methodology to discover such patterns. The proposed framework discovers regions by employing clustering algorithms that maximize a PCA-based fitness function and our proposed post-processing techniques derive regional correlation relationships which provide crucial knowledge for domain experts. We also developed a generic pre-processing method to select the best k value for the PCA-based fitness function for a given dataset. Additionally, a new similarity measure is introduced to estimate the structural similarity between regions based on correlation sets that are associated with particular regions. This similarity measure is generic and can be used in other contexts when two sets of objects have to be compared based on other information (e.g. eigenvectors) that has been derived from their first k principal components. The proposed framework was tested and evaluated in a real world case study that analyzes regional correlation patterns among arsenic and other chemical concentrations in Texas water wells. We demonstrated that our framework is capable of effectively and efficiently identifying globally hidden correlations among variables along with the sub-regions that are interesting to the domain experts. As far as the future work is concerned, we are planning to conduct extensive comparative study of our regional patterns and the co-location patterns reported in [10] for the same dataset. We are also working on developing different PCA-based fitness functions that put more emphasis on the dependent variable with the goal of developing regional regression techniques.
Regional Pattern Discovery in Geo-referenced Datasets Using PCA
733
References 1. Achtert, E., Böhm, C., Kriegel, H.P., Kröger, P., Zimek, A.: Deriving quantitative models for correlation cluster. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, PA, pp. 4–13 (2006) 2. Anselin, L.: Spatial Econometrics: Methods and Models. Kluwer, Netherlands (1988) 3. Böhm, C., Keiling, K., Kröger, P., Zimek, A.: Computing Clusters of Correlation Connected Objects. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Paris, France (2004) 4. Celik, M., Kang, J., Shekhar, S.: Zonal Co-location Pattern Discovery with Dynamic Parameters. In: Proc. of 7th IEEE Int’l Conf. on Data Mining, Omaha, Nebraska (2007) 5. Choo, J., Jiamthapthaksin, R., Sheng Chen, C., Celepcikay, O.U., Giusti, C., Eick, C.F.: MOSAIC: A proximity graph approach for agglomerative clustering. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 231–240. Springer, Heidelberg (2007) 6. Cougar^2 Framework, https://cougarsquared.dev.java.net/ 7. Cressie, N.: Statistics for Spatial Data (Revised Edition). Wiley, New York (1993) 8. Data Mining and Machine Learning Group, University of Houston, http://www.tlc2.uh.edu/dmmlg 9. Ding, W., Jiamthapthaksin, R., Parmar, R., Jiang, D., Stepinski, T., Eick, C.F.: Towards Region Discovery in Spatial Datasets. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS, vol. 5012, pp. 88–99. Springer, Heidelberg (2008) 10. Eick, C.F., Parmar, R., Ding, W., Stepinski, T., Nicot, J.P.: Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets. In: Proc. 16th ACM SIGSPATIAL International Conference on Advances in GIS (ACM-GIS), Irvine, California (November 2008) 11. Eick, C.F., Vaezian, B., Jiang, D.J.: Discovering of interesting regions in spatial data sets using supervised clustering. In: Proc. of the 10th European Conference on Principles of Data Mining and Knowledge Discovery, Berlin, Germany (2006) 12. Fotheringham, A.S., Brunsdon, C., Charlton, M.: Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley, Chichester (2002) 13. Johnson, R.A.: Applied Multivariate Analysis. Prentice Hall, Englewood Cliffs (1992c) 14. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986) 15. Openshaw, S.: Geographical data mining: key design issues. In: GeoComputation 1999: Proceedings Fourth International Conference on GeoComputation, Mary Washington College, Fredericksburg, Virginia, USA, July 25-28 (1999) 16. Simpson, E.H.: The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, B 13, 238–241 (1951) 17. Texas Water Development Board, http://www.twdb.state.tx.us/home/index.asp 18. Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1) (1991) 19. Woolridge, J.: Econometric Analysis of Cross-Section and Panel Data, pp. 130, 279, 420– 449. MIT Press, Cambridge (2002) 20. Yu, S., Yu, K., Tresp, V., Kriegel, H.P., Wu, M.: Supervised Probabilistic Principal Component Analysis. In: Proceedings of the 12th ACM SIGKDD, pp. 464–473. ACM, New York (2006)
Memory-Based Modeling of Seasonality for Prediction of Climatic Time Series Daniel Nikovski1 and Ganesan Ramachandran2 1
2
Mitsubishi Electric Research Laboratories, Cambridge MA 02139, USA [email protected] Department of Electrical and Computer Engineering , University of Florida, Gainesville, FL 32611, USA [email protected]
Abstract. The paper describes a method for predicting climatic time series that consist of significant annual and diurnal seasonal components and a short-term stochastic component. A memory-based method for modeling of the non-linear seasonal components is proposed that allows the application of simpler linear models for predicting short-term deviations from seasonal averages. The proposed method results in significant reduction of prediction error when predicting time series of ambient air temperature from multiple locations. Moreover, combining the statistical predictor with meteorological forecasts using linear regression or Kalman filtering further reduces prediction error to typically between 1o C over a prediction horizon of one hour and 2.5o C over 24 hours.
1
Introduction
Many processes of practical interest in everyday life, such as climate variation (air temperature and humidity) and electrical power demand have very significant seasonal components that are driven by natural phenomena such as the Earth’s rotation around its axis and the Sun. These seasonal components render the time series non-stationary, and complicate the estimation of suitable prediction models for several major applications such as planning the generation of electricity and determination of its price, as well as the optimal scheduling of the operation of air conditioners, heating devices, domestic appliances, etc. The accurate prediction of such time series would result in efficient utilization of capital equipment, as well as positive environmental impact. Due to the high practical significance of this class of problems, many forecasting approaches have been tried. Within the classical time series prediction methodology that is based on auto-regressive moving average (ARMA) models, a possible method for handling non-stationarity is to difference the time series as many times as necessary to make the resulting time series stationary [4]. Such models are also known as integrated ARMA (ARIMA) models. However, if the seasonal component itself is non-linear, after differencing, the resulting time series might exhibit non-linear dependencies, which would preclude the use of low-order linear prediction models for modeling. P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 734–748, 2009. c Springer-Verlag Berlin Heidelberg 2009
Memory-Based Modeling of Seasonality for Prediction
735
It has also been discovered that direct application of more advanced machine learning techniques, such as neural networks, to the prediction of such time series can often result in poor accuracy, despite their high flexibility and ability to model dynamic systems [5]. This has been attributed to both numerical optimization difficulties, as well as to possible mismatches between the model and the physical process that generated the time series. A much more physically realistic approach consists of decomposing a seasonal time series as a sum of explicit seasonal components and a random noise component, and modeling these components separately. The two models need not be of the same type: for example, the models of the seasonal component can be non-linear, while the model for the random deviations can be linear. This approach corresponds well to the physical nature of some of the phenomena listed above. Using the classical decomposition model, we represent a time series Xt produced by one of these phenomena by a sum of a seasonal component st and a random noise component Yt , in the absence of a trend [4]: Xt = st + Yt . We then hypothesize that for the listed phenomena, the random noise component Yt is stationary, and can be predicted solely from its past values Yt−1 , Yt−2 , . . . , Yt−w for some width w of a window of past values, and that the seasonal component has a fixed period h: st = st+kh for all integers k. The problem then reduces to modeling well the two parts of the decomposition, st and Yt . In the remainder of the paper, we propose a memory based method for the estimation of the seasonal component st for the case of seasonal time series with annual and diurnal components, and present experiments on a number of data sets for ambient air temperature in multiple parts of the USA. We also describe how the predictions of the proposed model can be combined with meteorological forecasts in real time.
2 Memory-Based Modeling of Seasonality
The motivating application for this method is the prediction of ambient temperature at a specific location, for example outside of a residential or commercial building, from a database of historical readings at that location. This temperature will determine the actual thermal load that would be experienced by heating and air conditioning equipment, and is essential for the optimal scheduling of its operation. Typical prediction horizons of interest are around 24 hours. In most areas of the world, the ambient temperature is subject to very large variations due to two cyclical components. The first one is the change of seasons caused by the rotation of the Earth around the Sun (annual component). The second one is the change of night and day, caused by the rotation of the Earth around its axis (diurnal component). In addition to these two seasonal components, a random component exists that is caused by meteorological phenomena such as cold and warm fronts, cloud cover, wind, solar activity, etc. This component is irregular, but fairly inert and persistent — such conditions usually persist for intervals commensurate with the prediction horizon of interest.
In the classical decomposition framework, the two seasonal components would be modeled separately to produce average annual temperatures and average daily temperatures, to be subtracted from the original time series in order to deseasonalize it. For a number of reasons, this approach would not work with temperature time series. First, the daily variation of temperatures at a particular location usually does not depend only on the time of the day, but also on the day of the year. The reason is that, depending on how high the Sun is, some parts of the building would be in the shadow or not, thus strongly affecting the air temperature there. (The curve traced by the Sun along the sky at the same time every day for an entire year is called an analemma, and its vertical variation is around 46.878°, or twice the angular tilt of the Earth.) Because of this, the two seasonal components should be modeled together, and many prediction methods estimate the average temperature for a specified combination of date and time of the day. One simple way to achieve this is to do calendar averaging: for any combination of date and time of the day, for example 3pm on January 23, compute the average of all readings from a historical database of temperatures that have been recorded at 3pm on January 23 of any year.

The second reason classical decomposition, including calendar averaging, would not work well is that the period of revolution of the Earth around the Sun is not an integer number of days. Rather, the exact period of revolution is 365.25636042 solar days, also known as a sidereal year, i.e., measured with respect to the background stars. As is well known, the fractional part of one quarter of a day is corrected by means of a leap year every four years. The practical consequence of this is that the concept of average temperature at a specified combination of date and time of the day does not actually make sense. It is not correct to speak of the average temperature at 3pm on January 23, because depending on which year this day is in, the Earth might be at significantly different positions along its orbit around the Sun, and hence the impact of the Sun on the climate would be different. For example, on January 23, 2009, the Earth's position with respect to the Sun will be closer to that of January 24, 2008, rather than to that of January 23, 2008, due to the fact that 2008 was a leap year. (If February 29, 2008 did not exist, January 23, 2009 would have been dated January 24.) However, it will also be closer to that of January 23, 2005, rather than January 24, 2005.

In order to account for this mechanism, we propose an alternative memory-based estimation method called sidereal averaging. This method never computes explicit temperature estimates for a general combination of date and time of day; rather, it consults the database of historical readings only after a query time is given on a specific day of a specific year. The algorithm then retrieves and averages, for each year of data in the database, the temperature on the day when the position of the Earth along its orbit around the Sun was closest to its position on the query day, time, and year. In its characteristics, this algorithm is similar to other memory-based machine learning algorithms, such as k-nearest neighbors. The novelty in this algorithm is the distance measure used, that is,
the distance between corresponding positions of the Earth along its orbit around the Sun. We investigated the effect of the sidereal vs. calendar averaging methods in an experimental study, as described in Section 4. In both cases, we modeled the random component after deseasonalizing by means of low-order ARMA models.
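To make the retrieval step concrete, the following is a minimal Python sketch of sidereal averaging, not the authors' implementation: the orbital-position distance is approximated by the offset of the elapsed time from the nearest whole number of sidereal years (the 365.25636042-day value quoted above), the historical readings are assumed to be hourly (timestamp, temperature) pairs, and candidate readings are restricted to the query's hour of the day.

```python
from datetime import datetime

SIDEREAL_YEAR_DAYS = 365.25636042  # value quoted in the text above

def orbital_distance_days(t_query, t_hist):
    """Offset (in days) of the elapsed time from the nearest whole number of
    sidereal years -- used here as a proxy for the distance between the
    Earth's orbital positions at the two instants (an assumption of this sketch)."""
    delta = (t_query - t_hist).total_seconds() / 86400.0
    k = round(delta / SIDEREAL_YEAR_DAYS)
    return abs(delta - k * SIDEREAL_YEAR_DAYS)

def sidereal_average(history, t_query):
    """history: iterable of (datetime, temperature) hourly readings.
    For each past year of data, keep the reading taken at the query hour on the
    day whose orbital position is closest to the query's, then average them."""
    best = {}  # calendar year -> (orbital distance, temperature)
    for t, temp in history:
        if t >= t_query or t.hour != t_query.hour:
            continue
        d = orbital_distance_days(t_query, t)
        if t.year not in best or d < best[t.year][0]:
            best[t.year] = (d, temp)
    temps = [temp for _, temp in best.values()]
    return sum(temps) / len(temps) if temps else None

# Example: seasonal estimate for 3 pm on January 23, 2009.
# s = sidereal_average(history, datetime(2009, 1, 23, 15))
```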
3 Combining Statistical and Meteorological Forecasts
The methods described in the previous two sections, including the proposed method for sidereal averaging, are statistical machine learning methods: they use a database of past examples to build a predictive model, using various machine learning tools. However, for the case of temperature prediction, there is another very important source of forecasts: the governmental meteorological agencies in practically every country of the world. In recent years, detailed forecasts have been made available in real time using convenient information and communication infrastructure. For example, the National Weather Service (NWS) of the United States has been offering weather forecasts for the entire territory of the country as a standard web service since 2004. Using these forecasts in real-time prediction would be very desirable. However, these forecasts have a significant disadvantage: they are produced for a relatively small number of locations, typically airports, and even the closest location to the target place for prediction might have significantly different weather patterns. The question, then, is how to combine the local statistical prediction with the regional meteorological forecast. This problem has also been the subject of intensive research. Kawashima et al. proposed a curve-fitting method based on the forecast high and low temperatures [2]. Shaheen and Ahmed extended the method to include the current temperature as well [3]. Linear regression methods have also been tried, for example using the form:

Tt = a X̄t + b Zt + c,    (1)
where Tt is the combined forecast at time t, X̄t is the temperature predicted by the statistical method, and Zt is the temperature according to the meteorological forecast, possibly for a fairly different location. The regression coefficients a, b, and c can be estimated from a relatively small dataset of past values for the three variables T, X̄, and Z. They can also be continuously re-estimated, for example from the values immediately preceding the current moment in time. Another variation includes regression parameters al, bl, cl that depend on the prediction horizon l = 1, ..., L, to account for the varying ratio between the prediction errors of the statistical and meteorological forecasts that is typically encountered in practice. Here L is the longest prediction horizon, measured in time steps. Another method for combining forecasts is based on a Kalman filter [1]. The idea is to treat the meteorological forecast as a correction factor for the local
forecast, through a gain matrix Kt that is re-estimated continuously:

Tt = X̄t + Kt (Zt − H X̄t),    (2)

where H is a selection matrix, and the variables of interest are vectors of dimensionality L: Tt = [Tt+1, Tt+2, ..., Tt+L]^T, X̄t = [X̄t+1, X̄t+2, ..., X̄t+L]^T, Zt = [Zt+1, Zt+2, ..., Zt+L]^T. It can also be shown that the Kalman filter is a special case of a linear regression method, in which the regression coefficients are estimated differently and the statistical dependency between predictions at different horizons can also be modeled.
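As an illustration of the regression-based combination of Eq. (1) with horizon-dependent coefficients al, bl, cl, the sketch below estimates one coefficient triple per horizon by ordinary least squares. The array names and shapes (past observed temperatures, statistical predictions and meteorological forecasts arranged as n_samples x L matrices) are assumptions of the example, not something specified in the paper.

```python
import numpy as np

def fit_combination(T_obs, X_stat, Z_met):
    """Estimate (a_l, b_l, c_l) for every horizon l = 1..L by least squares.
    All three arrays have shape (n_samples, L)."""
    L = T_obs.shape[1]
    coeffs = np.zeros((L, 3))
    for l in range(L):
        A = np.column_stack([X_stat[:, l], Z_met[:, l], np.ones(len(T_obs))])
        coeffs[l], *_ = np.linalg.lstsq(A, T_obs[:, l], rcond=None)
    return coeffs

def combine(x_stat, z_met, coeffs):
    """Combined forecast a_l*Xbar_l + b_l*Z_l + c_l for one query time,
    given length-L vectors of statistical and meteorological forecasts."""
    return coeffs[:, 0] * x_stat + coeffs[:, 1] * z_met + coeffs[:, 2]
```

Re-fitting the coefficients on a sliding window of recent observations realizes the continuous re-estimation mentioned above.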
4 Experimental Verification of Prediction Methods
In order to train and evaluate the described prediction methods, hourly temperature data for the years 1995-2008 was obtained from the National Climatic Data Center (NCDC) of the USA for the eastern half of the country, for a total of 26 GB of data. The NCDC data had been recorded at most international airports and some local airports in the USA, and the National Weather Service also provides online forecasts for most of the same locations in XML format. The forecasts are available for the times of day 02:00, 05:00, 08:00, 11:00, 14:00, 17:00, 20:00, and 23:00, with a prediction horizon of the next 5 days. Updated observations are available hourly. Since historical data and forecasts were available only at airports, we performed the analysis on pairs of airports, treating one of the airports as the prediction target location and the other one as the source of the meteorological forecast (and vice versa). The airports in each pair were between 30 and 50 miles apart, in order to represent the typical distance from a residential or commercial building to the nearest airport. After examination of the data, four pairs of locations were selected for experimental verification, combining northern/southern latitude and coastal/continental climates. The pairs are shown in Table 1. The forecasts and the observed temperatures were recorded into a SQL server over a period of four weeks.

Table 1. Airport pairs for experimental verification

              Northern                          Southern
Coastal       BOS Boston Logan Airport          MIA Miami International Airport
              OWD Norwood Memorial Airport      TMB Kendall Tamiami Airport
Continental   CLE Cleveland Hopkins Airport     ATL Atlanta Hartsfield Airport
              BKL Burke Lakefront Airport       FTY Fulton County Airport
After modeling and subtracting the seasonal component using either calendar or sidereal averaging, we interpolated the resulting deviations from the seasonal average at a time interval of Δt equal to 1 hour, and modeled the resulting time
series of deviations Yt by means of ARMA models. We experimented with models of order varying from (AR=1, I=0, MA=0) to (AR=3, I=1, MA=1) for fitting the deviations, using the Time Series package TS in the statistical environment R, and discovered that even the simplest first-order autoregressive model, AR(1), was very successful at modeling the deviations. For example, its prediction error for the deviations at Boston Logan Airport after sidereal averaging was only 1.5% higher than that of the best ARIMA model. Furthermore, for the AR(1) model, the single regression coefficient r was typically very large for all time series Yt, around r = 0.98. This suggests that the deviations from normal seasonal average temperatures typically persist for a fairly long time, and prediction over horizons of up to 24 hours is indeed practically possible.

As regards the relative performance of sidereal vs. calendar averaging, Figures 1 through 4 show that in all cases sidereal averaging is either much better than calendar averaging (OWD, TMB, CLE, ATL, BKL) or the same (BOS, MIA). Figures 1 through 4 also show an interesting pattern: although the prediction error of the statistical prediction methods increases with the time horizon, as expected, the error of the meteorological forecast does not. This can be explained by the completely different methodology used by meteorological agencies, but still the accuracy of such forecasts at relatively long prediction horizons is remarkable. At the same time, their accuracy at short prediction horizons is much worse than that of the statistical predictors, which is an excellent justification for methods that attempt to combine statistical and meteorological forecasts.

Regarding the improvements in accuracy that can be achieved by such methods, Figures 5 through 8 show that in most cases (BOS, CLE, ATL, BKL) the combined forecast by either linear regression or a Kalman filter is much more accurate than either one of the individual forecasts. In some cases (FTY, TMB, MIA) there is no significant difference for horizons longer than 6 hours, but the combined prediction is still more accurate for horizons shorter than that, and there is only one single site (OWD) where the combined predictions are significantly less accurate than the meteorological forecast. Furthermore, systematic and significant differences between the performance of linear regression and Kalman filtering cannot be observed, which means that linear regression should be preferred in practical systems for its ease of implementation. (In all of these experiments, only sidereal averaging was used for combination with the meteorological forecast, since the previous set of experimental results established its superior performance over calendar averaging.)

In absolute terms, the performance of the combined predictors can be summarized as follows. The prediction error starts around 1°C to 1.5°C for a prediction horizon of 1 hour, and grows to between 2°C and 2.5°C during the next 3 to 10 hours, almost never exceeding 3°C over the longest prediction horizon of interest, 24 hours. Compared with the variability of daily and annual temperatures, which span an interval of around 50°C for most climates, this accuracy can be considered practically very useful.
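As a concrete illustration of the AR(1) modelling of the deviations discussed at the beginning of this section, the short sketch below fits the autoregressive coefficient by least squares and produces the h-step-ahead forecast r^h * Yt; it is written in Python for illustration, whereas the experiments above used the TS package in R.

```python
import numpy as np

def fit_ar1(deviations):
    """Least-squares estimate of r in Y_t = r * Y_{t-1} + e_t."""
    y = np.asarray(deviations, dtype=float)
    return float(y[:-1] @ y[1:] / (y[:-1] @ y[:-1]))

def forecast_deviations(last_deviation, r, max_horizon=24):
    """h-step-ahead AR(1) forecasts of the deviation: r**h * Y_t."""
    return [r ** h * last_deviation for h in range(1, max_horizon + 1)]

# Full temperature prediction for horizon h:
#   X_hat[t+h] = s[t+h] + r**h * (X[t] - s[t]),
# where s is the sidereal (or calendar) seasonal estimate.
```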
[Figure: two panels, BOS (top) and OWD (bottom). Axes: RMSE (in Celsius) vs. prediction horizon (in hours); legend: Site Sidereal AR1, Site Calendar AR1, Forecast, Same as current.]
Fig. 1. Comparison between seasonal averaging methods in a northern coastal climate
[Figure: two panels, MIA (top) and TMB (bottom). Axes: RMSE (in Celsius) vs. prediction horizon (in hours); legend: Site Sidereal AR1, Site Calendar AR1, Forecast, Same as current.]
Fig. 2. Comparison between seasonal averaging methods in a southern coastal climate
[Figure: two panels, CLE (top) and BKL (bottom). Axes: RMSE (in Celsius) vs. prediction horizon (in hours); legend: Site Sidereal AR1, Site Calendar AR1, Forecast, Same as current.]
Fig. 3. Comparison between seasonal averaging methods in a northern continental climate
[Figure: two panels, ATL (top) and FTY (bottom). Axes: RMSE (in Celsius) vs. prediction horizon (in hours); legend: Site Sidereal AR1, Site Calendar AR1, Forecast, Same as current.]
Fig. 4. Comparison between seasonal averaging methods in a southern continental climate
[Figure: two panels, BOS (top) and OWD (bottom). Axes: prediction RMSE (in °C) vs. prediction horizon (in hours); legend: Forecast RMSE, Prediction with Site Sidereal RMSE, Kalman with Site Sidereal RMSE, Linear with Site Sidereal RMSE.]
Fig. 5. Comparison between forecast combination methods in a northern coastal climate
[Figure: two panels, MIA (top) and TMB (bottom). Axes: prediction RMSE (in °C) vs. prediction horizon (in hours); legend: Forecast RMSE, Prediction with Site Sidereal RMSE, Kalman with Site Sidereal RMSE, Linear with Site Sidereal RMSE.]
Fig. 6. Comparison between forecast combination methods in a southern coastal climate
[Figure: two panels, CLE (top) and BKL (bottom). Axes: prediction RMSE (in °C) vs. prediction horizon (in hours); legend: Forecast RMSE, Prediction with Site Sidereal RMSE, Kalman with Site Sidereal RMSE, Linear with Site Sidereal RMSE.]
Fig. 7. Comparison between forecast combination methods in a northern continental climate
[Figure: two panels, ATL (top) and FTY (bottom). Axes: prediction RMSE (in °C) vs. prediction horizon (in hours); legend: Forecast RMSE, Prediction with Site Sidereal RMSE, Kalman with Site Sidereal RMSE, Linear with Site Sidereal RMSE.]
Fig. 8. Comparison between forecast combination methods in a southern continental climate
5 Conclusion and Future Work
We have proposed a novel method for modeling the annual and diurnal seasonality of time series, and have demonstrated experimentally that it significantly improves the accuracy of prediction of the ambient air temperature when combined with low-order ARMA models of the deseasonalized time series. The method is similar to other memory-based machine learning techniques such as k-nearest neighbors (kNN), and uses as distance function the difference between pairs of positions of the Earth along its orbit around the Sun. (In this case, the number of neighbors is equal to the number of years for which training examples have been collected.) Due to this similarity, it might be expected that algorithms that are counterparts to other memory-based machine learning methods, such as locally weighted polynomial regression or kNN with more neighbors, might improve the accuracy of seasonal modeling even further [6]. In practice, this would mean including observations from more days in the averaging process, possibly using variable weights. Also note that this experimental analysis does not prove that the sidereal averaging method is better than the calendar method for the purposes of modeling seasonal components; it merely indicates that the sidereal method is significantly better when followed by low-order ARMA modeling of the remaining random component.

We have also demonstrated that two linear methods for combining the statistical prediction with meteorological forecasts further reduce prediction error significantly. In these experiments, the meteorological forecast from the nearest airport was always used, but this approach can be extended in the future to using forecasts from multiple locations, which might improve accuracy even further in densely populated areas. Finally, the described method should also be applicable to other time series whose dynamics are determined entirely or partly by the motion of the Earth and the Sun, for example daylight, humidity, electrical power demand, etc.
References

1. Nagai, T.: A method for revising temperature and humidity prediction using additional observations and weather forecasts. In: Proceedings of Building Simulation 2007, pp. 245–252 (2007)
2. Kawashima, M., Dorgan, C.E., Mitchell, J.W.: Hourly thermal load prediction for the next 24 hours by ARIMA, EWMA, LR, and an artificial neural network. ASHRAE Transactions 101(1), 186–200 (1995)
3. Shaheen, N.I., Ahmed, O.: A simple methodology to predict local temperature and humidity. ASHRAE Transactions 104, Part 1A, 451–459 (1998)
4. Brockwell, P., Davis, R.: Introduction to Time Series and Forecasting, 2nd edn. Springer, Heidelberg (2002)
5. Faraway, J., Chatfield, C.: Time series forecasting with neural networks: a comparative study using the air line data. Journal of the Royal Statistical Society: Series C (Applied Statistics) 47(2), 231–250 (1998)
6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, Heidelberg (2001)
A Neural Approach for SME's Credit Risk Analysis in Turkey

Gülnur Derelioğlu1,2, Fikret Gürgen2, and Nesrin Okay3

1 Yapı ve Kredi Bankası A.Ş. Information Technology Management, Üsküdar, 34700, İstanbul, Turkey
[email protected]
2 Computer Eng. Dept., Bogazici University, Bebek 34342, İstanbul, Turkey
[email protected]
3 Dept. of Management, Bogazici University, Bebek 34342, İstanbul, Turkey
[email protected]
Abstract. This study presents a neural approach that cascades a neural classifier, a multilayer perceptron (MLP), with a neural rule extractor (NRE) for real-life Small and Medium Enterprise (SME) credit risk analysis in Turkey. In the feature selection stage, decision tree (DT), recursive feature elimination (RFE), factor analysis (FA) and principal component analysis (PCA) methods are implemented. In this stage, the RFE approach gave the best result in terms of classification accuracy and minimal input dimension. Then, in the classification stage, an MLP that is used for preprocessing is followed by an NRE. The MLP makes a decision for customers as being "good" or "bad", and the NRE reveals the rules by which the classifier reached its final decision. In the experiments, the Turkish SME database has 512 samples. The proposed approach is compared with k-NN and SVM classifiers. It was observed that the MLP-NRE was slightly better than SVM and local k-NN. Keywords: Credit risk analysis (CRA), Small and Medium Enterprises (SMEs), multilayer perceptron (MLP), neural rule extraction (NRE), support vector machines (SVM), k-nearest neighbor (k-NN).
1 Introduction

Credit Risk Analysis (CRA) is a challenging data mining problem in the financial analysis domain. Recently, it has gained importance as credit volume in the real market has increased greatly and economic fluctuations have become more frequent. In Turkey, 95% of enterprises are classified as small and medium enterprises (SMEs), which reveals their importance in the national economy [1]. Not only in Turkey but also in many developing countries, especially in recent years, SME credits have been gaining importance owing to their rapid growth in the financial world. Despite this growth in the worldwide financial sector, there is not enough research on CRA for SMEs.
Credit Risk (CR) is a general term that refers to potential future losses. CRA aims to decrease future losses by estimating the potential risk and rejecting a new credit proposal if the risk is higher than a defined tolerance value. This is also called CR classification, which labels a customer as "good" if he can pay the loan back and as "bad" otherwise. In our study, we propose a cascaded multilayer perceptron (MLP) and neural rule extraction (NRE) system that classifies customers to make a final credit decision of "good" or "bad" and forwards the result to a rule-extraction stage that reveals under which conditions the final decision is made for a good customer. As a preprocessing step, various dimension reduction techniques are used to select a minimal input dimension for the classifiers. Although only a small subset of the original SME portfolio is used as input, when the real, much larger portfolio size is taken into consideration, dimension reduction becomes an indispensable phase of the proposed method. In the dimension reduction phase, Decision Tree (DT) and Recursive Feature Elimination with Support Vector Machine (SVM-RFE) are applied for feature selection, and then Factor Analysis (FA) and Principal Component Analysis (PCA) are applied for feature extraction. After the preprocessing, the input features are forwarded to the classification level, where MLP, SVM and k-NN classifiers are employed to select the optimal combination in terms of classification performance and dimension. In the rule-extraction level, CRED is applied to extract decision rules from the trained MLP. The work flow of our proposed method is given in Figure 1.

1.1 Previous Work
CRA is an appealing topic where a 1% improvement in accuracy, which may seem insignificant, will reduce losses in a large loan portfolio and save billions of euros. Thus, many techniques have been proposed for CRA: Neural Network (NN), SVM, LR and k-NN algorithms are widely used on different datasets and have provided good prediction accuracy. Although many studies [2-7] have addressed CRA for large corporates, personal credits and credit cards, there are only a few works on SME CRA. SMEs behave differently from corporates and individuals. Altman and Sabato note that analyses of German and French SMEs show that SMEs are riskier and have lower asset correlation with each other than large corporates [8]. Thus, models developed for corporates would not be suitable for SMEs. Altman and Sabato developed a default prediction model based on logistic regression using US SME data and indicated that their proposed method outperformed Multivariate Discriminant Analysis. Fantazzini and Figini also proposed a Random Survival Forest model, which gave slightly better performance than the classical logistic model on a real-life dataset obtained from Creditreform [1].

1.2 Dataset
The real-life dataset is provided by Yapı ve Kredi Bankası A.Ş. [9] and consists of SME information collected from the credit portfolio until 2007. Only a small subset of the original portfolio was taken randomly, without any sampling methodology; thus we do not claim that the dataset perfectly reflects the behavior of the whole portfolio.
Fig. 1. Work flow of the proposed method
The dataset has 512 samples with 27 features and a binary class variable, either good or bad. The class distribution is not homogeneous: 144 customers (28%) were good and 368 customers were bad. Six of the features are categorical and the others are continuous variables. These features mainly cover four different types of information: demographic, financial, risk and delinquency information. The remainder of this paper is organized as follows: dimension reduction techniques are outlined in Section 2, credit risk classification methodologies are described in Section 3, and rule-base extraction is presented in Section 4. Lastly, experimental results are discussed in Section 5.
2 Dimension Reduction

In data mining applications, the time and space complexity of any classifier or regressor directly depends on the input data size [10]. Dimensionality reduction techniques can be applied to the input data to obtain a reduced representation of the dataset without losing the integrity of the original data [11]. These techniques can be divided into two different groups: feature selection, which aims to obtain a subset of the original features without losing information, and feature extraction, which aims to find a new set of features that are combinations of the original variables.

2.1 Feature Selection
Feature selection techniques are also called subset selection. Many approaches have been proposed for feature selection, such as the well-known DT and SVM-RFE methods.

Decision Tree: DT is a hierarchical data structure for supervised learning and is used for both classification and regression. DT learning is based on divide-and-conquer: it starts at the root node by splitting the input space into two subspaces and continues recursively with the corresponding subsets until a leaf node is obtained. Learning ends when the best split is reached. The quality of a split is measured by impurity: a split is pure if all instances in the resulting branch are labeled with the same class. The final univariate tree consists of the most relevant features and discards irrelevant ones. In this work, J48, which is a C4.5 tree [12] implemented in Weka [13], is used as the first feature selection method.

Recursive Feature Elimination with Support Vector Machine: RFE utilizes the generalization capability embedded in SVM. RFE keeps the independent features containing the original dataset information while eliminating weak and redundant features [14]. However, the subset produced by SVM-RFE is not necessarily composed of the features that are individually most relevant; only taken together are the features of a produced subset optimally informative [15]. The working methodology of SVM-RFE is based on backward selection: the algorithm starts with all features and iteratively eliminates the worst one until the predefined size of the final subset is reached, and at each iteration the remaining features must be ranked again [16]. We apply the Weka SVM-RFE algorithm with the square of the weight as ranking criterion, and at each step the feature that causes the minimum variation in the SVM cost function is removed from the feature space.

2.2 Feature Extraction
Feature extraction methods replace the original variables with a smaller set of underlying information by using a linear transformation without loss of information [17]. In this work, Principal Component Analysis (PCA) is used, and its results are compared to Factor Analysis (FA) results to reveal the underlying factors of the original dataset.
Principal Component Analysis: PCA tries to maximize variance of features and use covariance matrix of input variables to obtain eigenvector and their corresponding eigenvalues. In PCA, to determine the optimal number of dimension, proportion of variance is used which is preferred to be higher than a predefined threshold value. Proportion of variance is the proportion of the sum of highest k eigenvalues to sum of all eigenvalues. We use PCA implemented in Matlab [18] where threshold value is taken as 0.90 for proportion of variance. Factor Analysis: FA assumes that observable variables are linear combinations of underlying factors and error terms. In FA, the key measurement is correlation between observable variables. If two variable is highly correlated that indicates these two are related by factors. FA determines the optimal number of factors after calculation of the eigenvectors and corresponding eigenvalues from input correlation matrix by 2 different criterions: proportion of variance as in PCA or Kaiser Criterion [19] where factors with eigenvalues greater than 1 are chosen. The FA reduced dataset is obtained by projection of factor loadings, which represents the correlations of variables with the factors, and estimated covariance on observable variables. In this work, SAS Enterprise Guide tool [20] is used with Kaiser Criterion to determine the underlying factors then FA reduced dataset is obtained by Matlab.
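As a sketch of this dimension reduction step, the code below performs SVM-RFE feature selection and PCA feature extraction with scikit-learn; it is a stand-in for the Weka, Matlab and SAS tools actually used in this work, with the 13-feature target taken from Section 5.2 and the 0.90 variance threshold quoted above.

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA

def reduce_dimensions(X, y, n_selected=13, var_threshold=0.90):
    """Feature selection with SVM-RFE and feature extraction with PCA."""
    # SVM-RFE: rank features by the weights of a linear SVM and recursively
    # drop the weakest one until n_selected features remain.
    rfe = RFE(estimator=SVC(kernel="linear"),
              n_features_to_select=n_selected, step=1)
    X_rfe = rfe.fit_transform(X, y)

    # PCA: keep as many components as needed to reach the proportion-of-variance
    # threshold.
    pca = PCA(n_components=var_threshold)
    X_pca = pca.fit_transform(X)

    return X_rfe, X_pca, rfe.support_, pca.explained_variance_ratio_
```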
3 Credit Risk Classification Credit Risk Classification is the most common CRA method. In this work, we only focus on data mining techniques to classify our SME portfolio and apply k-NN, MLP and SVM on the original and the reduced datasets. 3.1 K-Nearest Neighbor K-Nearest Neighbor (k-NN) is a non-parametric classifier which makes the final decision for a new instance according to the k closest training data points’ majority class. K is generally chosen as an odd number to minimize confusion between two neighboring classes. The measure of closeness is in terms of d dimensional input space and there are different measurements such as Euclidean Distance which is a linear distance or Mahalanobis Distance whose calculation is based on the variation in each component of the points [21]. For experiments, WEKA is used to apply k-NN on real life financial dataset with k = 5 and Euclidean Distance is used as closeness measure. 3.2 Multilayer Perceptron Multilayer Perceptron (MLP) is a nonparametric neural network structure and used for both classification and regression. MLP learning process starts at input layer without any calculation. Briefly, hidden units nonlinearly transform the d dimensional input space to h dimensional space where the output units produce the output values as a linear combinations of the h dimensional nonlinear activation values computed by hidden units. In each epoch, weighted sum of input variables are sent as inputs to
hidden units then nonlinear activation function is applied. In back-propagation algorithm, output value of each layer is used for previous layer weight updates. This process continues until one of the stopping criterions is reached; either maximum number of iterations or minimum error rate. We apply Weka Multilayer Perceptron function, with learning rate 0.8 and maximum 500 epochs. Only 1 hidden layer is preferred with predefined number of hidden units for reducing the complexity. Sigmoid function is used as activation function, which classifies the sample as bad customer if produced output is greater than 0.5. 3.3 Support Vector Machine SVM is a discriminant-based method, also used for both classification and regression. In classification, SVM tries to find the optimal hyperplane which maximizes the distance between data points from different classes. The distance from the hyperplane on each side is called as margin and the aim is to maximize the margin. The classification result of SVM is determined as if output is greater than 0 then class is 1 indicating bad customer else 0 indicating good customer. SVM can also handle non-linear problem by mapping the input space into nonlinear space by non-linear transformation. Different kernels such as Polynomial or Radial Basis Kernel is widely-used for transformation. We use SVM for classification purpose with different kernels and compare the results with each other, also with other classifier. As SVM tool, we use WEKA SVM (SMO) tool with polynomial kernel.
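The classification stage can be prototyped outside Weka as well. The sketch below uses scikit-learn counterparts of the three classifiers with the hyper-parameters quoted above (k = 5, a single hidden layer with learning rate 0.8 and at most 500 epochs, a polynomial-kernel SVM) and scores them with accuracy and the Matthews correlation coefficient used in Section 5; it approximates, rather than reproduces, the Weka setup, and the number of hidden units is an assumed value.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, matthews_corrcoef

def compare_classifiers(X, y, n_hidden=10):   # n_hidden is not given in the paper
    models = {
        "k-NN": KNeighborsClassifier(n_neighbors=5),        # Euclidean distance
        "MLP": MLPClassifier(hidden_layer_sizes=(n_hidden,),
                             activation="logistic",
                             learning_rate_init=0.8,
                             max_iter=500, random_state=0),
        "SVM": SVC(kernel="poly"),                          # polynomial kernel
    }
    results = {}
    for name, model in models.items():
        pred = cross_val_predict(model, X, y, cv=10)        # 10-fold cross validation
        results[name] = (accuracy_score(y, pred), matthews_corrcoef(y, pred))
    return results
```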
4 Classification Rule-Base Development Credit Risk Classification techniques work online and do not give any information how they arrive at the final decision. To reveal under which circumstances a customer is assigned as good or bad, we use neural rule-extraction from a trained MLP. We prefer to use CRED (Continuous/Discrete Rule Extractor via Decision Tree Induction) as a result of that the dataset includes both continues and discrete variables thus. CRED is composed of 4 steps [22]. Step1: Train a neural network with 3 layers: an input layer, a hidden layer and an output layer. Step2: Build a hidden-output tree where the input variables are activation values of hidden units and output variable is target value produced by neural network. Then, extract the rules, called as intermediate or hidden rules from composed decision tree and simplify each intermediate rule by removing useless literals and eliminate overlapping rules. For each remaining rules, generate functions which cover hidden nodes’ boundaries. Step3: Build a new decision tree for each intermediate function generated in step2. Each decision tree should corresponds to one function where input variables are obtained from the original input variable and output values are their discrete target produced by neural network. After composed of decision tree for each condition, extract the final rules which produce the target class.
Step4: If necessary, simplify and eliminate redundant rules. The final set is rulebase which describes the relationships between input variables and target. In this work, Weka Multilayer Perceptron is used for training MLP, J48 is used for obtaining decision tree and corresponding rules in Step2 and Step3.
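A simplified version of this pipeline can be sketched with scikit-learn in place of the Weka MLP and J48 tools; the hidden activations are recomputed from the trained network's weights, the rule simplification of Steps 2 and 4 is omitted, and all parameter values are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

def cred_sketch(X, y, n_hidden=4):
    """X: numpy array of inputs, y: binary class labels."""
    # Step 1: train a three-layer network (input, one hidden layer, output).
    mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                        max_iter=500, random_state=0).fit(X, y)

    # Hidden-unit activations of every training example, recomputed from the weights.
    h = 1.0 / (1.0 + np.exp(-(X @ mlp.coefs_[0] + mlp.intercepts_[0])))

    # Step 2: hidden-output tree over the hidden activations and the network's
    # own predictions, from which the intermediate (hidden) rules are read off.
    hidden_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    hidden_tree.fit(h, mlp.predict(X))
    print(export_text(hidden_tree,
                      feature_names=[f"hidden{i + 1}" for i in range(n_hidden)]))

    # Step 3: for every hidden-unit threshold used above, build a tree over the
    # original input variables that explains when the condition holds.
    input_trees = {}
    for feat, thr in zip(hidden_tree.tree_.feature, hidden_tree.tree_.threshold):
        if feat < 0:                       # leaf node, no split to explain
            continue
        condition = (h[:, feat] <= thr).astype(int)
        input_trees[(int(feat), float(thr))] = \
            DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, condition)
    return mlp, hidden_tree, input_trees
```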
5 Experimental Results 5.1 Performance Metrics Different performance metrics are used in this research. For the classification results exhibition, confusion matrix is preferred. The entries in the confusion matrix have the following meanings: True Positive (TP) is the number of correct predictions that a customer is good. False Negative (FN) is the number of incorrect predictions that a customer is bad. False Positive (FP) is the number of incorrect predictions that a customer is good. True Negative (TN) is the number of correct predictions that customer is bad. For performance comparison, firstly, accuracy (Acc) is preferred which is the proportion of the total number of correct predictions. Acc ranges from 0 to 1 where Acc is closer to 1 then the performance of classifier increases. In classification phase, in addition to accuracy, Mathews Correlation Coefficient (Mcc) is preferred which indicates the quality of classifier for binary class problem especially when two classes are of very different sizes [23]. Mcc ranges from -1 to 1 where -1 corresponds to inverse classification, 0 corresponds to average classification performance and +1 represents perfect classification. Mcc = (TP * TN – FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (3) The third measure that is used to reveal classifier performance is Misclassification rate (ms_rate), which is the proportion of misclassified instances to the total number of classified instances. Another performance metrics we use when comparing the results is expert’s view. In real market, the aim is increasing the credit amount when decreasing the risk. From this point of view, the ratio of real bad customers who are classified as good becomes very important. Hence, we use fp rate and fn rate. fp rate is false positives’ rate which indicates the proportion of number of bad customers who classified as good to total number of bad customers and fn rate is false negatives’ rate which is the proportion of number of good customers who classified as bad to total number of good customers. In knowledge extraction phase, performance measure to understand how applicable the obtained rule-set is defined as precision. Precision corresponds to accuracy in classification, which is the proportion of truly classified instances to total number of instances which meets rule’s conditions. 5.2 Dimension Reduction Results Decision Tree: C4.5 decision tree is applied on original dataset with 10-fold cross
validation and eliminates 15 variables when reached the best splits. The reduced dataset is called as Dataset1 in the following part of this work.
Recursive Feature Elimination with Support Vector Machine: SVM-RFE is applied on the original dataset. When number of features is determined as 13, Dataset2 is obtained and when number of features is defined as 7, Dataset3 is obtained. Principal Component Analysis: PCA is applied on original dataset where proportion of variance is defined as 0.90 then 1 dimensional dataset is obtained which will be mentioned as Dataset4 since now. Factor Analysis: FA is applied on our dataset according to Kaiser Criterion and as a result 9 factors are revealed. Obtained factors are: • • • • • • • • •
Factor1 is the combination of customer existing risk information. Factor2 is the combination of customer delinquency information. Factor3 is the combination of customer historical information. Factor4 is the combination of customer historical delinquency information. Factor5 is the combination of customer credit and corresponding guarantee information. Factor6 is the combination of customer demographic information. Factor7 is the combination of customer financial information. Factor8 corresponds to specific customer demographic information. Factor9 is the maximum delinquency information.
From FA, 9 dimensional dataset is obtained which is called as Dataset5 in the following subsections. Furthermore, when FA is applied on Dataset2 for only releasing the underlying factors, 4 factors are obtained which are the same as Factor1, Factor2, Factor9 and Factor7. As seen from the results, factors extracted from Dataset2 are a subset of the factors extracted from original dataset. This result also indicates that Dataset2 almost carries the information behind the original dataset. 5.3 Classification Results Credit risk classification phase determines the best classifier. Thus, the performance results of 3 different classifiers; k-NN, MLP and SVM on 6 datasets are given in the Table 1. When the classifiers’ performances are compared, Mcc, fp_rate and fn_rate are taken into consideration. According to the results, we see that Dataset2 produces significantly good results for all three classifiers however the last two datasets which are produced by feature extraction algorithms are failed. Thus, we continue with dataset2 in the following parts of this research. k-NN produces the highest classification accuracy however the results can be affected by even a small economical fluctuation. Thus, instead of k-NN, MLP is chosen as optimal classifier which gives good classification performance and also outperforms SVM. Hence the other parts of our proposed method are composed on MLP.
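For reference, the four measures reported in Table 1 below can be computed directly from the confusion-matrix counts defined in Section 5.1; the helper function is a straightforward transcription of those definitions, added here for illustration.

```python
from math import sqrt

def classification_measures(tp, fn, fp, tn):
    """Acc, Mcc, fp_rate and fn_rate as defined in Section 5.1."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    fp_rate = fp / (fp + tn)   # bad customers classified as good
    fn_rate = fn / (fn + tp)   # good customers classified as bad
    return {"Acc": acc, "Mcc": mcc, "fp_rate": fp_rate, "fn_rate": fn_rate}
```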
Table 1. Classifiers' Results on all datasets

Classifier  Dataset           FN Rate    FP Rate    Acc      Mcc
k-NN        Original Dataset  0.680556   0.089674   0.7441   0.286
k-NN        Dataset1          0.652778   0.103261   0.7422   0.2907
k-NN        Dataset2          0.444444   0.095109   0.8066   0.4961
k-NN        Dataset3          0.479167   0.11413    0.7832   0.4355
k-NN        Dataset4          0.847222   0.149457   0.6543   0.0042
k-NN        Dataset5          0.833333   0.173913   0.6406   -0.0086
MLP         Original Dataset  0.527778   0.13587    0.7539   0.3591
MLP         Dataset1          0.451389   0.154891   0.7617   0.4008
MLP         Dataset2          0.548611   0.125      0.7559   0.3561
MLP         Dataset3          0.784722   0.029891   0.7578   0.3037
MLP         Dataset4          1          0          0.7188   -
MLP         Dataset5          1          0          0.7188   -
SVM         Original Dataset  0.840278   0.013587   0.7539   0.289
SVM         Dataset1          0.840278   0.01087    0.7559   0.2994
SVM         Dataset2          0.840278   0.008152   0.7578   0.3104
SVM         Dataset3          0.840278   0.008152   0.7578   0.3104
SVM         Dataset4          1          0          0.7188   -
SVM         Dataset5          0.993056   0          0.7207   -
5.4 Rule-Base Extraction Results
This phase reveals how the classifier reached its final decision. CRED is applied on Dataset2; in the second step, 3 hidden rules are obtained, of which only 1 rule corresponds to the target class: "If hidden1 <= 0.99986 and hidden4 > 0.62039 then good", where hiddeni corresponds to the i-th hidden neuron's activation value. For the two boundaries of hidden rule 2, two functions are generated. In the third step, two decision trees are composed, each of which corresponds to one of the generated functions. After simplification of each rule, the rule base given in Figure 2 is obtained, which determines under which conditions a customer is classified as good.
(If a2 <= 4) || (If A2 > 4 and A12 > 2 and A7 <= 80000 and A4 > 1223 and A9 <= 1 and a6 > 0) || (If A2 > 4 and A12 > 2 and A7 <= 80000 and A4 > 1223 and A9 > 1) || (If A2 > 4 and A12 > 2 and A7 > 80000 and A8 <= 28) || (If A2 > 4 and A3 <= 9320 and A6 > 0 and A10 <= 1) then Good
Fig. 2. Rule base obtained from trained MLP on dataset2
When these 5 rules are applied on Dataset2, the performance results obtained are given in detail in Table 2. These results indicate that the developed rule base largely reveals the decision criteria of the classifier. For example, Rule1 and Rule3 have a precision of 0.96 and 1.0, respectively. The other rules, Rule2, Rule4 and Rule5, have lower precision with respect to the decision of the classifier.

Table 2. Rule-base performance results for dataset2

Rules      Rule1   Rule2    Rule3   Rule4    Rule5
Precision  0.96    0.5641   1       0.6023   0.3863
6 Conclusions

In this research, we focus on credit risk analysis, which has become very important in the real financial market in recent years. We proposed a modular, comprehensive credit risk analysis method for a Turkish SME customer portfolio, which covers both customer classification and rule-base extraction. From the experimental results, it is evident that the proposed model performs well on the sample space; however, as stated earlier, we do not claim that the model would perform equally well on the whole portfolio. The obtained results cannot be compared to any other freely available dataset, because SME behavior changes according to many different conditions, from a country's financial regulations to the economic fluctuations during the data collection period.

Acknowledgement. Finally, we express our thanks to the Boğaziçi University BAP 09A102P project and to Yapı ve Kredi Bankası A.Ş., and special thanks to all who participated in our research.
References 1. Fantazzini, D., Figini, S.: Random Survival Forests Model for SME Credit Risk Measurement, Methodology and Computing in Applied Probability, January 15, 2009, vol. 11, pp. 29–45. Springer, Heidelberg (2009) 2. Yang, C.G., Duan, X.B.: Credit Risk Assessment in Commercial Banks Based on SVM using PCA. In: International Conference on Machine Learning and Cybernetics. Kunming, vol. 2, pp. 1207–1211 (2008) 3. Zhou, J., Bai, T.: Credit Risk Assessment using Rough Set Theory and GA-Based SVM. In: The Third International Conference on Grid and Pervasive Computing-Workshop, pp. 320–325 (2008) 4. Wei, L., Li, J., Chen, Z.: Credit Risk Evaluation Using Support Vector Machine and Mixture of Kernel. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4488, pp. 431–438. Springer, Heidelberg (2007)
5. Galindo, J., Tamayo, P.: Credit Risk Assessment Using Statistical and Machine Learning: Basic Methodology and Risk Modeling Applications. Computational Economics 15, 107– 143 (2000) 6. Kaya, M.E., Gürgen, F., Okay, N.: An Analysis of Support Vector Machines for Credit Risk Modeling. In: Applications of Data Mining in E-Business and Finance. Frontiers in Artificial Intelligence and Applications, vol. 177, pp. 25–33 (2008) 7. Gaganis, C., Pasiouras, F., Spathis, C., Zopounidis, C.: A Comparison of Nearest Neighbors, Discriminant and Logit Models for Auditing Decisions. Intelligent Systems in Accounting, Finance and Management 15, 23–40 (2007) 8. Altman, E.I., Sabato, G.: Modeling Credit Risk for SMEs: Evidence from the US Market, http://pages.stern.nyu.edu/~ealtman/ 9. Yapı ve Kredi Bankası A.Ş. 10. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2004) 11. Han, J., Kamber, M.: Data Mining Consepts and Techniques. Academic Press, London (2001) 12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufman, San Francisco (1993) 13. WEKA, Datamining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/ 14. Chen, X., Jeong, J.C.: Enhanced Recursive Feature Elimination. In: IEEE Sixth International Conference on Machine Learning and Applications, pp. 429–435 (2007) 15. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1-3), 389–422 (2002) 16. Thang, Y., Zhang, Y., Huang, Z.: Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis. IEEE/ACM Transactions on Computational Biology And Bioinformatics, 365–381 (2007) 17. Tsai, F.S., Chan, K.L.: Dimensionality Reduction Techniques for Data Exploration. In: Proc. IEEE 6th International Conference on Information, Communications and Signal Processing, Singapore, December 10-13, pp. 1–5 (2007) 18. Mathworks, http://www.mathworks.com/ 19. Oreški, D., Peharda, P.: Application of Factor Analysis in Course Evaluation. In: Proceedings of the ITI 2008 30th Int. Conf. on Information Technology Interfaces, June 23-26, pp. 551–556 (2008) 20. SAS, http://www.sas.com 21. Weinberger, K.Q., Lawrence, K.S.: Fast Solvers and Efficient Implementation for Distance Metric Learning. In: Proceedings of the 25 th International Conference on Machine Learning, Finland, vol. 307, pp. 1160–1167 (2008) 22. Sato, M., Tsukimoto, H.: Rule Extraction from Neural Networks via Decision Tree Induction. In: Proceedings of IJCNN 2001, International Joint Conference on Neural Networks, vol. 3, pp. 1870–1875 (2001) 23. Lund, O., Nielsen, M., Lundegaard, C., Keşmir, C., Brunak, S.: Immunological Bioinformatics. MIT Press, London
Assisting Data Mining through Automated Planning

Fernando Fernández1, Daniel Borrajo1, Susana Fernández1, and David Manzano2

1 Universidad Carlos III de Madrid, Leganés, Spain
[email protected]
http://www.plg.inf.uc3m.es
2 Ericsson España, Madrid, Spain
Abstract. The induction of knowledge from a data set relies on the execution of multiple data mining actions: applying filters to clean and select the data, training different algorithms (clustering, classification, regression, association), evaluating the results using different approaches (cross validation, statistical analysis), visualizing the results, etc. In a real data mining process, these actions are executed several times, sometimes in a loop, until an accurate result is obtained. However, performing these tasks requires a data mining engineer or expert who supervises the design and evaluates the whole process. The goal of this paper is to describe MOLE, an architecture to automate the data mining process. The architecture assumes that the data mining process can be seen from a classical planning perspective and, hence, that classical planning tools can be used to design the process. MOLE is built and instantiated on the basis of i) standard languages to describe the data set and the data mining process; and ii) available tools to design, execute and evaluate the data mining processes.
1 Introduction
Data mining (DM) is a complex task. It typically begins with a query or request for some knowledge, whose answer could be inferred from some data. For instance, the marketing department of a telecommunication company could ask: given the pattern of use of SMSs of a client, could his/her pattern of use of MMSs be predicted? Obviously, to answer this question, a lot of information could be used, like the SMS and MMS use patterns of similar clients or the profile of the user. The DM process typically relies on a trial-and-error process in which engineers execute different actions, like extracting data from the databases, flattening the data, executing some filters, applying different clustering, classification or regression methods, visualizing results, etc. All these actions can be executed using many different alternatives for each task. For instance, there are plenty of filtering, classification or clustering techniques. Usually, some sub-sequences of this process are manually repeated until the results are good enough. To support people in the analysis of data and/or to improve the DM process, some
researchers claimed that part of this process can be automated using different approaches [1,2,3,7,12,13]. This document describes an architecture, called mole, to automate a data-mining process through current AI planning techniques. This architecture is based on: i) the use of the Predictive Model Markup Language (PMML) to describe the data-mining input, methodology, and result; ii) the use of the Planning Domain Definition Language (PDDL) to describe the data-mining process as a planning task; iii) the execution of a planner to automatically provide a sequence of DM actions to be performed; and iv) translators from PMML to PDDL and vice versa. The architecture builds upon the following ideas:

– A DM task can be formalized using PMML.
– Such a DM task can be automatically rewritten as a planning domain and problem described in PDDL.
– Different planning systems could be used to generate a plan from the domain and problem description which, in fact, represents a DM process.
– The generated plan can be translated back into PMML (or any other related language), so that it can be interpreted by a DM engine and, therefore, executed.
– The output of the DM process can be evaluated with two main goals: first, to analyse its usefulness for the current DM query and, if needed, perform additional DM processes; second, to acquire knowledge about the DM process itself, so that such knowledge can be used in the future to perform new DM processes more efficiently.

Section 2 describes the different languages used to formalize the knowledge involved in the DM process. Section 3 describes the architecture. Section 4 evaluates the architecture by implementing an example of use with state-of-the-art systems, specifically pltool [8] and weka [15]. Section 5 presents related work, and Section 6 introduces some conclusions and future research lines.
2 The Languages
This section describes the different standard languages used in the mole architecture. First, we describe PMML, which is used to define DM inputs, processes and outputs. Then, we introduce PDDL, the Planning Domain Definition Language, which is used to describe the data mining task from the planning perspective and which is the standard representation language in the planning community.

2.1 The Predictive Model Markup Language
The Predictive Model Markup Language (PMML) is a markup language for statistical and data mining models. It is known by the DM community, so we will focus here on the relevant aspects in relation to our work. It is based on XML, and it is composed of five main parts:
– The header contains general information about the file, like the PMML version, date, etc. – The data dictionary defines the meta-data, or the description of the input data or learning examples. – The transformation dictionary defines the functions applicable over the input data, like flattening, aggregation, computation of average or standard deviation, normalization, principal component analysis (PCA), etc. In our case, this knowledge defines the actions that can be applied over the data, which will be defined in the planning domain file. – The mining build task describes the configuration of the training run that produces the model instance. PMML does not define the content structure of this part of the file, so it could contain any XML value. This mining build task can be seen as the description of the sequence of actions executed to obtain the model, so from the perspective of planning, it can be understood as a plan. This plan would include the sequence of operators that should be executed over the initial data set to obtain the final model. – The model describes the final model generated after the data mining process, i.e. after executing the mining build task. There are different models that can be generated depending on the data analysis technique used, ranging from bayesian, to neural networks or decision trees. Depending on the type of model the description will use a different XML description. The learning algorithms, or model learning approaches, that are applicable over the transformed data are not included in the PMML file. These PMML models implement some functions, and depending on the function, they may introduce different constraints over the data (which will have to be considered in the planning domain file). For instance, a RuleSetModel can be used both for classification or regression, but depending on the implementation itself, only one of such functions may be available. If, for instance, only regression has been implemented, then the class attribute of the data set must be continuous if we want to apply the RuleSetModel. The complete list of models and functions defined in the PMML standard are defined in Table 1. Different models can implement different functions and, as introduced above, the applicability of the model to a function may depend on the implementation of the model itself and the advances of the state of the art, so they are not constrained “a priori”. A complete PMML file contains the information about the five parts. However, during the data mining process, not all the data is available. Therefore, we can distinguish three different kinds of PMML files, according to the step in the mole architecture in which they are filled: 1. PMMLd : is a PMML file containing only the header, the data dictionary and the transformation dictionary. This can be considered as the input of the data mining process. 2. PMMLp : is a PMML file containing the same information as in PMMLd , plus the mining build task. In fact, the mining build task can be seen as the steps or process that should be followed to obtain a model using the data described in the PMMLd file.
Table 1. Models and Functions defined in the PMML standard

Models: AssociationModel, ClusteringModel, GeneralRegressionModel, MiningModel, NaiveBayesModel, NeuralNetwork, RegressionModel, RuleSetModel, SequenceModel, SupportVectorMachineModel, TextModel, TreeModel
Functions:
AssociationRules Sequences Classification Regression Clustering
3. PMMLc: is a complete PMML file, including the input, process and model parts of the file.
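The three kinds of files can be told apart mechanically by inspecting which top-level elements are present. The following Python sketch is only an illustration, not part of the mole implementation; it assumes the usual PMML element names (Header, DataDictionary, TransformationDictionary, MiningBuildTask and the model elements listed in Table 1).

import xml.etree.ElementTree as ET

# Model element names as listed in Table 1 of the PMML standard.
MODEL_TAGS = {
    "AssociationModel", "ClusteringModel", "GeneralRegressionModel",
    "MiningModel", "NaiveBayesModel", "NeuralNetwork", "RegressionModel",
    "RuleSetModel", "SequenceModel", "SupportVectorMachineModel",
    "TextModel", "TreeModel",
}

def strip_ns(tag):
    """Drop a possible XML namespace prefix such as {http://...}TreeModel."""
    return tag.split("}")[-1]

def pmml_kind(path):
    """Return 'PMMLd', 'PMMLp' or 'PMMLc' for the given PMML file."""
    root = ET.parse(path).getroot()
    children = {strip_ns(child.tag) for child in root}
    has_task = "MiningBuildTask" in children
    has_model = any(tag in MODEL_TAGS for tag in children)
    if has_model and has_task:
        return "PMMLc"   # complete file: input, process and model
    if has_task:
        return "PMMLp"   # input plus the mining build task (the plan)
    return "PMMLd"       # header, data and transformation dictionaries only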
2.2 The Planning Domain Definition Language
An automated planning task is defined by two elements: (1) a set of actions that represents the state-transition function of the world (the planning domain), and (2) a set of facts that represents the initial state together with the goals of the automated planning task (the planning problem). These two elements are typically represented in languages derived from first-order logic. In 1998 the Planning Domain Definition Language (PDDL) was developed for the First International Planning Competition (IPC).1 Since then, PDDL has become the standard representation language of the automated planning community. According to the current PDDL specification [6], an action in the planning domain is represented by: (1) the parameters, or typed elements, involved in the execution of the action; (2) the action preconditions, which must hold for the action to be applicable; and (3) the action effects, the changes in the state after the action is applied. Figure 1 shows an example of an action described in PDDL. The action represents the typical DM operator for training a model given a data set. The elements involved are the model to be built (chosen from the list of available models, such as a treeModel), an instance of such a model (such as C4.5), the data set, the name of the field taken as the class, and the test mode that will be used. The preconditions check, among other things, that the field belongs to the data set and is categorical, and that the model implements a classification function. The effect of the operator is that the instance of the model has been trained, and that some variables or fluents have changed their values, such as the execution time, the understandability of the model, or the accuracy of the model obtained.
1 www.icaps-conference.org includes links to the competitions.
(:action train-classification
 :parameters (?mi - ModelInstance ?m - Model ?d - DataSet
              ?fi - FieldName ?dt - DataType ?t - TestMode)
 :precondition (and (learnable ?mi)
                    (is-model ?mi ?m)
                    (implements ?m classification)
                    (is-field ?fi ?d)
                    (dataDictionaryDataField ?fi categorical ?dt)
                    (eval-on ?d ?t))
 :effect (and (is-classification-model ?mi ?d ?fi)
              (not (preprocess-on ?d))
              (not (learnable ?mi))
              (increase (accuracy) (accuracy-model ?m))
              (increase (understandability) (understandability-model ?m))
              (increase (exec-time) (* (model-time ?m)
                                       (dataDictionaryNumberOfFields)))))
Fig. 1. A PDDL Operator for generating a classification model
3 The mole Architecture for Planning a Data Mining Process
This section describes the mole architecture, an architecture for automatically planning a DM process. First, we introduce the general architecture, and then we will describe its inputs/outputs as well as its components.

3.1 The General Architecture
The complete mole architecture is shown in Figure 2. It receives as input a DM task. Different tasks may require different DM schemas, so the most similar previously stored schema will be chosen by a DM Task Selector. Examples of such schemas are: a classification task with few unknown values; or a task in which the classes are unknown, but a classifier needs to be built; etc. Then, a planning domain and problem are generated, and the planner is executed. The planner outputs a plan, which is used as a DM workflow. This workflow is used by a DM Engine, which executes it, generating the answer to the user. This answer can be in terms of a classifier, a predicted error, or the class of a given example (if specified in the input). The user may provide feedback to the system, which can be used by mole to improve the planning process in the future by using machine learning techniques. A more detailed description of the modules follows.

3.2 Input and Output of the Architecture
The input of the architecture is a DM task. The goal of the architecture is to interpret such a task and produce an answer that satisfies the user's objectives.
The interpretation of the task is done by the DM Schema Selector, as will be explained in Section 3.3. Examples of tasks could be: recommend a service to a user given some previous customer service consumption, group users with respect to the services they use, predict customers' phone usage given their previous usage patterns and similar customers' usage patterns, etc. Once the architecture generates a model related to the DM task, the user can pose many specific queries to the model (for instance, by giving the data of a specific customer), and the architecture will use the generated model and a DM tool (weka, for instance) to answer the query (with a recommendation, for instance). The DM task will normally include a data set. The output of the architecture is twofold. On the one hand, it consists of a DM model that is able to answer DM queries in the proposed task, together with the evaluation results of such a model (for instance, its prediction accuracy). On the other hand, it can generate answers to queries by using that model and the DM Analyzer. Such answers may or may not satisfy the user, so the user can provide feedback to the system, which can be used by the DM Analyzer to ask the planner for a new answer, or to generate learning episodes that would improve future planning executions.
[Figure: block diagram connecting the DM Scheme Selector, Domain & Problem Generator, Planner (PLTool/Sayphi), DM Engine (weka), DM Analyzer, DM Visualizer (weka) and Machine Learning modules via the data-mining task and query, the data-mining scheme (PMMLd), the planning domain and problem (PDDL), the data-mining workflow (PMMLp), the model and results (PMMLc), learning episodes, the data-mining model and results, and the data-mining answer.]

Fig. 2. Overview of the mole architecture
3.3 The DM Schema Selector (DMSS)
The goal of the DMSS module is to select a DM schema from the data base that is similar to the input user task. Selecting a similar task to the input one is a challenging problem, and may require the use of different techniques. The DM Schema Data-Base (DMSDB) could be considered as a set of cases (as in Case-Based Reasoning) that can be generalized, stored, recovered and adapted to new cases. The way to generalize, store, recover and adapt the schemas are the research goals of this module. A simpler approach is to build a search mechanism or a task-centric browser that allows the user to select the closer schema to its query. Then, the selection of the schema is delegated to the user or to a DM engineer. The implementation of this approach requires the following steps: 1. Definition of a set of different DM schemas. This can be tackled with a knowledge extraction process with the end-users of the system, and may include defining a taxonomy. 2. Definition of a language to represent the DM schemas. The selection of the language depends on the taxonomy and complexity of use-cases. In Section 2.1 we propose to use the Predictive Model Markup Language (PMML). 3. Design and programming of an interface with the users that help them select/browse etc. the DM schemas, and choose the best one for the proposed task. In our current system, we do not have yet an implementation of this module, but there has been related work that address similar tasks as in [2,1,7,13]. We will build on those works to provide reasonable solutions to the user. 3.4
Domain and Problem Generator
This module takes as input a DM schema, and generates a planning domain and problem that will be used by the planner to generate a DM workflow. The knowledge required to generate the planning domain and problem comes from several sources, as the DM schema, the capabilities of the Data Mining Engine, etc. The Domain and Problem Generator defined in the architecture takes as input a PMML file, and outputs the planning domain and problem. However, this process can not be seen as a direct translation from PMML to PDDL, but the generation of the PDDL from different information sources. These information sources are the PMML file (data and transformation dictionaries), the PMML standard and the implementation of the DM engine. In addition, expert knowledge could be also included. Lastly, in the PDDL file there exists some planning knowledge which must also be supplied. All the information sources used in this process are described next: The PMML Standard. There is a lot of knowledge that can be extracted from the PMML standard itself. Thus, this knowledge is translated into PDDL, independently of the problem being tackled. For instance, the standard defines that there are models and functions. A model is a way to represent
the knowledge, like a neural network or a decision tree. A function is a data mining task, like classification, regression, etc. Thus, in the PDDL domain file, both models and functions are two kinds of objects (types in PDDL), and there are several instances of such objects defined as constants in the domain. The DM Engine. The DM engine is the module that executes the DM algorithms over the data set. Therefore, the engine constrains the operations that can be executed. Such knowledge can be included in a PDDL file as static predicates that indicate what algorithms implement what functions. For instance, we could define a predicate in PDDL as: (implement ?model - ?function - function) and then define in the initial state of the problems: (implement associationModel associationRules) The Data Dictionary. This knowledge depends on the specific data available for each DM process, and for each use-case. For instance, in a recommender system some attributes might appear, such as Previous Recommendation In addition, some facts can be derived from the data dictionary. For instance, the predicate: (dataDictionaryDataField ?f - FieldName ?o - OpType ?d - DataType) is used to define that in the data dictionary there is a field, ?f, which allows a type of operation ?o and that is of type ?d. The Transformation Dictionary. PMML defines several possible transformations over the data. However, the available transformations depend on the implementation of the architecture, as described above. Additionally, PMML allows the designer to define new transformations. Thus, we can assume that all the transformations will be given in the PMML file and we can define the additional ones in the corresponding domain/problem files. Planning Expert Knowledge. The PDDL files will also include knowledge about the planning process. For instance, if we have a function that normalizes the data, we will define a predicate, (normalized ?d - DataSet), that indicates whether the data set ?d is normalized or not. In addition, planning actions are created to perform the data mining activities. Some of them are derived from the DM theory itself, like to evaluate or test a model over some training data. Some operators could have different sources, like to apply a specific function (apply-transformation-normalization). 3.5
The Planning Module
The planning module takes the planning domain and problem, and outputs a plan, or DM workflow. There is a wide variety of languages for describing planning domain and problem files, and the language may depend on the planner used, but most state-of-the-art planners use the PDDL language, as described before. Since most planners that use the PDDL standard participate in the competition, they also use a standard output language (plan language). For each action in the solution, it should specify:
<starttime>: (<action-name> <param_1> <param_2> ... <param_k>) [<action-duration>]

where:
– <starttime> is the time when the action should be executed (not relevant for DM purposes in our current implementation);
– <param_1> <param_2> ... <param_k> is the list of parameters required by the operator;
– <action-duration> is the duration of the executed action.

Since PMML uses XML, we adhere to a language used by the probabilistic track of the 6th IPC, held in 2008, and translate each plan to its XML counterpart. Figure 3 shows an example of a plan in XML. The description of a plan follows the XML schema described in Figure 4. Since PMML only requires that the mining build task is defined in XML, the XML plan translated from the one generated by the planner can be integrated directly into the PMML file. This will generate the PMMLp version.
Fig. 3. XML definition of a plan

<xs:element name="Plan">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Operator" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="planCost" type="xs:double" use="required"/>
    <xs:attribute name="planLength" type="xs:double" use="required"/>
    <xs:attribute name="totalTime" type="xs:double" use="required"/>
  </xs:complexType>
</xs:element>
<xs:element name="Operator">
  <xs:complexType>
    <xs:sequence/>
    <xs:attribute name="data" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

Fig. 4. XML schema of a plan
The plan can also be translated to other languages, such as DM-engine-specific languages. For instance, weka provides the kfml language for specifying the DM workflow, which can later be executed within the tool. kfml is an XML-based language used to describe knowledge flows in weka. We have built a prototype version of the translator that takes as input the file containing a plan in IPC format, and the weka data file (in ARFF format). The translator generates the KFML file, which can be loaded in weka.
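To make this translation step concrete, the following Python sketch parses a plan in the IPC format shown in Section 4.2 and emits an XML structure along the lines of Figures 3 and 4. It is a hedged illustration only: the element and attribute names are simplified assumptions and do not reproduce the exact schema used by the prototype translator.

import re
import xml.etree.ElementTree as ET

# Matches IPC plan lines such as: 0: (LOAD-DATASET INITIALDATASET) [1]
LINE_RE = re.compile(r"\s*(\S+)\s*:\s*\(([^)]+)\)\s*(?:\[(\S+)\])?")

def ipc_plan_to_xml(plan_text):
    """Translate an IPC-format plan into a simple <Plan> XML element."""
    plan = ET.Element("Plan")
    length = 0
    for line in plan_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip empty lines and planner chatter
        start, body, duration = match.groups()
        tokens = body.split()
        op = ET.SubElement(plan, "Operator", name=tokens[0], startTime=start)
        if duration is not None:
            op.set("duration", duration)
        for param in tokens[1:]:
            ET.SubElement(op, "Param", value=param)
        length += 1
    plan.set("planLength", str(length))
    return ET.tostring(plan, encoding="unicode")

example = """
0: (LOAD-DATASET INITIALDATASET) [1]
1: (DATASETPREPARATION INITIALDATASET SPLIT) [1]
"""
print(ipc_plan_to_xml(example))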
3.6 The DM Engine
The DM Engine (DME) receives a DM workflow (plan) and uses it over the given data set. The goal of the DME is to execute the workflow and generate a model from the data. Additionally, the DME could generate an evaluation report of the model that could be used by the DM Analyzer to generate the user answer. Depending on the DME used, it may offer functionalities to the user such as:

1. Generate log files that can be used by a learning system able to improve future planning processes (see the learning module).
2. Generate a graphical description of the DM process, allowing the user to modify it. This graphical description can be visualized with the DM Workflow Visualizer.

3.7 The DM Workflow Visualizer
This module allows a user to visualize the DM workflow. Additional functionalities are the modification, loading and saving of workflows. In Section 4 we propose to use the weka Knowledge Flow tool, which allows executing the DM workflow from the visual interface. Therefore, the same tool can be used both for visualizing the DM process and for executing it.

3.8 The DM Analyzer
The DM Analyzer (DMA) generates the DM answer, taking as inputs the model generated by the DME and the user query. For instance, if the DME generates a decision tree, and the query specifies an example to be classified, the DMA returns the class of the example according to the model.

3.9 The Learning Module
The goal of the learning module is to improve the planning process with experience. Different learning opportunities have been found in the literature with respect to planning [16]. In the specific case of mole we could use techniques for: improving the planning process (such as macro-actions, control rules, or policies); improving the quality of the solution (plan); or automatically generating the domain files (not really applicable in this case). Some of them are already integrated in the planning system that we are using [8,4,5,9], though we do not report on them yet given that we have not performed experiments.
4 An Evaluation of the mole Architecture: Using weka and PLTool
This section describes the current instantiation of the mole architecture that we have implemented. Figure 2 showed the architecture, as well as how some of its modules have been implemented. The figure includes the representation language (PMML or PDDL) used for the knowledge flow in the architecture. It also includes an example of the instantiation of some modules, specifically using weka [15] as the DM Engine and Visualizer and PLTool [8] as the planning component. We describe some of this instantiation in more detail below.

4.1 Inputs: PDDL files
We have generated the PDDL domain and problem files. To generate these files, a PMML file corresponding to a service recommender task has been used. A high-level description of such a task was also used as expert knowledge, which is an additional knowledge source, as described previously. So far, the domain and problem files have been defined manually. As an example of a DM problem, we defined the following set of goals:

(and (normalized initialDataSet)
     (DMevaluated tree initialDataSet outByte training-set resu)
     (visualized-model tree graph)
     (visualized-result resu text))

The initial state contains information about the data set, its attributes, all the available DM models and instances of those models, the cost of each of those models, etc.

4.2 PLTool
Once the PDDL files are generated, we use PLTool to generate the plans. PLTool is a GUI that allows the use of any of several planning systems (such as Metric-FF [11], lpg [10], ipss [14], or Sayphi [4]). Some of them incorporate learning techniques that can also be used within the tool. When using the Sayphi planner, it generates the plans in two different formats:

– XML format, as described in Section 2
– IPC format, a format typically used in the planning competitions

A plan generated by Sayphi in the IPC format for the PDDL domain and problem files included in the appendix is:

0: (LOAD-DATASET INITIALDATASET) [1]
1: (APPLY-TRANSFORMATION-DISCRETIZATION INITIALDATASET DOUBLE CONTINUOUS OUTBYTE DOUBLE) [1]
2: (APPLY-TRANSFORMATION-NORMALIZE INITIALDATASET DOUBLE CONTINUOUS START INTEGER) [1]
3: (APPLY-TRANSFORMATION-ATTRIBUTE-SELECTION INITIALDATASET DOUBLE CONTINUOUS START INTEGER DISCRETIZE) [1]
4: (DATASETPREPARATION INITIALDATASET SPLIT) [1]
5: (TRAIN-CLASSIFICATION TREE TREEMODEL INITIALDATASET OUTBYTE INTEGER SPLIT) [1]
6: (VISUALIZE-MODEL TREE TREEMODEL GRAPH INITIALDATASET OUTBYTE) [1]
7: (TEST-CLASSIFICATION TREE INITIALDATASET INITIALDATASET OUTBYTE SPLIT RESULT) [1]
8: (VISUALIZE-RESULT TREE INITIALDATASET OUTBYTE SPLIT RESULT TEXT) [1]
The plan includes several actions that were not explicitly specified in the goals of the planning problem description. For instance, it applies a normalization, because this action is required for attribute selection. It also applies a discretization of the class attribute, because in the original data set the class is continuous, and the goal is to obtain a decision tree, which requires nominal classes. Figure 5 shows how weka represents the plan graphically, after it has been translated into kfml. This DM workflow can then be executed directly using the weka data mining engine, but it could also be modified by a data mining engineer in order to obtain more accurate results.
Fig. 5. An Image of weka Knowledge Flow Interface
5 Related Work
There has been previous work whose aim is similar to mole. As an example, inlen [12] can automatically pursue different data mining tasks according to a high-level plan. However, it is the user who defines the plan. Other systems, as aide [2], use a different planning approach closer to reaction: they continuously interact with the user suggesting the next action to be performed. This work presents three features that are interesting for this kind of tool, that we would like to expore in the future. First, they build upon a plan library (as it is also the case of [13,1]). In our case, this is related to the first discussed module (DM Schema Selection). Second, they use a hierarchical approach to planning as also used in [7] (higher level tasks, such as pre-processing data, can be mapped into lower level tasks, such as normalization followed by attribute selection). We have preferred to use a more standard language as PDDL (hierarchical planners cannot use PDDL directly), given the current population of planning systems using PDDL is much bigger. Also, using hierarchical planning means manually defining the decompositions, which might not be as trivial as it seems when interaction of DM processes appears. Third, they use a mixed-initiative reactive approach that interleaves planning and execution. In our case, we could devise such a scheme by extensively using interaction through the tool’s GUI, suggesting the next step to be performed, and executing it if the user so demands. ida [1] use an ontology to describe the knowledge on tasks, subtasks, and primitive actions. This would be similar to the use of a hierarchical planner. An interesting idea in this work is the dynamic generation of a ranking of solutions (plans in our case). We can also return this ranking by allowing planners to generate multiple solutions (for instance, Sayphi implements several search techniques with that capability, as DFBnB, or some kind of Anytime A∗ ). Since all planning solutions have associated values depending on the specified metric to be optimized (such as learning time, understandability of the output, accuracy, ...), solutions can be easily ranked. However, as other authors claim, obtaining multiple solutions can be costly. Finally, other works, as [3], also use hierarchical planning techniques, but they focus on a specific domain (image analysis).
6 Conclusions
DM is a complex task that encompasses the use of many different techniques, each appropriate for a specific subtask within the KDD process. When a user wants to extract interesting or relevant knowledge from a data set, s/he should select and apply the different DM techniques in order to obtain good results. However, this is a difficult task given the amount of different techniques available for each step of the DM process, each with specific costs (such as running time) or benefits (not always it is useful to apply normalization to numerical data). We have presented an architecture, mole, that integrates several modules, and whose goal is to alleviate the DM task and help the users. We have a first implementation of the architecture that integrates a planning tool, PLTool,
which incorporates several planning and learning techniques (not to be confused with machine learning techniques for DM), and weka, which incorporates many different DM capabilities.
Acknowledgements This work has been partially supported by the Spanish MEC project TIN200508945-C06-05, the Spanish MICIIN project TIN2008-06701-C03-03, the regional project CCG08-UC3M/TIC-4141, and the MOLE project granted by Ericsson.
References 1. Bernstein, F.P.A., Hill, S.: Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005) 2. Amant, R.S., Cohen, P.R.: Evaluation of a semi-autonomous assistant for exploratory data analysis. In: Proc. of the First Intl. Conf. on Autonomous Agents, Marina Del Rey, CA, pp. 355–362. ACM Press, New York (1997) 3. Chien, S.A., Mortensen, H.B.: Automating image processing for scientific data analysis of a large image database. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 854–859 (1996) 4. de la Rosa, T., Garc´ıa-Olaya, A., Borrajo, D.: Using cases utility for heuristic planning improvement. In: Weber, R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS, vol. 4626, pp. 137–148. Springer, Heidelberg (2007) 5. de la Rosa, T., Jim´enez, S., Borrajo, D.: Learning relational decision trees for guiding heuristic planning. In: Proceedings of ICAPS 2008, Sydney, Australia. AAAI Press, Menlo Park (2008) 6. Edelkamp, S., Hoffmann, J.: The language for the 2004 international planning competition (2004) 7. Engels, R.: Planning tasks for knowledge discovery in databases; performing taskoriented user-guidance. In: Proc. of the 2nd Int. Conf. on KDD (1996) 8. Fern´ andez, S., Borrajo, D., Fuentetaja, R., Arias, J.D., Veloso, M.: PLTOOL. A KE tool for planning and learning. Knowledge Engineering Review Journal 22(2), 153–184 (2007) 9. Garc´ıa-Dur´ an, R., Fern´ andez, F., Borrajo, D.: Learning and transferring relational instance-based policies. In: Taylor, A.F.M., Driessens, K. (eds.) Working Notes of the AAAI 2008 workshop on Transfer Learning for Complex Tasks, Chicago, IL, USA, pp. 19–24. AAAI Press, Menlo Park (2008); Technical Report WS-08-13 10. Gerevini, A., Saetti, A., Serina, I.: Planning through stochastic local search and temporal action graphs. Journal of Artificial Intelligence Research 20, 239–290 (2003) 11. Hoffmann, J.: The Metric-FF planning system: Translating “ignoring delete lists” to numeric state variables. Journal of Artificial Intelligence Research 20, 291–341 (2003) 12. Michalski, R.S., Kaufman, K.A.: Discovery planning: Multistrategy learning in data mining. In: Proceedings of the Fourth International Workshop on Multistrategy Learning, pp. 14–20 (1998)
13. Morik, K., Scholz, M.: The MiningMart Approach to Knowledge Discovery in Databases. In: Intelligent Technologies for Information Analysis, pp. 47–65. Springer, Heidelberg (2004) 14. Rodr´ıguez-Moreno, M.D., Oddi, A., Borrajo, D., Cesta, A.: IPSS: A hybrid approach to planning and scheduling integration. IEEE Transactions on Knowledge and Data Engineering 18(12), 1681–1695 (2006) 15. Witten, I., Frank, E.: Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco (2000) 16. Zimmerman, T., Kambhampati, S.: Learning-assisted automated planning: Looking back, taking stock, going forward. AI Magazine 24(2), 73–96 (Summer 2003)
Predictions with Confidence in Applications

Mikhail Dashevskiy and Zhiyuan Luo

Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
[email protected], [email protected]
Abstract. Many applications require predictions with confidence. We are interested in Confidence Machines which are algorithms that can provide some measure on how confident they are that their output is correct. Confidence Machines are quite general and there are many algorithms solving the problem of prediction with confidence. As predictors we consider Venn Probability Machines and Conformal Predictors. Both of these algorithms rely on an underlying algorithm for prediction and in this paper we use two simple algorithms, namely the Nearest Neighbours and Nearest Centroid algorithms. Our aim is to provide some guidelines on how to choose the most suitable algorithm for a practical application where confidence is needed. Keywords: Prediction with Confidence, Machine Learning, Venn Probability Machine, Conformal Predictors, Probabilistic Prediction, Confidence Machines.
1 Introduction In many practical applications people want to know what will happen in the future. To cope with the huge amount of data, researchers employ computers and computer learning algorithms in order to process the data and learn on it. Modern computer learning algorithms achieve good performance, but some of them lack information on how confident they are that the output they give is correct. The problem of classification (the case of prediction, where the number of possible outcomes is finite) has many applications and approaches. One can, for example, use Bayesian techniques ([6]), Decision Trees ([7]) or other popular methods. However, most of these techniques do not provide confidence with their predictions and even if they provide it, it is not clear how to choose an adequate algorithm for a particular application. We approach the problem by simply choosing some algorithms and experimentally comparing them in different settings. In this paper we consider two prediction algorithms which provide some information on how reliable the output is. This information can be used further by some decision making systems. Firstly, we consider Venn Probability Machines, a probabilistic predictor which can be used in the problem of classification. This algorithm is based on the idea of dividing examples in groups using some measure of their similarity and then calculating probability that the new example has one of the possible labels. This predictor provides lower and upper bounds on the probability that its output is correct. P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 775–786, 2009. c Springer-Verlag Berlin Heidelberg 2009
Secondly, we consider Conformal Predictors (in the classification setting) and compare them with Venn Probability Machines. The algorithm of Conformal Predictors (CP) is based on the idea of using a nonconformity measure (formal definition of this measure is given in Section 3). Note that Conformal Predictors do not give probabilities on their correctness, but some confidence, which reflects the quality of the predictions. However, researchers have been using these confidence measurements for a long time now and these measurements are considered as useful information about predictions. One of the main differences between Venn Probability Machines and Conformal Predictors is that Venn Probability Machines give one label as their output whilst Conformal Predictors give region predictions (it means that their output can consist of zero, one, and so on up to the total number of possible labels elements). We can also force Conformal Predictors to give only one label as their output by simply choosing the label with the highest confidence. However, in this case we will not have a guarantee on the performance of the algorithm. Conformal Predictors can be used on data under the i.i.d. (independent and identically distributed) assumption which is a common restriction on data in Machine Learning. In order to have more uniform data, we randomly permute the datasets used in our experiments.
2 Venn Probability Machine Venn Probability Machines are based on the idea of dividing examples into groups and, when a new object arrives, somehow assigning it to one of the groups. Then we can use the frequencies of labels in the group containing the current object as probabilities for the new object’s label. More formally, let X be the object space and Y the label space (the space of the classes) and Z = X × Y . And let N be a predefined number of examples used as the input by the Venn Probability Machine. Venn taxonomy is a sequence An , n = 1, . . . , N , where each An is a finite partition of the space Z (n−1) × Z. The expression Z (n−1) × Z here means that An does not depend on the order of the first n−1 arguments. Let An (ω) be the element of the partition (An ) that contains ω ∈ Z (n−1) × Z. With every taxonomy (An ) we associate a Venn Probability Machine. After choosing a taxonomy, we can start making predictions. Each prediction is a class associated with the current example by the Venn Probability Machine (VPM). Consider the following protocol. Let us call the examples’ possible class (Pn ). On each step reality makes a move xi , the classifying system has to assign the new example to one of the classes Pn . Firstly, we consider a class y ∈ Y and partition {(x1 , y1 ), . . . , (xn , y)} into categories, assigning two points in Z to the same category if and only if An ({z1 , . . . , zi−1 , zi+1 , . . . , zn }, zi ) = An ({z1 , . . . , zj−1 , zj+1 , . . . , zn }, zj ) where zi = (xi , yi ), i = 1, . . . , n − 1 and zn = (xn , y). In this case we assume that {z1 , . . . , zk } means a multiset (a bag), i.e. a set of elements, where each element has a multiplicity, i.e. a natural number indicating how many memberships it has in the multiset.
The category T containing z_n = (x_n, y) is nonempty. Let p_y be the empirical probability distribution of the labels in this category T:

\[ p_y\{y'\} := \frac{|\{(x^*, y^*) \in T : y^* = y'\}|}{|T|}, \qquad y' \in Y. \]
This is a probability distribution on Y. The Venn Predictor determined by the taxonomy is the multiprobability predictor P_n := {p_y : y ∈ Y}. The set P_n consists of between one and |Y| distinct probability distributions on Y. For each probability distribution and each possible label we get the probability (frequency) that this label is the correct classification of the current example. From this matrix (with the number of rows and the number of columns both equal to the size of the label space) we can get, for each label, a probability interval that this label is true (as the interval between the minimum and the maximum probabilities of this event). All this information is the output of the Venn Probability Machine. The prediction here is the label with the highest probability of being correct (e.g. with the largest mean of the probability interval), see Algorithm 1. More information on Venn Probability Machines and Conformal Predictors (briefly described in the next section) can be found in [9].

Algorithm 1. Venn Probability Machine Algorithm
Require: Parameter (A_n) {taxonomy}
for n = 2, 3, ... {on each step of the algorithm; we start counting from 2 because there is no sense in making a classification without a training set} do
  for j = 1, ..., NumberOfLabels {for each possible label} do
    Assign y_n = j {assume that the current example has label j}
    Calculate A_n({z_1, ..., z_{n-1}}, z_n) {determine into which category the current example falls}
    Create the set B = {z_i : A_n({z_1, ..., z_{i-1}, z_{i+1}, ..., z_n}, z_i) = A_n({z_1, ..., z_{n-1}}, z_n)} {the category to which the current example belongs}
    Calculate α_i = #{z_k ∈ B : y_k = i} / #B {the frequency of each label i in the category}
    Give (min_i α_i, max_i α_i) as the probability interval that the current example has label j
  end for
end for
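The following Python sketch makes Algorithm 1 concrete for the "voting" 1-NN taxonomy of Section 4.1 (two examples fall into the same category when the labels of their nearest neighbours coincide). It is an illustrative sketch only, not the authors' implementation; plain Euclidean distance and numpy arrays are assumptions.

import numpy as np

def nearest_label(x, X, y):
    """Label of the nearest neighbour of x among the rows of X (Euclidean)."""
    return y[int(np.argmin(np.linalg.norm(X - x, axis=1)))]

def venn_intervals(X_train, y_train, x_new, labels):
    """Lower/upper probability for each label of x_new under a 1-NN 'voting' taxonomy."""
    rows = []
    for j in labels:
        X = np.vstack([X_train, x_new])           # complete the bag with (x_new, j)
        y = np.append(y_train, j)
        # Category of every example: the label of its nearest neighbour,
        # computed with the example itself left out of the bag.
        cats = np.array([nearest_label(X[i],
                                       np.delete(X, i, axis=0),
                                       np.delete(y, i))
                         for i in range(len(y))])
        in_cat = y[cats == cats[-1]]              # examples sharing x_new's category
        rows.append([np.mean(in_cat == lab) for lab in labels])
    probs = np.array(rows)                        # one row per tentative label j
    return {lab: (probs[:, k].min(), probs[:, k].max())
            for k, lab in enumerate(labels)}

For a binary problem the labels would be, for instance, [0, 1], and the single-label prediction is the label whose interval has the largest mean.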
2.1 Region Prediction for Venn Probability Machine At each step Venn Probability Machines provide one prediction corresponding to the highest probability. Confidence Machines considered in the next section of this paper, namely Conformal Predictors, can output region predictions. To compare the two algorithms we want to convert probabilistic predictions of Venn Probability Machine into confidence region predictions. Intuitively, we use the fact that the probability predictions are estimates of conditional probabilities and also assume that labels are mutually exclusive. Therefore summing these predictions becomes a conditional probability of a conjunction of labels, which can be used to choose the labels to include in the
confidence region predictions at the desired confidence level. Consider the case when a Venn Probability Machine outputs a sequence p_1, p_2, ..., p_l, where p_i is the probability that label i is correct. Let us consider the case where p_1 > p_2 > ... > p_l; if this is not true we can always rename the labels. Now, to make a region prediction we carry out the following procedure: we output as possible labels all labels $i$ such that $\sum_{j=1}^{i} p_j \le 1 - \epsilon$, where $\epsilon$ is the significance level, i.e. the level of mistakes we can tolerate. For the label $n$ such that $\sum_{j=1}^{n-1} p_j \le 1 - \epsilon$ but $\sum_{j=1}^{n} p_j > 1 - \epsilon$ we need a more sophisticated approach: we output label $n$ as a possible label if and only if $rand() > (1 - \epsilon - \sum_{j=1}^{n-1} p_j)/p_n$, where $rand()$ is a random number between 0 and 1. The reason for this condition is that we want a label from the output to be correct with a predefined probability $1 - \epsilon$. A more rigorous description of these steps is given in Algorithm 2.
Algorithm 2. Region Predictions for Venn Probability Machines Algorithm
Require: Parameter significance level ε > 0; (p_1, p_2, ..., p_l) : p_1 > p_2 > ... > p_l
P = 0
i = 1
EXIT_FLAG = 0
while (P < 1 − ε) AND (EXIT_FLAG == 0) do
  if P + p_i > 1 − ε then
    EXIT_FLAG = 1
    if (rand() > (1 − ε − P)/p_i) then
      add label i to the set of predictions
    end if
  else
    P = P + p_i
    add label i to the set of predictions
  end if
  i = i + 1
end while
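A direct Python transcription of Algorithm 2 could look as follows; the point probabilities are assumed to be given (for instance, the means of the Venn intervals), and the random draw is left injectable for testing. This is a sketch of the procedure above, not the authors' code.

import random

def vpm_region(probs, epsilon, rng=random.random):
    """Region prediction from label probabilities at significance level epsilon."""
    ordered = sorted(probs.items(), key=lambda kv: -kv[1])   # decreasing probability
    region, mass = [], 0.0
    for label, p in ordered:
        if mass >= 1 - epsilon:
            break
        if mass + p > 1 - epsilon:
            # Borderline label: include it only with a certain probability,
            # following the condition rand() > (1 - epsilon - P) / p_i above.
            if rng() > (1 - epsilon - mass) / p:
                region.append(label)
            break
        region.append(label)
        mass += p
    return region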
3 Conformal Predictors

Conformal Predictors is a Machine Learning technique for making predictions (classifications) according to how similar the current example is to the representatives of the different objects' classes. The algorithm is based on the idea of a Nonconformity Measure, a function which gives some measure of the dissimilarity of an example to other examples. The higher the value of this function on an example, the more unlikely it is that this example belongs to the selected group of examples. To exploit this idea, we can assign different labels to the current example, calculate the dissimilarity of this example to other examples with the same label, calculate the dissimilarity of this example to the examples with
other labels, and use these values to classify the example. In order to describe Conformal Predictors more precisely we need to introduce some formal definitions. First we define a Nonconformity Measure. For each n = 1, 2, ... we define the function $A_n : Z^{n-1} \times Z \to \mathbb{R}$ as the restriction of A to $Z^{n-1} \times Z$. The sequence $(A_n : n \in \mathbb{N})$ is called a Nonconformity Measure. To give an idea of what a Nonconformity Measure is, we need to use the term bag, which was defined in the previous subsection. Given a Nonconformity Measure $(A_n)$ and a bag of examples $z_1, \dots, z_n$ we can calculate the nonconformity score

\[ \alpha_i = A_n(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\}, z_i) \qquad (1) \]
for each example in the bag. Here we use the notation from the previous section: $z_i = (x_i, y_i)$, $x_i \in X$, the Object Space, $y_i \in Y$, the Label Space, $Z = X \times Y$. The nonconformity score is a measure of how dissimilar an example is from a group of examples. For example, when the 1-NN algorithm is used as the underlying algorithm in Conformal Predictors (the algorithm used to calculate the nonconformity score), the nonconformity score can be calculated as

\[ A_n(z_1, \dots, z_n, z) = \frac{\min_{i \in \{1,\dots,n\},\, y_i = y} d(x_i, x)}{\min_{i \in \{1,\dots,n\},\, y_i \neq y} d(x_i, x)}; \]

this means that we consider an example as nonconforming if it lies much further from the selected group of objects than from all other examples (those not from the selected group). Now, to use the nonconformity scores, we want to compare the current example (which we are trying to classify) with the examples from the training set. In this way, we can see how the current example fits into the "whole picture" of the dataset. We now introduce the term p-value: the number

\[ \frac{|\{j = 1, \dots, n : \alpha_j \ge \alpha_n\}|}{n} \]

is called the p-value of the object $z_n = (x_n, y_n)$ (see Algorithm 3).
Algorithm 3. Conformal Predictor Algorithm for Classification
Require: Parameter ε {significance level}
for n = 2, 3, ... {on each step of the algorithm} do
  for j = 1, ..., NumberOfLabels {for each possible label} do
    Assign y_n = j {assume that the current example has label j}
    Calculate α_i, i = 1, ..., n {calculate the nonconformity scores using (1)}
    Calculate p_j = |{i = 1, ..., n : α_i ≥ α_n}| / n {the p-value corresponding to the current possible label}
    Output j as a predicted label of the current example, with p-value p_j, if and only if p_j > ε
  end for
end for
The Conformal Predictor receives a significance level ε (a small positive real number reflecting the confidence we want to achieve; it equals one minus the confidence level) as a parameter. To output a label, the Conformal Predictor calculates the p-values of the pairs (x_n, possible label) and outputs all pairs whose p-values are greater than the significance level. To have a guarantee on the performance we have to introduce a modification of Conformal Predictors, namely Smoothed Conformal Predictors. This algorithm calculates the p-value, which is called the smoothed p-value, as follows:

\[ p_n = \frac{|\{j = 1, \dots, n : \alpha_j > \alpha_n\}| + \tau_n\,|\{j = 1, \dots, n : \alpha_j = \alpha_n\}|}{n}, \]
where (τ_1, ..., τ_n) is a vector of independent random variables uniformly distributed on [0, 1]. If a sequence of examples is generated by an exchangeable probability distribution (i.e. the probability of occurrence of two sequences in which two examples are swapped is the same) and the probability of the prediction system making a mistake at each step is independent and equal to a predefined level, then such a prediction system is called exactly valid.

Proposition 1. (see Proposition 2.4, [9]) Any smoothed conformal predictor is exactly valid.

This is the theoretical guarantee on the performance of smoothed conformal predictors. Conformal Predictors output region predictions, i.e. sets of possible labels for the new object. If we want to choose only one label as the prediction, it is reasonable to choose the label corresponding to the highest p-value. In this case, however, we do not have any theoretical guarantee on the performance of the algorithm.
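The following Python sketch computes smoothed p-values and the corresponding region prediction for a new object, given any nonconformity measure (for instance, those of Section 4). It is only an illustration of the definitions above, not the authors' implementation; the function names are ours.

import random

def smoothed_p_value(scores):
    """Smoothed p-value of the last score in the list."""
    alpha_n = scores[-1]
    greater = sum(a > alpha_n for a in scores)
    equal = sum(a == alpha_n for a in scores)
    return (greater + random.random() * equal) / len(scores)

def conformal_region(train, x_new, labels, nonconformity, epsilon):
    """Labels whose smoothed p-value exceeds the significance level epsilon.

    train: list of (x, y) pairs; nonconformity(bag, example) -> score,
    where bag is the list of all other examples in the extended bag.
    """
    region = []
    for y in labels:
        extended = train + [(x_new, y)]
        scores = [nonconformity(extended[:i] + extended[i + 1:], extended[i])
                  for i in range(len(extended))]
        if smoothed_p_value(scores) > epsilon:
            region.append(y)
    return region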
4 Underlying Algorithms

In this section we describe the two underlying algorithms which we used to build Venn taxonomies and nonconformity measures. These algorithms, namely the Nearest Neighbours and Nearest Centroid algorithms, are widely used in Machine Learning (see [1]) and have a somewhat similar nature, i.e. they both make a prediction based on the distance from the object to other objects.

4.1 Nearest Neighbours

The Nearest Neighbours (NN) algorithm is one of the simplest methods in Machine Learning for making predictions. For example, when the number of nearest neighbours is 1 (the simplest case), the algorithm consists in finding the current example's nearest neighbour using a predefined distance function and assigning that neighbour's label to the current example. The algorithm can be extended to the case with several nearest neighbours. In this case, the algorithm finds the n nearest neighbours, where n is the number of nearest neighbours, and makes a decision on the current example's label according to their labels; e.g. the algorithm can choose the most common label among the found label set. More information on the NN algorithm can be found, for example, in [9].
In our experiments with Venn Probability Machines we used two types of taxonomies based on the NN algorithm. One type, called "voting", says that two objects belong to the same taxonomy category if and only if for each object the most frequent label among its N nearest neighbours (N being the parameter of the NN algorithm) is the same. The second type of taxonomy (called "all info") says that two objects belong to the same taxonomy category if and only if for each of them the ordered sequence of its N nearest neighbours' labels is the same. In this paper we use the Euclidean distance measure:

\[ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}. \]
For Conformal Predictors we want to scale the distance to the nearest neighbour and compare relative values rather than absolute distances. We do it as follows:

\[ \alpha_i = \alpha_i(x_i, y_i) = \frac{\min_{y_j = y_i} d(x_j, x_i)}{\min_{y_j \neq y_i} d(x_j, x_i)}. \]
In the case of several nearest neighbours we take their mean value instead of x_j. Again, we look for nearest neighbours with the same label as (x_i, y_i) to calculate the numerator of the fraction and for those with different labels to calculate the denominator.

4.2 Nearest Centroid

Unlike the NN algorithm, the Nearest Centroid (NC) algorithm does not require computing distances between each pair of objects. Instead it measures the distance between an object and the centroid (i.e. the mean value) of all objects belonging to the same class. For Venn Probability Machines we can say that two objects belong to the same taxonomy category if and only if their nearest centroids coincide. The nonconformity score can be calculated as follows:

\[ \alpha_i = \alpha_i(x_i, y_i) = \frac{d(\mu_{y_i}, x_i)}{\min_{y \neq y_i} d(\mu_y, x_i)}, \]
where μy is the centroid of all objects with label y, i.e. their mean value.
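Hedged Python sketches of both nonconformity measures, matching the fractions above, are given below (single nearest neighbour, Euclidean distance; whether the example itself contributes to its own class centroid is an assumption of the sketch). They plug directly into the conformal_region sketch given earlier.

import numpy as np

def nn_nonconformity(bag, example):
    """Distance to the nearest neighbour with the same label divided by the
    distance to the nearest neighbour with a different label."""
    x, y = example
    x = np.asarray(x)
    same = [np.linalg.norm(np.asarray(xj) - x) for xj, yj in bag if yj == y]
    other = [np.linalg.norm(np.asarray(xj) - x) for xj, yj in bag if yj != y]
    if not same or not other:
        return float("inf")    # no neighbour available on one side
    return min(same) / min(other)

def centroid_nonconformity(bag, example):
    """Distance to the own-class centroid divided by the distance to the
    nearest centroid of any other class."""
    x, y = example
    x = np.asarray(x)
    by_label = {}
    for xj, yj in bag + [example]:
        by_label.setdefault(yj, []).append(np.asarray(xj))
    centroids = {lab: np.mean(pts, axis=0) for lab, pts in by_label.items()}
    own = np.linalg.norm(centroids[y] - x)
    others = [np.linalg.norm(mu - x) for lab, mu in centroids.items() if lab != y]
    return own / min(others) if others else float("inf")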
5 Experiments

We use three datasets in our experiments, namely the United States Postal Service (USPS), a Network Traffic Classification (NTCD) and the Gamma Telescope (GTD) dataset. All these datasets are publicly available. USPS is a well-known dataset which is used by many researchers to run, test and compare their prediction algorithms. The dataset is a set of hand-written digits in the form of 16 × 16 matrices with entries in the interval (−1, 1) that describe the brightness of individual points, together with the example's class (a digit 0–9). It consists of 7291 training examples and 2007 test examples. It is well known that this dataset is not uniform.
In order to make the experiments more fair, the dataset was randomly permuted and we took 1000 examples as the training set and 500 examples as the test set. The Network Traffic Classification dataset (NTCD) is a collection of thousands of examples, from which we used only the first 1500 examples of the randomly permuted file entry03.weka.allclass.arff (details on this dataset can be found in [4]): 1000 as the training set and 500 as the test set. The collection consists of hand-classified network packages with 10 parameters (we used the datasets after feature selection, obtained by other researchers [4]). The possible labels include WWW, EMAIL, FTP, P2P and other types of network traffic. The packages' parameters include Server Port Number, Maximum of total bytes in IP packet, Maximum of bytes in (Ethernet) packet and others. As is pointed out in [5], the simplest type of network traffic classification, which is by port number, is not as good as Machine Learning techniques, and here we try to contribute to solving this problem. The Gamma Telescope dataset (GTD) is Monte-Carlo generated to simulate the registration of high-energy gamma particles in an atmospheric Cherenkov telescope (see [2]). Each observation consists of 10 parameters, such as the major and minor axes of the ellipse and the 10-log of the sum of the content of all pixels, and a label 0 (signal) or 1 (background), i.e. there are only two classes. As with the other datasets, we again take 1000 examples as the training set and 500 examples as the test set. The underlying algorithm for both Conformal Predictors and Venn Probability Machines was chosen to be the Nearest Neighbours (with the number of neighbours equal to 1 and 3) and Nearest Centroid algorithms. Both confidence machines were run in two modes: when they give only one prediction as output and when they are allowed to give region predictions as output.
6 Results

In this section we show the experimental results and discuss them. The section is divided into two subsections, one devoted to one-label predictions and the other to region predictions. We ran the experiments in the online setting, i.e. we ran the algorithms on the training sets first and then, at each step after making a prediction, we added the current object with its true label to the training set. This construction allows the prediction system to improve its performance over time.

6.1 One-Label Predictions

Here we compare the two confidence machines when they are allowed to give only one prediction. Table 1 provides a summary of the results. It can be seen that the third dataset is very difficult to classify even though it consists only of objects from two classes. It can also be noticed that the performance of Conformal Predictors and Venn Probability Machines is very similar.

6.2 Region Predictions

Tables 2 and 3 show the results for the case where the confidence machines are allowed to give multiple predictions. Here we are more interested in comparing the so-called efficiency, or in our case the average number of predicted labels at each step; a sketch of the evaluation loop that produces these numbers is given below.
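The following Python sketch illustrates the online evaluation loop in the region-prediction mode; region_predict stands for any region predictor, such as the conformal or Venn region predictors sketched earlier, and the function and its name are illustrative assumptions rather than the authors' code.

def online_region_evaluation(X, y, region_predict, initial_train=1000):
    """Online protocol: predict a region for each test example, record whether
    the true label is missed and how many labels were output, then add the
    example with its true label to the training set."""
    train = list(zip(X[:initial_train], y[:initial_train]))
    errors, sizes = 0, []
    for x_new, y_true in zip(X[initial_train:], y[initial_train:]):
        region = region_predict(train, x_new)
        errors += int(y_true not in region)   # error: true label not in the region
        sizes.append(len(region))             # efficiency: size of the region
        train.append((x_new, y_true))         # grow the training set online
    n_test = len(sizes)
    return errors / n_test, sum(sizes) / n_test   # error rate, avg. predicted labels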
Table 1. Error rates for CP and VPM, one-label prediction

Algorithm            USPS   NTCD   GTD
VPM, voting-NN-1     0.076  0.018  0.242
VPM, voting-NN-3     0.082  0.024  0.224
VPM, all info-NN-1   0.076  0.018  0.242
VPM, all info-NN-3   0.124  0.03   0.224
VPM, centroid        0.184  0.07   0.268
CP, NN-1             0.076  0.024  0.244
CP, NN-3             0.064  0.016  0.226
CP, centroid         0.188  0.224  0.268
Table 2. Error rates for CP and VPM, region prediction, 90% confidence level

Algorithm            USPS   NTCD   GTD
VPM, voting-NN-1     0.114  0.088  0.096
VPM, voting-NN-3     0.112  0.08   0.11
VPM, all info-NN-1   0.112  0.09   0.09
VPM, all info-NN-3   0.174  0.102  0.094
VPM, centroid        0.126  0.096  0.088
CP, NN-1             0.12   0.13   0.084
CP, NN-3             0.104  0.114  0.1
CP, centroid         0.104  0.114  0.114
Table 3. Average number of predicted classes (out of 10, 8 and 2 classes respectively) for CP and VPM, 90% confidence level

Algorithm            USPS  NTCD  GTD
VPM, voting-NN-1     0.98  0.93  1.61
VPM, voting-NN-3     1.05  0.94  1.55
VPM, all info-NN-1   0.98  0.93  1.58
VPM, all info-NN-3   1.07  0.94  1.44
VPM, centroid        2.08  1.26  1.59
CP, NN-1             0.91  0.87  1.43
CP, NN-3             0.93  0.89  1.30
CP, centroid         1.29  1.31  1.39
Ideally the number should be zero: we would like our prediction system to output only one label at each step. However, here we are interested in having a guaranteed performance. From Table 3 it can be seen that Conformal Predictors are slightly more efficient than Venn Probability Machines. Figure 1 shows an example of classical experimental results for Venn Probability Machines on USPS data in the online setting. In the long run the total number of errors lies between the cumulative lower and upper bounds. This property of Venn Probability Machines guarantees the prediction quality. A decision making system which employs a
[Figure: cumulative number of errors (y-axis) against the number of examples processed (x-axis, 0 to 500), together with the cumulative upper and lower bounds.]

Fig. 1. Cumulative number of errors and upper and lower bounds, USPS, Nearest Centroid, single predictions
[Figure: number of errors (y-axis) against the number of examples processed (x-axis, 0 to 500) for significance levels 1%, 5% and 10%.]

Fig. 2. Number of errors for different significance levels, NTCD, Nearest Centroid, region predictions
Venn Probability Machine can use the information on the probability that the prediction is correct to improve its performance, for example by ignoring a prediction if this probability is too low. Figure 2 represents a classical experimental result for Conformal Predictors. The graph of the number of errors over time should be a straight line, and the percentage of errors at each point should be approximately equal to the significance level. In many applications we can estimate a threshold on the number of errors we can tolerate. Conformal Predictors are a guaranteed method (i.e. there exists a theorem saying that the performance is guaranteed, see [9]) of making predictions with a predefined number of errors we can tolerate. As can be seen from the figure, the graphs of the number of errors are not exactly straight lines, but this is due to statistical fluctuations, as the result on Conformal Predictors is probabilistic.
7 Conclusion In this paper we compared two types of Confidence Machines, namely Venn Probability Machines and Conformal Predictors. We investigated how these algorithms perform on different datasets and in different settings. The main conclusion which can be drawn from the experimental results is that a researcher willing to choose between these two algorithms should consider the one which has the theoretical guarantee on its performance, i.e. if a single prediction is required then it is reasonable to choose the Venn Probability Machine and in the case of region predictions it is better to choose Conformal Predictors. However, it is important to choose the underlying algorithm carefully as if it cannot perform well on the initial dataset both Venn Probability Machines and Conformal Predictors will perform poorly. Another conclusion we could draw from the results is that the Nearest Centroid algorithm performs worse than the Nearest Neighbours algorithm, at least on our datasets, and should be avoided where possible. The results presented in this paper suggested several directions for further research. The research will be concentrated in two areas; firstly on combining several underlying algorithm (for example by methods of Prediction with Expert Advice, see [8]) so that the overall performance of a Confidence Machine does not depend on the particularities of the dataset. The other area of interest is finding a way of combining the two Confidence Machines considered in this paper so that we have a guarantee on performance for both one-label and region predictions. It is clear that we did not cover the full range of algorithms providing confidence with their output. It would be beneficial to add more algorithms, such as Bayesian methods or Probabilistic SVMs, to this study but we leave it for further research.
Acknowledgement This work was supported by EPSRC through grant EP/E000053/1, “Machine Learning for Resource Management in Next-Generation Optical Networks”.
References 1. Bellotti, T., et al.: Qualified Predictions For Microarray And Proteomics Pattern Diagnostics With Confidence Machines. International Journal of Neural Systems 15(4), 1–12 (2005) 2. Heck, D., et al.: A Monte Carlo code to simulate extensive air showers. Forschungszentrum Karlsruhe FZKA 6019 (1998) 3. Moore, A.W.: Discrete content-based classification — a data set. Technical report, Intel. Research, Cambridge (2005) 4. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: SIGMETRICS (2005) 5. Moore, A.W., Papagiannaki, D.: Toward the accurate identification of network applications. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 41–54. Springer, Heidelberg (2005) 6. Partridge, D., et al.: Classification with Confidence for Critical Systems. Developments in Risk-based Approaches to Safety 6, 231–239 (2007) 7. T´oth, N., Pataki, B.: On Classification Confidence and Ranking Using Decision Trees. In: Proceedings of 11th International Conference on Intelligent Engineering Systems (2007) 8. Vovk, V.: A game of prediction with expert advice. Journal of Computer and System Sciences 56, 153–173 (1998) 9. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic learning in a random world. Springer, Heidelberg (2005)
Aligning Bayesian Network Classifiers with Medical Contexts

Linda C. van der Gaag1, Silja Renooij1, Ad Feelders1, Arend de Groote2, Marinus J.C. Eijkemans2,3, Frank J. Broekmans2, and Bart C.J.M. Fauser2

1 Department of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
2 Department of Reproductive Medicine and Gynaecology, Utrecht Medical Centre, Heidelberglaan 100, 3584 CS Utrecht, The Netherlands
3 Department of Public Health, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
Abstract. While for many problems in medicine classification models are being developed, Bayesian network classifiers do not seem to have become as widely accepted within the medical community as logistic regression models. We compare first-order logistic regression and naive Bayesian classification in the domain of reproductive medicine and demonstrate that the two techniques can result in models of comparable performance. For Bayesian network classifiers to become more widely accepted within the medical community, we feel that they should be better aligned with their context of application. We describe how to incorporate well-known concepts of clinical relevance in the process of constructing and evaluating Bayesian network classifiers to achieve such an alignment. Keywords and Phrases: learning Bayesian network classifiers, logistic regression, medical alignment.
1 Introduction Bayesian network classifiers are stochastic models that describe the relationship between one or more feature variables and a class variable, and provide for establishing posterior probabilities of the various classes for a given instance of the feature variables. Numerous applications of Bayesian network classifiers exist. Yet, within the medical field where most diagnostic problems can be considered classification problems, such classifiers are hardly ever used. Stated informally, in a diagnostic medical problem, patients have to be assigned to one of a usually small number of distinct diagnostic classes based upon the patient’s characteristics. A similar observation holds for many problems that are prognostic in nature. In the domain of reproductive medicine, for example, patients have to be classified as elective or non-elective for single embryo transfer upon in vitro fertilisation. To support physicians in taking classification decisions about individual patients, the most commonly employed models in the medical community are based on the technique of logistic regression. Logistic regression serves to construct, from a set of available patient data, a model which, like a Bayesian network classifier, describes the relationship between the feature variables involved and a class variable, and provides for establishing posterior probabilities of the various classes. P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 787–801, 2009. c Springer-Verlag Berlin Heidelberg 2009
Bayesian network classifiers have a number of advantages over logistic regression models and their generalisations, which should render them attractive alternatives for the medical field. A major advantage of Bayesian network classifiers lies in their ability to give reliable classification results even if evidence is available for only a subset of the feature variables. Bayesian network classifiers moreover provide a graphical representation of the independences between the modelled variables, which allows for transparency and ease of interpretation of the models and their parameters. Bayesian network classifiers further range from the simplest type of model, the naive Bayesian classifier which makes strong independence assumptions concerning the feature variables involved, through the slightly more sophisticated TAN classifier which allows restricted dependences between the feature variables, to full Bayesian networks modelling the intricate dependence structure that actually holds in an application domain. Classifiers of varying complexity can thus be modelled within a single framework. Even though it is known from theory that first-order logistic regression models perform at least as good as naive Bayesian classifiers for larger data sets, many researchers have reported comparable or even better performance of the Bayesian network classifier for smaller data sets [12,14]. In this paper we describe our first steps aimed at the adoption of a Bayesian network classifier in the domain of reproductive medicine. At our disposal we had a small data set from patients undergoing single embryo transfer upon in vitro fertilisation. From this data set, a first-order logistic regression model had been constructed for the problem of predicting ongoing pregnancy [15]. From the same data set, we also constructed a naive Bayesian classifier and studied its performance compared to that of the logistic regression model. Logistic regression models developed to support physicians in making patient-specific classification decisions, are typically evaluated using well-known concepts of clinical relevance. These concepts include the area under the ROC curve, or AUC, and the model’s sensitivity and specificity characteristics. The AUC gives an indication of quality, averaged over all possible threshold probabilities for assigning an instance to a particular class. For use in practice, a fixed decision threshold is chosen based upon knowledge of the consequences of misclassification. With this threshold, the model has an associated sensitivity and specificity, where the sensitivity is the percentage of true positives predicted by the model and the specificity is the percentage of true negatives. Bayesian network classifiers generally are not evaluated using the concepts of clinical relevance mentioned above, but using classification accuracy as an indication of quality instead, where classification accuracy refers to the percentage of instances that are correctly classified by the model. The importance of communicating clinical relevance of a constructed model should not be underestimated, however: concepts of clinical relevance help convey to the physician a detailed assessment of the quality and relevance of patient-specific decisions based upon the model. We feel that the limited acceptance of Bayesian network classifiers in the medical community can be attributed to at least some extent to the lack of associated indicators of clinical relevance. 
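As an illustration of these two concepts, the sensitivity and specificity of a probabilistic classifier at a fixed decision threshold can be computed as in the following Python sketch (an illustration of the concepts only, not of the models discussed in this paper; both classes are assumed to be present in the evaluation data).

def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity of the rule: predict positive if p >= threshold.

    probs: predicted probabilities of the positive class;
    labels: true classes, 1 for positive and 0 for negative.
    """
    tp = sum(p >= threshold and t == 1 for p, t in zip(probs, labels))
    fn = sum(p < threshold and t == 1 for p, t in zip(probs, labels))
    tn = sum(p < threshold and t == 0 for p, t in zip(probs, labels))
    fp = sum(p >= threshold and t == 0 for p, t in zip(probs, labels))
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    return sensitivity, specificity

The AUC, in contrast, averages the model's discriminative quality over all possible thresholds rather than fixing a single one.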
For Bayesian network classifiers to become more widely accepted, we feel more specifically that they should be better aligned with the medical contexts in which they are to be used. In this paper, we describe how failure to use concepts of clinical relevance can result in medically unacceptable Bayesian network classifiers. We further show that
Aligning Bayesian Network Classifiers with Medical Contexts
789
incorporating such concepts, both in constructing and in evaluating Bayesian network classifiers, can be instrumental in achieving a much better alignment. In fact, we can report comparable results for the logistic regression model and the naive Bayesian network classifier only after alignment. The paper is organised as follows. In Section 2, we review Bayesian network classifiers and compare them, theoretically, to logistic regression models. In Section 3 we describe our domain of application and the data that we had available for our alignment study; in addition, we describe the different concepts relevant for alignment. In Section 4 we elaborate on the incorporation of the concepts of clinical relevance in the processes of constructing and evaluating Bayesian network classifiers. The results for the naive Bayesian classifiers constructed from our data set are presented in Section 5. We end with our concluding observations in Section 6.
2 Bayesian Network Classifiers and Logistic Regression Stochastic classifiers in general provide for addressing problems in which an instance of a set of feature variables has to be assigned to a value of the class variable. These classifiers in essence establish the posterior probability distribution over the class variable given the instance; the class value to be assigned to the instance is determined from this distribution using a decision rule. While quite a number of stochastic classification paradigms exist [9], we focus in this paper on Bayesian network classifiers and logistic regression models. We begin by reviewing different types of Bayesian network classifier and compare them to logistic regression models. 2.1 Bayesian Network Classifiers Bayesian network classifiers build upon a Bayesian network for establishing posterior probability distributions over their class variable. Such a network is a concise representation of a joint probability distribution over the set of variables involved. For the purpose of classification, this set is divided into a set of feature variables, a singleton set with the class variable, and a set of intermediate, or hidden, variables. Bayesian network classifiers vary in complexity from general models posing no restrictions on the dependences between the variables, to very simple models with highly constrained dependency structures. Two well-known simple Bayesian network classifiers are the naive Bayesian classifier and the TAN classifier [4]. These models both assume an empty set of hidden variables. The naive Bayesian classifier in addition assumes mutual independence of the feature variables given the class variable; the TAN classifier, or tree augmented network classifier, allows a tree-like dependency structure over its feature variables. Because of their simplicity, naive Bayesian classifiers are being developed for a wide range of application domains and, despite their simplicity, often very good performance is reported [3,4]. Throughout this paper, we assume that the class variable Y is a binary variable, with a positive class value denoted by y and a negative class value denoted by y; ¯ we use y to refer to either class value. The set of feature variables is denoted by X; x is used to denote a specific instance of this set. The naive Bayesian classifier now models the
790
L.C. van der Gaag et al.
joint probability distribution Pr(X,Y ) over its variables in terms of parameters p(Xi | Y ) specified for its feature variables Xi ∈ X, and p(Y ) specified for the class variable Y . Its independence assumptions result in the following parametrisation: Pr(X,Y ) = p(Y ) · ∏i p(Xi | Y ) While full Bayesian networks are typically handcrafted, at least partially, Bayesian network classifiers are generally constructed automatically from a data set. Algorithms for this purpose use a measure for optimising the model’s quality; this measure is used, more specifically, for deciding upon the dependences to be included between the variables. For a naive Bayesian classifier, for example, only dependences between feature variables and the class variable are allowed; a TAN classifier allows the inclusion of dependences among feature variables as well. Examples of quality measures are a model’s accuracy and its minimum description length (MDL). The quality of a model in view of the data can only be established if the model is fully specified, that is, if it includes estimates for all numerical parameters involved. These parameters are estimated as simple frequency counts, which serve to maximise the log-likelihood of the model given the data. The quality measure that is used as an optimisation criterion upon constructing the model often is also exploited for comparing classifiers. Upon learning Bayesian network classifiers, the quality of a model is optimised not just by including appropriate dependences, but also by including only the most relevant feature variables. Data sets often contain more variables than are strictly necessary for the classification task at hand and the more or less redundant variables could result in an undesirable bias [8]. A feature-selection method carefully selects, from the data set, the variables that serve to improve the model’s quality the most. For this purpose, various methods exist [9]. Here, we focus on the so-called wrapper approach to feature selection and assume that a greedy forward-selection method is used for choosing the feature variables to be included. In this approach, feature variables are iteratively added to an initially empty model until its quality given the data no longer increases. Bayesian network classifiers use Bayes’ rule for establishing the posterior probability distribution Pr(Y | X) over their class variable: Pr(Y | X) =
Pr(Y, X) Pr(X | Y ) · Pr(Y ) = Pr(X) ∑y Pr(X | y ) · Pr(y )
The actual classification amounts to assigning an instance x , for a (sub)set of feature variables X ⊆ X, to a particular class, based upon the posterior distribution computed for the instance. The decision rule that is commonly used for this purpose is the winnertakes-all rule which, for a binary class variable, amounts to assigning the instance to the class whose posterior probability exceeds the threshold probability of 0.5. To conclude, if the performance of the constructed model is evaluated against the same data set as that from which the model is learned, its performance will tend to be overestimated as a result of overfitting the model to the data. To correct for this effect of overfitting and to estimate the model’s performance on unseen data, often ten-fold cross validation is used.
Aligning Bayesian Network Classifiers with Medical Contexts
791
2.2 Logistic Regression Logistic regression models are much more commonly used within the medical community than Bayesian network classifiers, even though there are quite a number of similarities between these types of model. A logistic regression model, like a Bayesian network classifier, is a model over a class variable Y and a set of feature variables X. The model captures the conditional probability distribution over the class variable directly as a function of the feature variables Xi ∈ X. Logistic regression models also range from simple models imposing a linear function on the feature variables, to more complex models involving higher-order terms to describe interactions between the feature variables. The first-order logistic regression model captures the conditional probability distribution Pr(Y | X) in terms of a linear function of the feature variables Xi ∈ X through Pr(y | X) = (1 + exp(−β0 − ∑i βi · Xi ))−1 in which βi denote the model’s parameters. Logistic regression models are always constructed automatically from data. The loglikelihood of the model given the available data then is maximised by obtaining appropriate estimates for the parameters βi . While for the parameters of a Bayesian network classifier, a closed-formula solution exists, the optimisation problem involved in finding the parameters for a logistic regression model does not have such a solution. The parameter values therefore are established using an iterative method. As for Bayesian network classifiers, furthermore, upon constructing a logistic regression model methods for feature selection and for correcting for the effect of overfitting are applied. A logistic regression model provides for directly computing the posterior probability Pr(y | x) for the positive class value given an instance x, by filling in the appropriate values for all the feature variables X. The decision rule used with the model again is based upon a threshold for this posterior probability. The value of this decision threshold is typically based upon knowledge of the consequences of the different types of misclassification in the domain of application. 2.3 A Theoretical Comparison Naive Bayesian classifiers and first-order logistic regression models essentially index the same set of conditional probability distributions, in the sense that for any combination of parameter values of a first-order logistic regression model there exists a combination of parameter values for a naive Bayesian classifier that describes the same distribution Pr(Y | X), and vice versa (provided that Pr(Y, X) is strictly positive) [11]. Yet, given a particular data set, naive Bayesian classification and logistic regression will | X), because the parametypically not result in the same estimated distribution Pr(Y ter values for the Bayesian network classifier are chosen so as to maximise the loglikelihood of the joint probability distribution over the variables whereas the parameter values for the logistic regression model are chosen so as to optimise the log-likelihood of the conditional distribution. If in learning a naive Bayesian classifier, its parameter values are computed iteratively so as to maximise the log-likelihood of the conditional distribution, called discriminative learning, the resulting model would in essence capture the same distribution as the first-order logistic regression model.
792
L.C. van der Gaag et al.
Several researchers have argued that discriminative learning is more appropriate for classification purposes than generative learning in which the log-likelihood of the joint distribution is maximised, since we are interested in predicting the class for a given instance and not in their joint probability [4]. Others, however, argue that such a conclusion may be premature [12]. For models fitted to infinite data sets, the asymptotic classification accuracy of a first-order logistic regression model is never smaller than that of a naive Bayesian classifier. This basically implies that given a large enough data set the naive Bayesian classifier will not outperform the first-order logistic regression model. The regression model would typically do better when the independence assumption underlying the naive Bayesian classifier does not hold in the data set, that is, when there are strong associations among the feature variables [1]. On the other hand, although naive Bayesian classification asymptotically converges to a lower accuracy, it does so significantly faster than logistic regression. For smaller data sets, therefore, naive Bayesian classifiers can be expected to outperform first-order logistic regression models, as has been largely confirmed experimentally [12]. Similar observations hold for TAN classifiers on the one hand and logistic regression models with interaction terms for pairs of feature variables on the other hand. The observations cannot be extended, however, to Bayesian network classifiers involving more complex dependency structures over their feature variables. Such Bayesian network classifiers index essentially different sets of conditional probability distributions than logistic regression models with higher-order interaction terms and as a consequence may theoretically as well as effectively outperform any such regression model.
3 The Medical Context In this section we describe the concepts of clinical performance relevant for our case study of aligning Bayesian network classifiers with a medical context. We further briefly introduce the domain of reproductive medicine in which we conducted the case study, together with the data set and logistic regression model that we had available. 3.1 Clinical Performance Stochastic classification models in medicine are often evaluated using concepts of clinical relevance such as the area under the ROC curve and the model’s sensitivity and specificity characteristics. Given the importance of these concepts for aligning Bayesian network classifiers with medical contexts, we briefly review them. A Receiver Operator Characteristic, or ROC, curve visualises the performance of a classification model by plotting its sensitivity against one minus its specificity for all possible values of the decision threshold t for the model’s decision rule; some example ROC curves are shown in Fig. 1. The decision threshold serves for classifying an instance x as belonging to class y only if the posterior probability Pr(y | x) computed for the instance is at or above the threshold. Given this threshold, the classification model has an associated sensitivity and specificity. The sensitivity of the model is the probability that it correctly classifies a positive instance, that is, it is the percentage of positive instances x+ for which the classifier predicts that Pr(y | x+ ) ≥ t. The model’s
Aligning Bayesian Network Classifiers with Medical Contexts
793
1
sensitivity
0.8 0.6 0.4 0.2
NB4 LR
0 0
0.2
0.4 0.6 0.8 1- specificity
1
Fig. 1. ROC curves for the logistic regression model (dashed) and the naive Bayesian network classifier (solid) with four variables each (prior to correction for overfitting)
specificity is the probability that it correctly classifies a negative instance, that is, it is the percentage of negative instances x− for which Pr(y | x− ) < t is predicted. The area under the ROC curve, or AUC, in essence measures the ability of a classification model to discriminate between distinct classes [5]. More specifically, it captures the probability that a randomly chosen positive instance x+ and a random negative instance x− are correctly ranked, that is, it is the percentage of such pairs of instances for which the classifier predicts that Pr(y | x+ ) > Pr(y | x− ). The area under the curve is not related to a particular threshold, but gives an indication of a classifier’s quality averaged over all possible decision thresholds for assigning an instance to a particular class. In view of a fixed decision threshold, the quality of the model is captured by a single point on the ROC curve, which implies a specific sensitivity and specificity. The area under the curve may then no longer be an appropriate indication of the model’s performance. 3.2 In Vitro Fertilisation In vitro fertilisation, or IVF, is an assisted reproductive technique of embryo transfer used to help infertile couples conceive a child. There are many factors that determine whether or not IVF treatment results in an ongoing pregnancy, which include the age of the patient, the quality of the embryo, and the receptivity of the patient’s uterus. To increase the probability of pregnancy, it used to be common practice for IVF programmes to transfer multiple embryos. With the increasing success of the treatment, however, multiple embryo transfer involves an increased risk of multiple pregnancy, associated with pregnancy loss, obstetrical complications, prematurity, and neonatal morbidity with long term damage. As a means of forestalling the risks associated with multiple pregnancy, single embryo transfer is now being used. Applying single embryo transfer without any selection based on patient characteristics and embryo quality, however, has been shown to lead to a reduced probability of pregnancy per transfer. A patient may then need to undergo multiple treatments for a pregnancy to persist.
794
L.C. van der Gaag et al.
Multiple treatments involve additional costs as well as physical and emotional discomfort for the patient. To guide appropriate use of single embryo transfer, therefore, a patient-specific assessment of the expected result of the transfer should be made. In a recent study, a prognostic model was developed for establishing the probability of an ongoing pregnancy after single embryo transfer [15]. The data used for constructing this logistic regression model were derived from a randomised controlled trial on the effectiveness of in vitro fertilisation, in which 201 women with an indication for IVF treatment were randomised to a mild stimulation protocol [6]. For constructing the model, a subset of these data was composed, including only the data from women with at least two embryos suitable for transfer. The resulting data set included the data of 152 women who underwent single embryo transfer; in 42 of these women (28%), the treatment resulted in an ongoing pregnancy. In the data set, patient characteristics, treatment details, and embryo-quality related factors are recorded. The feature variables include such patient characteristics as female age, previous pregnancy, cause and duration of infertility, and body mass index. Further independent variables are related to the treatment and include the number of dominant follicles, the number of oocytes retrieved, the proportion of fertilised oocytes, the duration of the stimulation, the amount of administered recFSH per retrieved oocyte, and endometrial thickness. The remaining feature variables are related to embryo quality and include the grade of fragmentation, whether there was a topquality embryo available for transfer, and whether there were embryos available for cryopreservation. The number of independent variables equals 17, of which 11 variables are continuous, 3 are binary, and 3 are multi-categorical. For two of the variables, data were not complete, with 4% and 6% of the values missing, respectively. For these variables, single imputation was used by filling in the predictive mean after regression on all other variables. The variable designated as the class variable in the data set captures whether or not single embryo transfer results in an ongoing pregnancy. We now briefly review the performance characteristics of the first-order logistic regression model constructed from the data. The model includes four feature variables, which was imposed as the maximum number of variables to be included. The variables of the model are the patient’s body-mass index, the total amount of administered follicle stimulating hormone, the number of retrieved oocytes, and whether there was a top-quality embryo available for transfer. The ROC curve for the model is shown in Figure 1; its area under the curve equals 0.68, or 0.60 after correcting for the effects of overfitting. Using a decision threshold of 0.2, the model has a sensitivity of 0.90, or 0.86 after correction, and a specificity of 0.37, or 0.14 after correction. These characteristics are summarised in Tables 1 and 2. Table 1. Performance characteristics of the first-order logistic regression model and of the naive Bayesian network classifier, with four selected variables each
logistic regression model naive Bayesian classifier
AUC (corrected) sensitivity (corrected) specificity (corrected) 0.68 (0.60) 0.90 (0.86) 0.37 (0.14) 0.85 (0.58) 0.95 (0.66) 0.55 (0.50)
Aligning Bayesian Network Classifiers with Medical Contexts
795
Table 2. The variables included in the first-order logistic regression model and in the naive Bayesian network classifier with four selected variables each logistic regression model – number of retrieved oocytes top-quality embryo available – administered follicle stimulating hormone body-mass index
naive Bayesian classifier duration of infertility (discretised) number of retrieved oocytes (discretised) top-quality embryo available endometrial thickness (discretised) – –
4 Aligning Bayesian Network Classifiers We recall that building stochastic classification models involves selecting a subset of the available feature variables; this subset is often construed using a greedy forwardselection approach, in which feature variables are iteratively added to an initially empty model until its performance no longer increases. Upon constructing a Bayesian network classifier, performance is often measured by classification accuracy, which refers to the percentage of correctly classified instances. In this section, we argue why classification accuracy is an unacceptable measure of performance in our application domain and show how concepts of clinical relevance can instead be used for this purpose. 4.1 Classification Accuracy and Its Problems The measure of classification accuracy refers to the probability of correctly classifying an arbitrary instance x of the feature variables involved, where x is considered correctly classified if the class y assigned to x corresponds to its true class value y∗ . We recall that the assignment of a class value depends upon the threshold value t that is used with the classifier’s decision rule: an instance x is assigned to the class y whenever Pr(y | x) ≥ t. Alternatively, a classifier’s accuracy can be interpreted as the percentage of randomly chosen pairs of a positive instance x+ and a negative instance x− , for which Pr(y | x+ ) ≥ t and Pr(y¯ | x− ) ≥ t. The winner-takes-all decision rule commonly employed by Bayesian network classifiers implies a decision threshold of t = 0.5. While the measure of classification accuracy is widely used within the Bayesian network community, it is hardly ever used with logistic regression models in medical contexts. To elaborate on why classification accuracy often is inappropriate for measuring performance in medicine, we begin by observing that the accuracy of a classifier is highly dependent upon the threshold value that is used with the classifier’s decision rule. If instances x are assigned to a class y whenever Pr(y | x) ≥ t, then changing the value of the threshold t will change the number of instances assigned to class y . As a result, the sensitivity and specificity characteristics of the classifier also change. By writing classification accuracy as a function of the decision threshold t: accuracy(t) = sensitivity(t) · p(y) + specificity(t) · p(y) ¯
796
L.C. van der Gaag et al.
two problems of using it as a performance measure become apparent [2]. The first problem is that its dependence on the choice of decision threshold makes classification accuracy an inappropriate measure of performance in general. In fact, the threshold of t = 0.5 that is implied by the winner-takes-all rule is defendable only if the prior distribution of the class variable is close to uniform. In our domain of in vitro fertilisation for example, this property does not hold: single embryo transfer results in an ongoing pregnancy in just 28% of the patients. For this reason, the smaller decision threshold of 0.20 for predicting ongoing pregnancy was chosen for the logistic regression model. Since the decision threshold generally is not a parameter of the learning process, it is questionable whether feature selection based upon classification accuracy as a performance measure would result in an acceptable model. Note that this particular problem of the measure of classification accuracy is just technical and could be resolved by fitting the choice of decision threshold to the prior class distribution [7]. The second problem of using classification accuracy as a measure of performance is that it assigns fixed importance weights to the sensitivity and the specificity of a model. Dictated by the prior probability distribution Pr(Y ) over the class variable, it assigns fixed weights to the costs of the two types of misclassification. A uniform class distribution, for example, entails that the costs of misclassification are independent of the predicted class. For many medical contexts, however, the consequences of false positive errors may be very different from those of false negative errors. Moreover, for non-uniform priors, either the sensitivity or the specificity is automatically weighted more heavily by the measure of accuracy, independent of any medical considerations. In our domain of in vitro fertilisation, for example, the prior distribution over the class variable would assign a higher weight to the model’s specificity, that is, to correctly predicting non-implantation, than to the sensitivity, that is, to predicting an ongoing pregnancy upon transferring a single embryo. Experts in reproductive medicine indicate however, that the consequences of acting upon a false negative prediction are more severe than those for false positive predictions. A high sensitivity therefore is considered more important than a high specificity. This second problem may very well be the reason why classification accuracy is not used as a measure of performance for logistic regression models in medicine. The inappropriateness of classification accuracy as a performance measure has been recognised in other domains as well. In the machine learning community, the area under the ROC curve is being used already for some time now as a measure both for comparing classifiers and for constructing them. Since the area under the curve is not dependent on the decision threshold chosen and is invariant to the prior distribution over the class variable, it is more generally applicable as a performance measure for classification models than classification accuracy. In fact, it has been shown that Bayesian network classifiers constructed to maximise AUC, provide better ranking and probability estimates for the instances to be classified, and in addition even score better on classification accuracy than those optimised for that purpose [10]. 
4.2 Clinical Alignment Bayesian network classifiers will only become an accepted alternative to logistic regression models in medicine, if their quality is at least comparable and is communicated in
Aligning Bayesian Network Classifiers with Medical Contexts
797
terms of clinical relevance. Although this observation does not necessarily affect the process of constructing Bayesian network classifiers, it has been recognised that if classifiers are evaluated using some quality measure, then it makes sense to optimise that very measure during automated construction. For this purpose, measures of clinical relevance have to be incorporated in the learning process of Bayesian network classifiers. Upon learning Bayesian network classifiers from data, the area under the curve can be readily included in a greedy forward-selection approach to feature selection. We recall that, with this approach, feature variables are iteratively added to an initially empty model until its performance no longer increases. When using classification accuracy as a performance measure, in each iteration, for each (remaining) feature variable, the increase it incurs in the classifier’s accuracy is computed, using the decision threshold of 0.5 of the winner-takes-all rule. When using the area under the curve for the optimisation criterion instead, we have to compute for each feature variable the increase it incurs in the classifier’s AUC. To this end, for each feature variable, the sensitivity sensitivity(t) and the specificity specificity(t) of the classifier, for n different values of the decision threshold t between zero and one, are determined. Note that for establishing these n points of the ROC curve, we have to compute the posterior distribution over the class variable only once. From the n points thus obtained, the area under the curve can be approximated by constructing trapezoids under the curve between every two consecutive points. It is readily shown that this approximation equals 1 ∑ sensitivity(ti ) + sensitivity(ti+1 ) · specificity(ti+1 ) − specificity(ti ) 2· i=1,...,n−1
where ti is the decision threshold that resulted in the ith sensitivity-specificity pair. We then select for inclusion in the classifier, the feature variable that results in the largest increase in AUC, if any.
5 Experimental Results Our first step into building a Bayesian network classifier for the domain of reproductive medicine has been to learn a collection of naive Bayesian classifiers from the available data. Based upon the theoretical results reviewed in Section 2, we could expect similar performance of the naive Bayesian classifier and the logistic regression model constructed from the data. In fact, since our data set is relatively small, we could even expect slightly better performance of the classifier. We used our Dazzle toolbox [13], for constructing various naive Bayesian classifiers. Before doing so, we had to discretise the continuous variables from the data set. For this purpose, knowledge was elicited from the domain experts who had been involved in the collection of the data. We would like to note that, since discretisation was done based purely on domain knowledge and was not aimed at optimising classifier performance, the resulting discretisation might not be the best situated for our classifiers. For the purpose of feature selection, we employed the greedy forward-selection approach outlined above, using the area under the curve for our optimisation criterion. In this section, we review the results that we obtained. For each constructed model, we report the area under the curve, as well as the sensitivity and specificity characteristics that result from using a decision threshold of 0.2 on the
798
L.C. van der Gaag et al.
entire data set; we further report corrected performance characteristics obtained using ten-fold cross validation. To allow for comparing the performance of the first-order logistic regression model constructed from the available data with that of our naive Bayesian classifiers in detail, we decided to construct classifiers with different numbers of feature variables. With a maximum of four variables, as was imposed on the regression model, the constructed naive Bayesian classifier includes the feature variables modelling the duration of the infertility, the number of retrieved oocytes, endometrial thickness and whether there was a top-quality embryo available for transfer. The area under the curve of this classifier equals 0.85, or 0.58 after correcting for the effects of overfitting. Using a decision threshold of 0.2, the classifier has a sensitivity of 0.95, or 0.66 after correction, and a specificity of 0.58, or 0.50 after correction. These characteristics are summarised in Tables 1 and 2. By comparing the characteristics after correction of the naive Bayesian classifier with those of the first-order logistic regression model, we find that the differences between their area under curve and their sensitivities are not significant; the specificity of the naive Bayesian classifier, however, is significantly higher than that of the regression model (using a Student t distribution with a significance level of α = 0.05). In addition to the naive Bayesian classifier with four variables, we constructed classifiers with fewer and with more variables. The results from all constructed classifiers are summarised in Table 3. With the restriction of a single feature variable, the constructed classifier includes just the duration of the infertility: the addition of this variable is found to increase the area under the curve of the initially empty classifier the most. When allowed a second feature variable, the learning algorithm includes the number of retrieved oocytes in addition to the duration of the infertility in the classifier. The feature variables modelling endometrial thickness and whether there was a top-quality embryo available for transfer are included as the third and fourth variable respectively. The fifth feature variable included in the model is the total amount of administered follicle stimulating hormone. If the inclusion of feature variables is continued until the classifier’s area under the curve no longer increases, a total of eight variables is included. In addition to the five variables mentioned above, also the feature variables modelling the grade of fragmentation of the embryo, the number of normally fertilised oocytes, and the patient’s age are included. The remaining variables are not included into the classifier since they in fact serve to decrease the classifier’s area under the curve. We note that upon constructing a naive Bayesian classifier, the contribution of each feature variable to the area under the curve is studied in view of the entire data set. The uncorrected AUC values reported in Table 3 therefore are the values used upon constructing the model. While the classifier’s area under the curve keeps increasing upon including a fourth and even further feature variables when the full data set is considered, the values that have been corrected for the effects of overfitting, also reported in Table 3, reveal a decrease in the expected area under the curve on unseen data. 
These observations support the conclusion that for our small data set selecting four or more feature variables would result in a naive Bayesian classifier that is overfitted to the data. When comparing the performance characteristics of the various constructed naive Bayesian classifiers, especially the corrected values for the area under the curve and the sensitivity suggest that the best classifier is the one that includes three feature variables.
Aligning Bayesian Network Classifiers with Medical Contexts
799
Table 3. Characteristics of naive Bayesian network classifiers with different numbers of variables # variables AUC (corrected) sensitivity (corrected) specificity (corrected) 0 0.50 (0.50) 1.00 (1.00) 0 (0) 1 0.69 (0.53) 0.93 (0.80) 0.31 (0.22) 2 0.76 (0.63) 0.93 (0.78) 0.45 (0.35) 3 0.80 (0.65) 0.90 (0.80) 0.55 (0.46) 4 0.85 (0.58) 0.95 (0.66) 0.58 (0.50) 5 0.86 (0.56) 0.93 (0.65) 0.63 (0.51) 8 0.89 (0.56) 0.95 (0.57) 0.67 (0.52)
Upon comparing the characteristics of this model with those of the first-order logistic regression model constructed from the data, we find that the differences between their area under curve and their sensitivity are not significant; the specificity of the naive Bayesian classifier, however, again is significantly larger than that of the regression model (using a Student t distribution with a significance level of α = 0.10). To conclude, we would like to illustrate the inappropriateness of using classification accuracy for measuring performance for our domain of application. We constructed yet another naive Bayesian network classifier from our data set with a maximum of four feature variables; for this classifier we used accuracy for the optimisation criterion. The corrected area under the curve of the classifier is 0.54. With the winner-takes-all rule, the corrected sensitivity is 0.13; the corrected specificity equals 0.84. We recall from Section 4 that experts in reproductive medicine indicate that a high sensitivity is more important than a high specificity. From this observation, we conclude that this accuracybased classifier would not exhibit medically acceptable performance as a consequence of its low sensitivity.
6 Concluding Observations While for many problems in medicine classification models are being developed, Bayesian network classifiers do not seem to have become as widely accepted within the medical community as logistic regression models. The advantages of Bayesian networks classifiers over logistic regression models in terms of a graphical representation, their ability to classify in the context of missing input values, and the possibility to model any dependence structure among their variables, in themselves clearly are not sufficient to gain acceptance in the medical community. To advance Bayesian network classifiers as alternatives to logistic regression models, we feel that it is important that comparison between the two types of model can be done in terms familiar to the medical community. In the medical domain, concepts of clinical relevance are used, such as the area under the curve and a model’s sensitivity and specificity. We have argued that for Bayesian network classifiers to become more widely accepted, they should be better aligned with their medical contexts and use these concepts of clinical relevance. Given an infinite data set and optimising accuracy, a naive Bayesian classifier cannot outperform a logistic regression model. Comparing a previously constructed regression model with a naive Bayesian network classifier for the problem of selecting
800
L.C. van der Gaag et al.
patients for single embryo transfer in reproductive medicine, we found that even for a small data set the naive Bayesian classifier can be outperformed by far by the logistic regression model, that is, if the former is constructed using classification accuracy as a performance measure. We have argued, however, that classification accuracy may not be appropriate for measuring performance in the medical domain. Serious problems are associated with using the measure for non-uniform distributions over the class variable and for unequal cost distributions over the different types of misclassification, which may in fact give rise to classification models of unacceptable medical behaviour. We have shown that concepts of clinical relevance can be readily taken into account upon constructing naive Bayesian classifiers from data. For our relatively small data set in reproductive medicine, we have shown that by doing so, naive Bayesian classifiers can result that exhibit at least comparable behaviour to logistic regression models. The promising results from aligning the simplest type of Bayesian network classifier to its medical context, have made at least our medical experts enthusiastic.
References 1. Anderson, J.A.: Logistic discrimination. In: Krishnaiah, P.R., Kanal, L.N. (eds.) Classification, Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics, vol. 2, pp. 169–191. North-Holland, Amsterdam (1982) 2. Bradley, A.P.: The use of area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 1145–1159 (1997) 3. Domingos, P., Pazzani, M.J.: On the optimality of the simple Bayesian classifier under zeroone loss. Machine Learning 29, 103–130 (1997) 4. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Machine Learning 29, 131–163 (1997) 5. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982) 6. Heijnen, E.M.E.W., Eijkemans, M.J.C., de Klerk, C., Polinder, S., Beckers, N.G.M., Klinkert, E.R., Broekmans, F.J., Passchier, J., Te Velde, E.R., Macklon, N.S., Fauser, B.C.J.M.: A mild treatment strategy for in-vitro fertilisation: a randomised non-inferiority trial. Lancet 369, 743–749 (2007) 7. Lachiche, N., Flach, P.A.: Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the Twentieth International Conference on Machine Learning, pp. 416–423. AAAI Press, Menlo Park (2003) 8. Langley, P., Sage, S.: Induction of selective Bayesian classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 399–406 (1994) 9. Larra˜naga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J.A., Arma˜nanzas, R., Santaf´e, G., P´erez, A., Robles, V.: Machine learning in bioinformatics. Briefings in Bioinformatics 7, 86–112 (2006) 10. Ling, C.X., Huang, J., Zhang, H.: AUC: a better measure than accuracy in comparing learning algorithms. In: Xiang, Y., Chaib-draa, B. (eds.) Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2003, pp. 329–341. Springer, Heidelberg (2003) 11. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, Chichester (1992)
Aligning Bayesian Network Classifiers with Medical Contexts
801
12. Ng, A.Y., Jordan, M.: On discriminative vs. generative classifiers: a comparison of naive Bayes and logistic regression. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 605–610. MIT Press, Cambridge (2002) 13. Schrage, M.M., Van IJzendoorn, A., Van der Gaag, L.C.: Haskell ready to Dazzle the real world. In: Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, pp. 17–26. ACM Press, New York (2005), http://www.cs.uu.nl/dazzle 14. Twardy, C.R., Nicholson, A.E., Korb, K.B., McNeil, J.: Epidemiological data mining of cardiovascular Bayesian networks. Electronic Journal of Health Informatics 1(1) (2006) 15. Verberg, M.F.G., Eijkemans, M.J.C., Macklon, N.S., Heijnen, E.M.E.W., Fauser, B.C.J.M., Broekmans, F.J.: Predictors of ongoing pregnancy after single-embryo transfer following mild ovarian stimulation for IVF. Fertility and Sterility 89, 1159–1165 (2008)
Assessing the Eligibility of Kidney Transplant Donors Francisco Reinaldo1 , Carlos Fernandes2 , Md. Anishur Rahman4 , Andreia Malucelli3 , and Rui Camacho4 1
FEUP, Universidade do Porto Rua Dr. Roberto Frias, sn, 4200-465 Porto, Portugal and UnilesteMG - Centro Universit´ ario do Leste de Minas Gerais GIC - Grupo de Inteligˆencia Computacional Av. Presidente Tancredo Neves n. 3500, Bairro Universit´ ario, CEP 35170-056 Coronel Fabriciano, MG, Brasil [email protected] 2 Pontifical Catholic University of Paran´ a - PUCPR, PostGraduate Programme in Health Technology - PPGTS R. Imaculada Concei¸ca ˜o,1155 215-901 Curitiba PR, Brazil [email protected] 3 Pontifical Catholic University of Paran´ a - PUCPR, PostGraduate Programme in Health Technology - PPGTS R. Imaculada Concei¸ca ˜o,1155 80215-901 Curitiba PR, Brazil [email protected] 4 FEUP, Universidade do Porto, Rua Dr Roberto Frias s/n, 4200-465 Porto, Portugal
Abstract. Organ transplantation is a highly complex decision process that requires expert decisions. The major problem in a transplantation procedure is the possibility of the receiver’s immune system attack and destroy the transplanted tissue. It is therefore of capital importance to find a donor with the highest possible compatibility with the receiver, and thus reduce rejection. Finding a good donor is not a straightforward task because a complex network of relations exists between the immunological and the clinical variables that influence the receiver’s acceptance of the transplanted organ. Currently the process of analyzing these variables involves a careful study by the clinical transplant team. The number and complexity of the relations between variables make the manual process very slow. In this paper we propose and compare two Machine Learning algorithms that might help the transplant team in improving and speeding up their decisions. We achieve that objective by analyzing past real cases and constructing models as set of rules. Such models are accurate and understandable by experts. P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 802–809, 2009. c Springer-Verlag Berlin Heidelberg 2009
Assessing the Eligibility of Kidney Transplant Donors
1
803
Introduction
Organ transplantation can bring many benefits for individuals suffering from a variety of incurable diseases. Transplantation is an alternative medical solution to restore damaged functions in patients, thereby offering improved quality of life [1]. In recent years surgeries in which an organ or tissue is transplanted have seen a major increase in Brazil. In 2005, 15,527 organ and tissue transplants were successfully performed. This represents an 18,3% increase over 2003 figures. This growth is a consequence of an increased awareness among the Brazilian population, the competent performance of teams and institutions authorized by the Ministry of Health and the regulation of the National Transplant System (NTS) in accordance with prevailing legislation [2]. However, whilst the number of transplants has increased, the number of clinical transplant teams remains the same and the excessive workload may reach critical levels, making it more difficult to select the best donor for a given receiver. Organ transplantation requires a high degree of certainty and special care when making decisions, since the receivers immune system may attack and destroy transplanted tissue (organ rejection). It is therefore necessary to find a donor who has the highest possible compatibility with the receiver, and thus reduce rejection [3]. This is not a straightforward task because a variety of immunological and clinical variables influence the receiver’s acceptance of the transplanted organ. These variables may not be examined on a one by one basis but rather in an holistic manner, whereby one variable can interfere with another, potentiating or not the influence of each variable on the immune system. There is a complex relation between these variables. This network of relations is not entirely understood by the medical doctors. Currently the process of analyzing these variables involves a careful study by the clinical transplant team. As this is a manual process, it is also a slow one owing to the number of variables and the need to establish the influence relationship between them. This comparison is detailed and slow because the specialist has to compare the receiver’s entire history with possible donors, which requires specialized knowledge acquired through previous examples and continuous studies in the domain area. The clinical team uses previous cases to improve their knowledge and understanding concerning the decision on if a transplant should be performed or not. In our study we consider the use of Machine Learning (ML) algorithms to help the clinical transplant team. In the line of the human experts procedure the ML algorithms use a set of past cases (training examples) to construct models capable of correctly predict the outcome of unseen patient cases. In this paper we describe the use of a rule induction algorithm (CN2 [4,5]) and an Inductive Logic Programming (ILP) system[6] to construct the models that should help the transplant team decision process. We have used a database containing real information on previous transplants (receivers, donor and organ survival following transplant). The database used in the experiments is from the Pontifical Catholic University of Paran´ a, located in the city of Curitiba, in the state of Paran´ a in southern Brazil. The immunogenetics laboratory referred to is a national reference due to the volume of tests carried out and the quality of the techniques used. It is also a
804
F. Reinaldo et al.
reference for the training of specialists in the field of immunogenetics. The models constructed by the ML algorithms achieved 77% accuracy. This paper is structured as follows. Section2 presents a brief overview of kidney transplantation in Brazil, with emphasis on clinical and immunological evaluations. Rule induction and ILP techniques for extracting rules for correct decision making are explained in Section 3. Section 4 describes the experiments done and the results obtained. Section 5 concludes the paper and presents perspectives for future work.
2
Kidney Transplantation
Organ transplantation is a treatment procedure of great medical importance for thousands of people. Kidney transplantation is recommended in cases of chronic kidney failure at the terminal stage [7], when the receiver is undergoing dialysis or even at the pre-dialysis stage (pre-emptive transplantation) [8]. Transplantation can be offered to all kidney transplant candidates as an alternative aimed at reducing the incidence of vascular, cardiac, ocular and neurological complications inherent in diabetes [9]. Kidney transplantation is an important treatment option for the receiver both from the medical, and also from the social and economic point of view [10]. A Federal Decree regulates kidney transplants from live donors in Brazil [11]. The decree regulates Law No. 9,434, which makes provision for the removal of organs, tissues and parts of the human body for the purpose of transplantation and treatment. Pre-operation evaluations are necessary in order to verify the possibility of transplantation being performed and must take into consideration, for example, the duration of dialysis; the receiver’s history; blood characteristics (haemoglobin, coagulation, biochemistry, ABO and immunology) as well as a variety of other examinations (see [12] for a detailed description). 2.1
Clinical Evaluation
Nowadays there are few contraindications for kidney transplantation. The receiver’s age is no longer a reason for contraindication, as it was until relatively recently. Currently transplants are performed on newborn babies, including premature babies, and also on receivers more than 70 years old [10]. 2.2
Immunological Evaluation
Once the receiver has had all the clinical examinations, he or she then undergoes a series of immunological examinations. The purpose of these examinations is to assess the survival of the organ and the receiver for as long as possible, thus promoting the length and the quality of the receiver’s life, as well as avoiding wasting such a precious and scarce resource as an organ available for transplantation. In order to determine the histocompatibility between the receiver and the donor a study is made of the HLA (Human Leukocyte Antigens) complex,
Assessing the Eligibility of Kidney Transplant Donors
805
the principal characteristic of which is the high degree of polymorphism, which is found in a variety of cells [3]. Determining a receiver’s HLA is important for verifying the prospects of clinical recovery or whether the receiver will present chronic HLA associated diseases [13]. Currently a variety of sites perform analysis on HLA molecules owing to their conformation, using programs such as HLA Matchmaker which indicates acceptable mismatches between receiver and donor, principally in the case of receivers who are hypersensitive to transplantation [14]. Pre-operation immunological evaluation must also take into consideration antibody panel tests to assess the strength of the reaction and the characterization of the antibodies in circulation [15].
3
A Machine Learning Approach
In order to help the transplant team to better understand what makes a transplant successful, we have analyzed a data set of real cases using two Machine Learning (ML) algorithms. We have applied a propositional learner, CN2 to induce production rules, and an Inductive Logic Programming (ILP) system. The justification for our choice is based on the fact that we need a symbolic model to present to the expert (this excludes Artificial Neural Networks and statistical approaches). It is well known in Artificial Intelligence (AI) and psychology that production rules are a formalism that humans comprehend easily. We have also included ILP in the set because ILP is known to perform well with small data sets and to produce very simple models. We therefore compare the two alternatives in the construction of simple models to help the transplant team to understand better what makes kidney transplants successful. Apart from producing a comprehensible model ML algorithms may be used to indicate which are the most relevant variables to determine the outcome of transplantation procedures. A main objective of CN2[4] is to assist users in making correct decisions, enabling new or modified theories to be derived through observation. In this sense, CN2 tries to discover patterns in a data set and returns the “best” set of decision rules understandable by human beings. The CN2 Induction algorithm was developed at the Turing Institute as part of the Machine Learning Toolbox project. The algorithm was designed to be an efficient induction tool of simple decision rules in domains where there might be noise (errors due to transcription or errors due to an insufficient description language) [16,4]. This algorithm was previously built for a Unix platform but new improvements have also made CN2 available for Microsoft Windows platforms [17]. Since we are dealing with real data that has, possibly, some misclassified records and missing values, we need a robust ML tool such as CN2. Basically the CN2 mechanism works as follows: Observation of the Facts. Two data files are loaded to obtain and test the decision rules: a file with attribute and class information and; the examples file; Discovery of a Relationship between them. In an iterative way, the algorithm searches for the conditions that enable the “explanation” of a large
806
F. Reinaldo et al.
number of samples belonging to the same class. The concept description language is of the form: IF THEN predict . A is specialized by either adding a new conjunctive term or removing a disjunctive element in one of its selectors. The learning algorithm works in an iterative way, where for each new iteration the algorithm searches for a that examines a large number of samples in a unique class and few in other class. The system searches the by performing a pruned global-to-local search; Generalization. The system performs a general-to-specific search with pruning. When a is evaluated as good, the samples explained are removed from the training set and the rule IF THEN predict C is added to the end of the rule list. The last rule in the CN2 list is a “default rule” that classifies the all new samples based on the frequent class. This process is repeated until the satisfactory no longer exists. The latest improvements to the rule syntax include the possibility of expressing a rule in the form: IF ...THEN ...ELSEIF ...THEN ..., called rule lists [18]. The important point is that the semantics of each individual rule depend on the previous ones, i.e., to trigger a rule, all the previous rules must have failed. This feature of the rule list excludes the possibility of a clash during the classification process but makes it more difficult to be interpreted by humans. Inductive Logic Programming is a major field in Machine Learning with important applications in Data Mining. The fundamental goal of a predictive ILP system is to construct models (usually called hypotheses) given background knowledge B and observations (usually called examples in the ILP literature) E. It is usual to have two kinds of examples: positive examples (E + ) (instances of the target concept) and negative examples (E − ) (used to avoid overgeneralization). A characteristic of ILP is that both data and models (hypotheses) are expressed in a subset of First Order Logic providing and expressive language to encode data and hypotheses. The problem that a predictive ILP system must solve is to find a consistent and complete model H, in the form of a set of clauses, i.e., find a set of clauses that explain all given positive examples, while being consistent with the given negative examples. Since it is not usually obvious which set of clauses should be considered for inclusion in a theory, an ILP system must search through a very large space of hypotheses to find a set with the desired properties. See [19,20] for a detailed survey on ILP. In our experiments we used a well known ILP system called Aleph [6].
4 4.1
Experiments Experimental Settings
The undertaken experiments address two questions: (Question 1) are the expert provided discretization values adequate for a propositional learner to achieve
Assessing the Eligibility of Kidney Transplant Donors
807
good performance? (Question 2) can we provide an algorithm that does not require discretization of values and produces a simple and accurate model? To answer these two questions we used a data set of 106 real cases of kidney transplant characterized by 12 attributes, as shown in Table 1. The original transplant problem is a regression one, where the class value is the number of months the transplanted patient survived. We converted the problem into a two-class problem by establishing a threshold on the numerical class values: above 60 months the transplantation was considered successful, and below that value it was considered not successful. The threshold value was provided by an expert. With the specified threshold there were 29 successful transplants and 79 unsuccessful ones. We conducted two sets of experiments to answer the questions stated above. In the first set we used the advice of an expert and discretized all numerical attributes. We used CN2 with a 5-fold cross validation to analyze that data set. We tuned the CN2 parameters and the best results were obtained with a star of size 15. In the second set of experiments we used the Aleph ILP system to analyze the raw data. Again we used a 5-fold cross validation procedure. We also tuned the Aleph system's minpos, noise and minaccuracy parameters. The final theory was constructed using the following settings: noise = 4, minaccuracy = 0.8 and minpos = 4.

Table 1. The attributes of the transplant data set file

Attribute               type/possible values
patient gender          male, female
donor gender            male, female
relationship            brother, son, uncle, mother, father, unrelated
patient age             integer
donor age               integer
compatibility abo       yes, no, not informed
compatibility hla a     one compatibility, none, two compatibilities
compatibility hla b     one compatibility, none, two compatibilities
compatibility hla dr    one compatibility, none, two compatibilities
crossmatch              positive, negative
pregnancies patient     integer
pregnancies donor       integer
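For illustration, here is a minimal sketch of the class construction and evaluation protocol described above. CN2 is not available as a standard Python library, so a decision tree is used as a stand-in propositional learner; the file name, the column names and the use of scikit-learn are assumptions for illustration, not part of the original CN2/Aleph setup.

```python
# Hypothetical sketch: binarize survival months at the expert threshold (60)
# and evaluate a propositional learner with 5-fold cross validation.
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for CN2

data = pd.read_csv("transplants.csv")             # hypothetical file and columns
y = (data["survival_months"] > 60).astype(int)    # 1 = successful transplant
X = pd.get_dummies(data.drop(columns=["survival_months"]))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print("mean accuracy: %.3f" % scores.mean())
print("majority-class baseline: %.3f" % max(y.mean(), 1 - y.mean()))
```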
4.2 Experimental Results
The results are shown in Table 2. The first line concerns the use of CN2 with the expert-provided discretization values. The results achieved by CN2 are below those of a naive classifier that uses the majority class. This result suggests that we should not rely on expert information concerning the discretization of attribute values: we should either use the raw data or use some ML method to discretize the values. The answer to Question 1 is therefore negative: we did not obtain good performance using the expert-suggested discretization values.
[Rule 1] [Pos cover = 9  Neg cover = 3]
survived(A) :- patientAge(A,B), lteqAge(B,25), donorGender(A,m).
[Rule 3] [Pos cover = 4  Neg cover = 2]
survived(A) :- patientExamDR(A,1), compatibility(A,sim), donorRelationship(A,'d.cadaver').
[Rule 5] [Pos cover = 10  Neg cover = 4]
survived(A) :- patientTxDate(A,B), patientConditionDate(A,C), diffDate(B,C,D).
[Rule 7] [Pos cover = 8  Neg cover = 3]
survived(A) :- patientAge(A,B), lteqAge(B,23), donorNpregnancies(A,0).
[Rule 8] [Pos cover = 4  Neg cover = 0]
survived(A) :- patientGender(A,f), donorAge(A,B), gteqAge(B,50).
Fig. 1. Rules constructed by Aleph
Another negative result obtained with the first set of experiments is that the model is not simple (34 rules). On the other hand, we can answer Question 2 positively: we can propose an algorithm that uses the raw data and improves over the naive classifier that uses the majority class. Aleph not only achieved a better performance but also produced a very simple model with just 5 rules (see Figure 1). A careful analysis of the rules indicates that there is still a lot to improve. Some of the rules (like rules 1 and 8) would not appear in a larger sample; they have a fair coverage possibly because of coincidences in the small number of positive cases. From the expert's point of view the only interesting rule is rule 3, which uses blood type information and exam values.

Table 2. Accuracy (Acc.) and number of rules (n. rules) of the constructed models. The majority class predictor achieves 74.5%.

Question   Algorithm   Acc.    n. rules
Q1         CN2         72.6%   34
Q2         Aleph       77%     5
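As a small sanity check of the comparison in Table 2 (using only the accuracies and the baseline reported above; the snippet itself is illustrative, not part of the original evaluation):

```python
# Compare the reported accuracies (Table 2) against the majority-class baseline (74.5%).
baseline = 0.745
for name, acc, n_rules in [("CN2", 0.726, 34), ("Aleph", 0.77, 5)]:
    verdict = "improves over" if acc > baseline else "falls below"
    print(f"{name}: accuracy {acc:.1%} with {n_rules} rules ({verdict} the {baseline:.1%} baseline)")
```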
5 Conclusions and Future Work
In the work reported in this paper we addressed a complex and very useful problem: helping a kidney transplant team to improve the success rate of organ transplants. We applied ML algorithms to real cases of kidney transplant and constructed a very simple model with a significant accuracy improvement over a naive classifier. However, we still need to considerably improve the usefulness of the model in order for it to be helpful to the transplant team. For that purpose we will collect more data.
We also have to include further constraints in the ILP system in order to cover a larger number of relations that the expert finds informative.
Acknowledgements

This work has been partially supported by the project ILP-Web-Service (PTDC/EIA/70841/2006) and by Fundação para a Ciência e Tecnologia. Francisco Reinaldo is funded by FCT grant SFRH/BD/17051/2004. Md. Anishur Rahman is funded by an FCT grant from project PTDC/BIA-PRO/72838/2006.
References

1. Altman, L.: Distúrbios imunes transplante (disturb immune transplant). Manual Merck (2006), http://www.msd.brazil.com
2. SNT: Sistema nacional de transplantes (National Transplant System) (2007)
3. Abbas, A.: Imunologia Celular e Molecular (Cellular and Molecular Immunology), 3rd edn. Revinter (2000)
4. Clark, P., Boswell, R.: Rule induction with CN2: Some recent improvements. In: Kodratoff, Y. (ed.) EWSL 1991. LNCS, vol. 482, pp. 151–163. Springer, Heidelberg (1991)
5. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3(4), 261–283 (1989)
6. Srinivasan, A.: Aleph manual
7. Fitzwater, D.: The outcome of renal transplantation in children without prolonged pre-transplant dialysis. Clinical Pediatrics 30, 148–152 (1991)
8. Wolfe, R.A.: Comparison of mortality in all patients on dialysis awaiting transplantation, and recipients of a first cadaveric transplant. The New England Journal of Medicine 314, 1725–1730 (1999)
9. Berthoux, F.: Pre-emptive renal transplantation in adults aged over 15 years. Nephrology Dialysis Transplantation 11, 41–43 (1996)
10. Noronha, I.L.: Diretrizes em transplante renal (guidelines on renal transplantation) (2006), http://www.sbn.org.br/Diretrizes/tx
11. DF: Decreto federal num. 2.268 (federal decree issue 2,268) (June 30, 1997)
12. Baptista-Silva, J.: Transplante renal: cirurgia no receptor: adulto (2003), http://www.lava.med.br/livro
13. Antunes, L.: Imunologia Geral (General Immunology). Atheneu (1999)
14. Duquesnoy, R.: HLAMatchmaker: a molecularly based algorithm for histocompatibility determination. Human Immunology 63, 339–352 (2002)
15. Fernando, M.: HLAMatchmaker: a molecularly based algorithm for histocompatibility determination. Revista Brasileira de Cirurgia Cardiovascular (Brazilian Journal of Cardiovascular Surgery) 16 (2001)
16. Clark, P., Niblett, T.: Induction in noisy domains. In: Progress in Machine Learning – Proceedings of EWSL 1987: 2nd European Working Session on Learning, pp. 11–30 (1987)
17. Reinaldo, F., Siqueira, M.: CN2 for Microsoft Windows XP (2006)
18. Rivest, R.L.: Learning decision lists (1987)
19. Muggleton, S.: Inductive logic programming. In: Proceedings of the 1st Conference on Algorithmic Learning Theory, pp. 43–62
20. Muggleton, S., Raedt, L.D.: Inductive logic programming: Theory and methods. Journal of Logic Programming 19(20), 629–679 (1994)
Lung Nodules Classification in CT Images Using Simpson's Index, Geometrical Measures and One-Class SVM

Cleriston Araujo da Silva1, Aristófanes Corrêa Silva1, Stelmo Magalhães Barros Netto1, Anselmo Cardoso de Paiva1, Geraldo Braz Junior1, and Rodolfo Acatauassú Nunes2

1 Federal University of Maranhão - UFMA, Av. dos Portugueses, SN, Campus do Bacanga, Bacanga 65085-580, São Luís, MA, Brazil
[email protected], [email protected], [email protected], [email protected], [email protected]
2 State University of Rio de Janeiro - UERJ, São Francisco de Xavier, 524, Maracanã 20550-900, Rio de Janeiro, RJ, Brazil
[email protected]
Abstract. In this paper we present the Simpson's Index, a feature used in Spatial Analysis and in Biology, specifically in Ecology, to determine the homogeneity or heterogeneity of a certain species. This index, combined with three geometrical features, is investigated as a promising feature for the characterization of lung nodules as benign or malignant, since little work has been done on the application of such features to the analysis of medical images. Using One-Class SVM for classification we obtained a sensitivity of 100%, a specificity of 100% and an accuracy of 100%.
Keywords: Lung Nodule, Classification, Simpson's Index, Geometrical Measures, Support Vector Machine.
1 Introduction
Lung cancer is the commonest type of cancer in the world and one of the most serious public-health problems in Europe and North America. Statistics from the NCI (National Cancer Institute) [1] estimate 215,020 new cases and 161,840 deaths in the United States in 2008. In Brazil, lung cancer occupies first place in cancer deaths among men and fourth among women. It is estimated that, in 2008, according to INCA [2], pulmonary neoplasm will be responsible for 27,270 deaths (17,810 men and 9,460 women). These numbers correspond to an estimated risk of 19 new cases for each 100 thousand men and 10 for each 100 thousand women. This disease is one of the biggest human-health problems nowadays, and the use of tobacco is still the main risk factor. The average five-year survival
varies between 13% and 21% in developed countries and between 7% and 10% in emerging countries. Much research has been done in order to improve the understanding of this disease, aiming to discover its origins, detect it in its initial stages, develop methodologies to detect it and propose more efficient ways of treatment with fewer consequences for the patient. One of the best chances to diagnose lung cancer is when an asymptomatic patient, normally a smoker, undergoes an X-ray or Computerized Tomography (CT) exam. Helical CT has been attracting attention because, besides revealing lung cancer at an early stage, often imperceptible on conventional chest radiography, it allows a three-dimensional reconstruction of the image, almost always of nodular nature [3]. Using information obtained from medical images, computer-aided systems (CAD - Computer-aided Detection or CADx - Computer-aided Diagnosis) have been developed to help in nodule diagnosis, from its detection to a possible classification of its nature as benign or malignant, thus giving a second opinion to the radiologist and other specialists about the analysis of the exams.
2 Related Works
In image processing, many techniques have been developed for the task of diagnosing via images. The main emphasis is still the evaluation of the morphology of the nodules. Shah et al. [4] investigated the use of a CADx system in the task of distinguishing malignant and benign nodules using volumetric data of the acquired nodules before and after contrast injection. They classified the nodules using logistic regression and quadratic discriminant analysis, with ROC-based results of 0.92 and 0.69, respectively. El-Baz et al. [5], who recently proposed a lung nodule diagnosis through the evaluation of the growth rate of the nodules over a certain period, using 3-D volume registration to quantify it, obtained 100% accuracy at a 95% confidence level. Sousa [6] evaluates a set of three geometrical features as a form of distinguishing between nodules and non-nodules using a Support Vector Machine (SVM) as classifier. The results achieved were 100% of correct classification. In our work, we classify the nodule as benign or malignant. Texture analysis has been applied as a quantitative technique to provide differential information between single lung nodules. As an example, Silva [7] presents the possibility of using texture features to distinguish lung nodules with geo-statistical functions such as the semivariogram, covariogram, correlogram and semimadogram, and obtained results of 80% accuracy. Liang et al. [8] proposed a CADx system based on neural networks, in which image registration techniques were used to extract the regions of interest and texture features were then obtained, considering the spatial dependence of the gray levels, to train and test the neural network. For the 20 patients analyzed, the results were 100% accuracy for the 15 training sets and 100% for the 5 test sets. Petkovska et al. [9] present an approach that combines the
registration of images and the extraction of texture features from CT images with iodized contrast. Silva [10] investigated other geo-statistical functions, such as Moran's index and Geary's coefficient, and obtained good accuracy results, above 90%, with a sensitivity of 96.55%. In another work, Silva [11] evaluated the use of Ripley's K function as a discriminant nodule measurement and obtained 97.4% accuracy and 90% sensitivity. In this paper, we propose a methodology for lung nodule diagnosis through the use of a texture feature of bio-statistical character combined with 3D geometric measures of the nodules, which will aid the physician in the evaluation of the diagnosis. The work is organized in the following way: Section 3 describes how the images were obtained, the process of segmentation of the nodules in the images, the use of the Simpson's Index as a discriminant texture measurement in the diagnosis of lung nodules, the geometric features used, and the use of One-Class SVM as the classification technique. In Section 4 the results are presented and discussed. In Section 5 we present the final remarks.
3 Methodology
The methodology proposed in this paper aims to classify single lung nodules into two groups: benign and malignant. To perform this task, the methodology follows the steps shown in Fig. 1. The first step is the acquisition of the image, which is obtained from a patient's chest CT exam. Step 2 is the segmentation of the three-dimensional volume of the nodule using a region growing algorithm. Right after that, the representative features of the nodules are obtained through the Simpson's Index, that is, the texture analysis stage, combined with the geometric feature extraction. This index has not previously been used in medical image analysis applications for diagnosis. The last step is the classification of the nodules as benign or malignant by One-Class SVM. One-Class SVM was chosen because it has been little used in such applications.

3.1 Image Acquisition
In the image acquisition stage, the DICOM images were obtained through Computerized Tomography (CT) exams, authorized by the patients to be used in research. The exams were performed by the technical team of the Fernandes Figueira Institute and the Pedro Ernesto University Hospital, both situated in the state of Rio de Janeiro, Brazil. The equipment used in the acquisition was a Helical GE Pro Speed tomograph, under the following conditions: tube voltage of 120 kVp, tube current of 100 mA, image size of 512×512 pixels, average voxel size of 0.67mm × 0.67mm × 1.00mm and 12-bit quantization according to the DICOM storage standard [12]. A total of 39 nodules were obtained (30 benign and 9 malignant). It is important to note that the CT exams were performed without iodized contrast injection in the patients, which can be clinically used to improve diagnosis, but also causes morbidity (diseases) and occasional deaths
Fig. 1. Methodology Steps
due to allergic complications. We stress, still, that the diagnosis as benign or malignant was later confirmed by a specialist doctor through a histopathologic exam done on a sample of the nodule obtained by surgical removal, or through radiology after three years of stability. This explains the reduced size of the database. In our work, we consider the maximum size of the nodule as the maximum diameter of the sphere which encloses the farthest points in the xy plane or z axis. According to this definition, the average diameter of benign nodules is 23.72mm (standard deviation of 13.34) and the average diameter of the malignant ones is 40.93mm (standard deviation of 17.86). Some nodules have a diameter below or equal to 3mm. The data set we used contains malignant nodules with a larger average diameter than the benign ones, which is a normal occurrence: the general characteristic of malignant nodules being larger than benign ones is well known in the specialized literature [14]. But this does not mean that there is a cutoff diameter to distinguish benign nodules from malignant ones; in studies, only percentages are determined. In a review of studies of patients with lung nodules detected by screening or incidentally, the prevalence of malignancy is 6% to 28% in nodules with sizes from 5 to 10mm in diameter, and 64% to 82% in nodules with sizes superior to 20mm in diameter [13]. In our work, the smallest malignant nodule has 12mm in diameter and the smallest benign one has 7mm, but there are malignant nodules with, for example, 28 and 29mm diameters, while there are benign nodules with 35 and 36mm diameters. Although the diameter is a general reference and is included in the Bayesian method to help distinguish benign nodules from malignant ones, it does not have a decisive value by itself [14]. The Simpson's Index, which adds 3D texture data, aims to provide a new contribution to this scenario.
3.2 Lung Nodule Segmentation
A semi-automatic segmentation process was performed by specialist doctors using a lung nodule analysis system, presently called Bebúi [15]. In this system, the automatic stage uses a 3D region growing algorithm with voxel aggregation [16], which also gives the user greater interactivity and control over the segmentation by letting the required parameters be set, such as thresholds, initial and final slices, and the seed. Besides that, two additional tools were added, barrier and eraser, to provide more control over the segmentation procedure. The barrier is a cylinder placed around the nodule by the user with the purpose of restricting the region of interest and avoiding the invasion of other lung structures during the aggregation of voxels. The eraser is a system resource that allows doctors to erase unwanted structures, before or after segmentation, avoiding or correcting segmentation errors [15].
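A minimal sketch of 3D region growing by voxel aggregation is given below; it is illustrative only and not the Bebúi implementation, and the intensity-interval stopping criterion, 6-connectivity and function name are assumptions.

```python
# Illustrative sketch of 3D region growing by voxel aggregation:
# starting from a seed, 6-connected neighbours whose intensity lies
# within [lo, hi] are aggregated into the nodule mask.
import numpy as np
from collections import deque

def region_growing_3d(volume, seed, lo, hi):
    mask = np.zeros(volume.shape, dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
            n = (z + dz, y + dy, x + dx)
            if all(0 <= n[i] < volume.shape[i] for i in range(3)) and not mask[n]:
                if lo <= volume[n] <= hi:
                    mask[n] = True
                    queue.append(n)
    return mask
```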
3.3 Simpson's Index
In this work, the analysis of texture is done through the aggregation relation of the individual points (voxels), computed through the Simpson's Index, which is discussed in this section. Created in 1949, the Simpson's Index is a second-order statistical spatial feature that has been used by ecology specialists to determine the biodiversity of species in a region [17]. Its main functionality is to summarize the representation of this diversity in a single value capable of qualifying the region as very heterogeneous or uniform. The Simpson's Index takes into consideration the richness of the species, that is, the number of species present in an area, and also the evenness of such species, which is a measure of the relative abundance of each species [18,19]. With these considerations it is possible to analyze which community in a region is more diversified. The Simpson's Index is the measurement of the probability of two individuals, randomly selected from a sample, belonging to the same species i among the j species existing in the sample, as in Equation 1 [20]:

D = \sum_{i=1}^{j} p_i^2    (1)

where p_i = n_i / N. For each species i, p_i is the probability of occurrence of species i, n_i is the number of individuals of species i, and N is the total number of individuals in the sample. The index is normally used according to Equation 1 when the sample is obtained by a sampling process and it is not possible to determine exactly the number of individuals in the sample. For a finite sample, where the total number of individuals is known, the Simpson's Index can instead be obtained through Equation 2 [21]:

D = \frac{\sum_{i=1}^{j} n_i (n_i - 1)}{N (N - 1)}    (2)
The values obtained for the Simpson's Index lie in the interval from 0 to 1, where the value 0 represents infinite diversity in the sample and 1 means that there is no diversity. Our proposal is the extraction of the Simpson's Index as a texture measurement, taking each voxel found in the volume as a possible species under analysis. The distinction may be made possible by the morphological behavior of each kind of nodule: benign nodules present, in most cases, a round or well-defined shape, while malignant nodules, due to their capability of spreading to other organs, present a spiculated or less defined shape. As the index is obtained over areas of interest, the small occurrence of voxels in a certain area of interest can be related to the shape of the nodule. This way, benign nodules tend to show a more homogeneous, that is, less diversified behavior in a certain region of study.

3.4 Geometrical Measures
The shape of a lung nodule may represent an important indicator of its malignancy or benignity, as we said before. With geometrical measures it is possible to extract and analyze further information, whether or not already identified by doctors. In this work, three 3D geometric features are extracted from each nodule in our database: Spherical Disproportion, Spherical Density and Sphericity. Spherical Disproportion is described in Equation 3, Spherical Density in Equation 4 and Sphericity in Equation 5. Other information about these measurements can be found in [6].

S = \frac{A}{4 \pi R^2}    (3)

\varphi = \frac{100 \cdot n}{V}    (4)

E = \frac{\pi^{1/3} (6V)^{2/3}}{A}    (5)
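A hedged sketch of the three measures is shown below, assuming A is the nodule surface area, V its volume, R the radius of a volume-equivalent sphere and n the number of nodule voxels; these interpretations of the symbols, and the crude voxel-counting estimates of A and V, are assumptions, since the full definitions are given in [6].

```python
# Illustrative computation of the three shape measures from a binary nodule mask.
# The voxel-counting estimates of volume and surface area are rough assumptions.
import numpy as np

def shape_measures(mask, voxel_volume=1.0, voxel_face_area=1.0):
    n = int(mask.sum())                       # number of nodule voxels
    V = n * voxel_volume                      # volume estimate
    # crude surface estimate: count exposed voxel faces along each axis
    # (faces lying on the array border are ignored in this sketch)
    faces = 0
    for axis in range(3):
        faces += np.count_nonzero(np.diff(mask.astype(np.int8), axis=axis))
    A = faces * voxel_face_area
    R = (3.0 * V / (4.0 * np.pi)) ** (1.0 / 3.0)                 # volume-equivalent sphere radius
    spherical_disproportion = A / (4.0 * np.pi * R ** 2)         # Eq. (3)
    spherical_density = 100.0 * n / V                            # Eq. (4)
    sphericity = (np.pi ** (1.0 / 3.0)) * (6.0 * V) ** (2.0 / 3.0) / A  # Eq. (5)
    return spherical_disproportion, spherical_density, sphericity
```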
3.5 Support Vector Machine
The Support Vector Machine (SVM), introduced by V. Vapnik in 1995, is a method to estimate a function that classifies data into two classes [22]. The basic idea of the SVM is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. The term SVM comes from the fact that the points in the training set which are closest to the decision surface are called support vectors. SVM achieves this through the structural risk minimization principle, which is based on the fact that the error rate of a learning machine on test data is bounded by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension [23].
Statistical data classification can be mapped into a one-class, two-class or multi-class problem. In two-class classification, data from both classes are available, and it is assumed that the dataset contains balanced class samples; an unbalanced dataset could lead to poor results [24]. A common problem with this approach is that the decision boundary created by a two-class SVM can yield a large misclassification rate if the class samples are not well separable. A one-class SVM constructs a classifier only from a set of labeled positive patterns, called positive training samples. One-class SVM performs better than two-class SVM in separating the target data from the others because it generates a decision hypersphere limiting just one class from the others that could exist [25]. The strategy is to map the data into the feature space and then use a hypersphere to describe the data and put most of them into the hypersphere. Therefore the methodology has to know only one class, and unbalanced datasets can be used without performance problems.

One-Class SVM. The One-Class Support Vector Machine (SVM) was proposed by Schölkopf for estimating the support of a high-dimensional distribution [26]. Given a training dataset without any class information, the One-Class SVM constructs a decision function that takes positive values in a small region capturing most of the data points, and negative values elsewhere [27]. The process starts with a training set of points:

x = \{ x_i \mid i = 1, 2, 3, \ldots, l \}    (6)
where l is the number of samples. Consider that each sample x in the input space is mapped to a point z = φ(x) of a higher dimensional space. The goal is to separate, in the feature space, the data from the origin with maximum margin [28]. Given the training vectors x_i without any class information, the primal form in [26] is:

\min_{w, \xi, \rho} \; \frac{1}{2} w^T w - \rho + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad w \cdot \phi(x_i) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, l    (7)
where w is the normal vector of the hyperplane which represents the decision boundary, ρ represents the threshold of function errors, and ξ_i are slack variables that penalize training errors. The trade-off between the radius of the hypersphere and the number of training samples that it can hold is set by the parameter ν ∈ [0, 1]. The dual form is:

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le 1, \;\; i = 1, \ldots, l, \quad e^T \alpha = \nu l    (8)
where Q_{ij} = K(x_i, x_j) ≡ φ(x_i)^T φ(x_j). The kernel function K maps the input space into the feature space. The decision function is:

f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i K(x_i, x) - \rho \right)    (9)
Classification of a new data point x is performed by computing the sign of the right side of Equation 9. An important family of kernel functions is the Radial Basis Function, more commonly used for pattern recognition problems, which has been used in this paper, and is defined by:

K(x, y) = e^{-\gamma \| x - y \|^2}    (10)
where γ > 0 is a parameter that is also defined by the user.
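The authors used LIBSVM [28]; the sketch below shows the same ν-parameterized One-Class SVM with an RBF kernel through scikit-learn's OneClassSVM, which wraps LIBSVM. The feature matrices and the nu/gamma values are placeholders, not the paper's settings.

```python
# One-Class SVM with an RBF kernel (Equations 7-10), trained only on one class.
# Feature matrices and the nu/gamma values are illustrative placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))            # e.g. Simpson's Index + 3 shape measures
X_test = np.vstack([rng.normal(size=(5, 4)),               # samples like the training class
                    rng.normal(loc=4.0, size=(5, 4))])     # samples far from it

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)  # nu plays the role of v in Eq. (7)
clf.fit(X_train)
print(clf.predict(X_test))   # +1 = inside the learned region, -1 = outlier
```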
3.6 Validation of the Classification Method
In order to evaluate the power of the methodology to characterize the proposed groups, we computed the sensitivity (Se), specificity (Sp) and accuracy (Ac) measurements for all the analyses performed in the study. Sensitivity is given by TP/(TP + FN), specificity is obtained by TN/(TN + FP), and accuracy is given by (TP + TN)/(TP + TN + FP + FN), where TP is true-positive, TN is true-negative, FP is false-positive and FN is false-negative. In this way, the malignant lung nodules correctly classified are reported as true positives.
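A direct transcription of the three formulas above into code; the confusion-matrix counts passed in are hypothetical (chosen only to be consistent with a 39-nodule database), not results from the paper.

```python
# Sensitivity, specificity and accuracy from confusion-matrix counts.
def validation_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Hypothetical example: 8 malignant correctly flagged, 1 missed, 2 false alarms, 28 correct negatives.
se, sp, ac = validation_metrics(tp=8, tn=28, fp=2, fn=1)
print(f"Se={se:.1%}  Sp={sp:.1%}  Ac={ac:.1%}")
```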
4 Results
We performed the extraction of the texture features applying two different forms of analysis. In the first one, the features were obtained considering areas of interest formed by concentric circles, as shown in Fig. 2. Each area of interest represents an experiment for the method. In the second analysis, the area of interest was represented by circular rings, as in Fig. 3. Our objective with these forms of extraction is to evaluate which one is the best way of using the texture feature: the analysis by circles aims to determine differences in the diversity patterns of benign and malignant nodules in each region of study from the nodule bounds to its center, while the analysis by rings aims to evaluate the diversity in the border regions determined by two concentric circles. We determined the size of the circles by finding the central point (mass center) of each nodule and then calculating the distance from this central point to the most distant point of the nodule. This way, we get a radius R that represents the largest possible measurement for the construction of a circle or, in the analysis by rings, the maximum allowed radius. From the radius R, we obtained the other radius values as 1/6R, 1/3R, 1/2R, 2/3R and 5/6R. These are represented as R1, R2, R3, R4, R5 and R6 (the value of R).
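A sketch of how the concentric regions could be derived from a nodule mask is given below; treating the analysis slice-wise in 2D, and the way the five rings are formed between consecutive radii, are assumptions about details the paper does not spell out.

```python
# Build concentric circular masks R1..R6 around a nodule's mass center (2D slice sketch).
import numpy as np

def concentric_masks(nodule_mask_2d):
    ys, xs = np.nonzero(nodule_mask_2d)
    cy, cx = ys.mean(), xs.mean()                 # mass center of the nodule
    R = np.hypot(ys - cy, xs - cx).max()          # distance to the farthest nodule point
    yy, xx = np.indices(nodule_mask_2d.shape)
    radius_map = np.hypot(yy - cy, xx - cx)
    fractions = [1/6, 1/3, 1/2, 2/3, 5/6, 1.0]    # R1 .. R6
    circles = [radius_map <= f * R for f in fractions]
    rings = [outer & ~inner for inner, outer in zip(circles, circles[1:])]  # 5 rings
    return circles, rings
```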
Fig. 2. Analysis applied to the nodule by means of concentric circles containing 6 radii
Fig. 3. Analysis applied to the nodule by means of circular rings containing 6 external radii
Next, the Simpson's Index of Equation 2 was calculated in each region for a certain radius or ring. We made use of this form of the index because we have exact, quantitative knowledge of the total number of individuals in the sample, that is, the total number of voxels in each nodule. In order to increase the discriminatory power of the methodology, we obtained the geometric measurements of the nodules reported in Section 3.4. Then, we performed the classification considering the Simpson's Index extracted for each radius, in the analysis by circles, and for each ring, in the analysis by rings, aggregating the geometric measurements to each analysis. A library for Support Vector Machines, called LIBSVM [28], was used for training and testing the One-Class SVM classifier. During the classification step, four different proportions of the training and test subgroups were used: 50/50, 60/40, 70/30 and 80/20, where the first number represents the percentage of cases used in training (Tr) and the second the percentage of cases used in testing (Te). The cases used in each subgroup were randomly selected from the database. The results shown in Table 1 were obtained for each Tr/Te proportion and each region in the analysis by circles, and indicate that with the region of radius R1 and the shape measurements we obtained a sensitivity of 100%, a specificity of 100% and an accuracy of 100% for the Tr/Te proportions of 60/40 and 70/30. The discussed region is considered a central region in the morphology of the nodule. In this region, although the nodules, in their interior, show similar values of diversity,
a calcification for benign nodules or a necrosis for malignant ones, the union of the texture measurement with the geometry measurements significantly improves the sensitivity values. These results make the methodology promising for the diagnosis of benign and malignant lung nodules.

Table 1. Results found for each Tr/Te group and each region in the analysis by circles

        Tr/Te = 50/50          Tr/Te = 60/40          Tr/Te = 70/30          Tr/Te = 80/20
        Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)
R1      75     96.67  85.83    100    100    100      100    100    100      50     100    75
R2      50     66.67  58.33    75     70     72.50    100    76.67  88.33    50     80     65
R3      0      100    50       75     56.67  65.83    100    56.67  78.33    100    73.33  86.67
R4      75     50     62.50    75     66.67  70.83    33.33  93.33  63.33    100    73.33  86.67
R5      100    43.33  71.67    75     30     52.5     100    66.67  83.33    100    56.67  78.33
R6      25     86.67  55.83    0      100    50       100    53.33  76.67    50     100    75
The results shown in Table 2 were obtained for each Tr/Te proportion and each region in the analysis by rings, and indicate that in ring A1 the best values of sensitivity, specificity and accuracy were found: 100%, 80% and 90%, respectively, for the Tr/Te proportion of 80/20. The use of geometry helped to put this boundary region in evidence as a discriminant between malignant and benign nodules.

Table 2. Results found for each Tr/Te group and each region in the analysis by rings

        Tr/Te = 50/50          Tr/Te = 60/40          Tr/Te = 70/30          Tr/Te = 80/20
        Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)     Se(%)  Sp(%)  A(%)
A1      0      100    50       50     70     60       100    3.33   51.67    100    80     90
A2      50     76.67  63.33    75     73.33  74.17    66.67  83.33  75       100    46.67  73.33
A3      25     90     57.5     25     90     57.5     100    63.33  81.67    100    63.33  81.67
A4      75     60     67.5     75     60     67       33.33  66.67  50       100    43.33  71.67
A5      50     63.33  56.67    75     63.33  69.17    66.67  40     53.33    50     83.33  66.67
Table 3 shows the averages of sensitivity, specificity and accuracy obtained for each Tr/Te group in the analysis by circles. The best result was found for the 70/30 group, with 88.89% sensitivity, 74.44% specificity and 81.67% accuracy. Table 4 displays the averages of sensitivity, specificity and accuracy obtained for each Tr/Te group in the analysis by rings. The best result was found for the 80/20 group, with 90% sensitivity, 63.33% specificity and 76.67% accuracy.
Table 3. Averages of sensitivity, specificity and accuracy obtained for each Tr/Te group in the analysis by circles

Tr/Te    Se (%)   Sp (%)   Ac (%)
50/50    54.17    73.89    64.03
60/40    66.67    70.56    68.61
70/30    88.89    74.44    81.67
80/20    75.00    80.56    77.78

Table 4. Averages of sensitivity, specificity and accuracy obtained for each Tr/Te group in the analysis by rings

Tr/Te    Se (%)   Sp (%)   Ac (%)
50/50    40.00    78.00    59.00
60/40    60.00    71.33    65.67
70/30    73.33    51.33    62.33
80/20    90.00    63.33    76.67
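A quick check, under the assumption that Tables 3 and 4 are simple means of the per-region rows in Tables 1 and 2, illustrated here for the 70/30 column of Table 1:

```python
# Reproduce one row of Table 3 as the mean of the per-region values in Table 1 (Tr/Te = 70/30).
import numpy as np

se = [100, 100, 100, 33.33, 100, 100]          # R1..R6 sensitivity at 70/30
sp = [100, 76.67, 56.67, 93.33, 66.67, 53.33]  # specificity
ac = [100, 88.33, 78.33, 63.33, 83.33, 76.67]  # accuracy
print(np.mean(se), np.mean(sp), np.mean(ac))   # ~88.89, 74.44, 81.67 as in Table 3
```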
5 Conclusions
This paper presented the Simpson's Index as a texture measurement that represents the behavior pattern of single lung nodules. The classification was done by means of a One-Class SVM. The best results obtained showed an accuracy of 100%, a sensitivity of 100% and a specificity of 100% when the Simpson's Index was used combined with the geometry measurements. These results qualify the measurement as promising for characterizing nodules and helping doctors and specialists in the diagnosis stage, and demonstrate the efficacy of the methodology for the diagnosis issue. We emphasize that the experiments were performed on a CT database without iodized contrast, which stresses the potential of the methodology since it avoids allergic effects in patients, as well as the use of more sophisticated and expensive equipment.
Acknowledgments

We would like to thank the technical team of the Fernandes Figueira Institute and the Pedro Ernesto University Hospital for providing the images used in this research, and CAPES and CNPq for the financial support for this research.
References

1. NCI, N.C.I.: Lung cancer (2008), http://www.cancer.gov/cancertopics/types/lung
2. de Câncer, I.N.: Estimativa 2008: incidência de câncer no Brasil (2008), http://www.inca.gov.br/estimativa/2008
3. da Silva, E.C.: Diagnóstico de nódulo pulmonar assistido por computador usando técnicas de processamento de imagens, índice de Moran e coeficiente de Geary. Master's thesis, Universidade Federal do Maranhão, São Luís (2007)
4. Shah, S., McNitt-Gray, M., Rogers, S., Goldin, J., Suh, R., Sayre, J., Petkovska, I., Kim, H., Aberle, D.: Computer aided characterization of the solitary pulmonary nodule using volumetric and contrast enhancement features. Academic Radiology 12, 1310–1319 (2005)
5. Gimel'farb, A.E.B.G., El-Ghar, R.F.M.A.: Computer aided characterization of the solitary pulmonary nodule using volumetric and contrast enhancement features. Academic Radiology 12, 1310–1319 (2005)
6. Sousa, J.R.F.S., Silva, A.C., Paiva, A.C.: Lung structure classification using 3D geometric measurements and SVM. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 783–792. Springer, Heidelberg (2007)
7. Silva, A.C.: Algoritmos para Diagnóstico Assistido de Nódulos Pulmonares Solitários em Imagens de Tomografia Computadorizada. PhD thesis, Pontifícia Universidade Católica do Rio de Janeiro - PUC-Rio (2004)
8. Liang, T.K., Toshiyuki, T., Nakamura, H., Ishizaka, A.: Automatic extraction and diagnosis of lung emphysema from lung CT images using artificial neural network. In: SICE-ICASE International Conference, pp. 2306–2311 (2006)
9. Petkovska, I., Shah, S.K., McNitt-Gray, M.F., Goldin, J.G., Brown, M.S., Kim, H.J., Brown, K., Aberle, D.R.: Pulmonary nodule characterization: A comparison of conventional with quantitative and visual semi-quantitative analyses using contrast enhancement maps. European Journal of Radiology 59, 244–252 (2006)
10. Silva, E.C., Silva, A.C., Paiva, A.C., Nunes, R.A.: Diagnosis of lung nodule using Moran's index and Geary's coefficient in computerized tomography images. Pattern Analysis & Applications 11, 89–99 (2007)
11. Silva, E.C., Silva, A.C., Paiva, A.C., Nunes, R.A., Gattass, M.: Diagnosis of solitary lung nodules using the local form of Ripley's K function applied to three-dimensional CT data. Computer Methods and Programs in Biomedicine 90, 230–239 (2008)
12. NEMA, N.E.M.A.: Digital imaging and communications in medicine (DICOM) (2004), http://www.nema.org/stds/2007-DICOM-FULLSET.cfm
13. Wahidi, M.M., Govert, J.A., Goudar, R.K., Gould, M.K., McCrory, D.C.: Diagnosis and management of lung cancer: ACCP evidence-based clinical practice guidelines (2nd edn.). Chest 132(3) (Suppl.), 94S–107S (2007)
14. Nakamura, K., Yoshida, H., Engelmann, R., MacMahon, H., Katsuragawa, S., Ishida, T., Ashizawa, K., Doi, K.: Computerized analysis of the likelihood of malignancy in solitary pulmonary nodules with use of artificial neural networks. Radiology 214, 823–830 (2000)
15. Silva, A.C., Carvalho, P.C.P.: Sistema de análise de nódulo pulmonar. In: II Workshop de Informática Aplicada à Saúde, Universidade de Itajaí, Itajaí (2002), http://www.cbcomp.univali.br/pdf/2002/wsp035.pdf
16. Nikolaidis, N., Pitas, I.: 3-D Image Processing Algorithms. John Wiley, New York (2001)
17. Simpson, E.H.: Measurement of diversity. Nature 163, 688 (1949)
18. Hill, M.O.: Diversity and evenness: a unifying notation and its consequences. Ecology 54, 427–432 (1973)
19. Ahumada, J., Ayres, D.: Curva de ranqueamento de espécies, índice de diversidade - Simpson (2007), http://wiki.teamnetwork.org/wiki/pages/viewpage.action?pageId=9638
20. Ricklefs, R.E.: Estrutura da comunidade, 3rd edn., ch. 22, pp. 344–346. Guanabara Koogan, Rio de Janeiro (1997)
21. Lyons, D.J., Dunworth, P.M., Tilbury, D.W.: Simpson's diversity index (2008), http://www.countrysideinfo.co.uk/simpsons.htm
22. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer Academic Publishers, Dordrecht (1998)
23. Zhuang, L., Dai, H.: Parameter optimization of kernel-based one-class classifier on imbalance learning. Journal of Computers 1 (2006)
24. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2002)
25. Manevitz, L., Yousef, M.: One-class SVMs for document classification. Journal of Machine Learning Research 2, 139–154 (2001)
26. Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R.: Estimating the support of a high-dimensional distribution. Neural Computation 13, 1443–1471 (2001)
27. Tran, Q., Zhang, Q., Li, X.: Evolving training model method for one-class SVM. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 3 (2003)
28. Chang, C., Lin, C.: LIBSVM: A Library for Support Vector Machines, 80, 604–611 (2001), http://www.csie.ntu.edu.tw/cjlin/libsvm
Author Index
Abdala, Daniel Duarte 361
Adda, Mo 148
Antunes, Cláudia 266
Appice, Annalisa 427
Archetti, F. 594
Arosio, G. 594
Awwad Shiekh Hasan, Bashar 96
Barakat, Nahla 311
Barinova, Olga 178
Benabdeslem, Khalid 322, 413
Bi, Yaxin 569
Blanzieri, Enrico 295
Borrajo, Daniel 760
Bouchon-Meunier, Bernadette 704
Bouthinon, Dominique 465
Bradley, Andrew P. 311
Bramer, Max 148
Braz Junior, Geraldo 810
Broekmans, Frank J. 787
Bunke, Horst 617
Cakaloglu, Serhat 505
Calliess, Jan-P. 494
Camacho, Rui 802
Carroll, Douglas 280
Caruso, Costantina 427
Ceci, Michelangelo 427
Celepcikay, Oner Ulvi 719
Changuel, Sahar 704
Chen, Jong-Chen 603
Czarnowski, Ireneusz 347
Dashevskiy, Mikhail 775
da Silva, Cleriston Araujo 810
de Groote, Arend 787
de Paiva, Anselmo Cardoso 540, 810
Derelioğlu, Gülnur 749
de Sousa, Jorge Freire 191
Duangsoithong, Rakkrit 206
Dunham, Margaret H. 440
Eick, Christoph F. 719
Eijkemans, Marinus J.C. 787
Elghazel, Haytham 322
Fauser, Bart C.J.M. 787
Feelders, Ad 787
Fernández, Fernando 760
Fernández, Susana 760
Fernandes, Carlos 802
Fersini, E. 594
France, Stephen 280
Fu, Huirong 53
Fu, Yongjian 53
Fumarola, Fabio 427
Gad, Walaa K. 663
Gan, John Q. 96
Gao, Yunjun 236
Giacinto, Giorgio 163
Gonçalves, Teresa 632
Gürgen, Fikret 749
Habich, Dirk 221
Hahmann, Martin 221
Haraguchi, Makoto 479
Höppner, F. 385
Hu, Xuegang 236
Imiya, Atsushi 372
Isaksson, Charlie 440
Jędrzejowicz, Piotr 347
Jiang, Xiaoyi 361
Jing, Hongfang 647
Jorge, Alipio Mario 191
Kamel, Mohamed S. 663
Kehlenbeck, Alexander 398, 689
Kobayashi, Takao 137
Kompatsiaris, Ioannis 525
Kononenko, Igor 678
Kopylov, A. 16
Kovalev, Vassili 554
Kurasova, Olga 29
Labroche, Nicolas 704
Lee, Wei-Chung 603
Lehner, Wolfgang 221
Li, Miaomiao 82
Li, Peipei 236
Liang, Qianhui 236
Lin, Kuo-Ming 603
Liu, Qingzhong 582
Liu, Xinwang 82
Loglisci, Corrado 251, 427
Luo, Zhiyuan 775
Malerba, Donato 251, 427
Malucelli, Andreia 802
Manzano, David 760
Mejía-Lavalle, Manuel 454
Mendes, Ana Cristina 266
Mendes-Moreira, João 191
Messina, E. 594
Moed, M. 123
Molytė, Alma 29
Moreland, Katherine 44, 516
Mottl, V. 16
Müller, Sascha 337
Müller-Schloer, Christian 337
Netto, Stelmo Magalhães Barros 810
Nikolopoulos, Spiros 525
Nikovski, Daniel 734
Nunes, André Pereira 540
Nunes, Rodolfo Acatauassú 810
Okay, Nesrin 749
Okubo, Yoshiaki 479
Oliveira, Eugénio 398, 689
Ordonez, Carlos 719
Papadopoulos, Georgios Th. 525
Patras, Ioannis 525
Prus, Aliaksandr 554
Qiao, Mengyu 582
Quaresma, Paulo 632
Qureshi, Taimur 68
Rahman, Md. Anishur 802
Ramachandran, Ganesan 734
Reinaldo, Francisco 802
Renooij, Silja 787
Riesen, Kaspar 617
Roli, Fabio 163
Rosenthal, Frank 221
Sakai, Tomoya 372
Sánchez Vivar, Atlántida 454
Sarmento, Luís 398, 689
Segata, Nicola 295
Seredin, O. 16
Shen, Xuhui 569
Shimizu, Ikuko 137
Silva, Aristófanes Corrêa 540, 810
Šikonja, Marko Robnik 678
Smaoui, Tarek 337
Smirnov, E.N. 123
Snoussi, Jihene 413
Soares, Carlos 191
Soldano, Henry 465
Stahl, Frederic 148
Štrumbelj, Erik 678
Sung, Andrew H. 582
Tronci, Roberto 163
Truemper, Klaus 1, 44, 516
Ungar, Lyle 398, 689
Valente, Carmine 427
van der Gaag, Linda C. 787
Vankevich, Pavel 554
Ventos, Véronique 465
Vetrov, Dmitry 178
Volk, Peter Benjamin 221
Wang, Bin 647
Wang, Feng 107
Windeatt, Terry 206
Wu, Shou-zhe 603
Xiong, Pan 569
Xu, Yan 647
Yang, Yahui 647
Yin, Jianping 82
Yüksek, Kemal 505
Zhan, Yubin 82
Zhang, Guomin 82
Zhang, Hongbin 107
Zhu, En 82
Zhu, Ye 53
Zighed, Djamel A. 68